Riverscapes / RaveAddIn

RAVE AddIn for ArcGIS

http://rave.riverscapes.xyz/

GNU General Public License v3.0

Project References #118

Closed joewheaton closed 3 years ago

joewheaton commented 3 years ago

The Problem

There are many examples of projects whose inputs derive from the outputs or intermediates of other projects. For example, a GUT project and a GCD project both rely on DEMs as their inputs, but many times those DEMs come from a CHaMP Topo Survey project. Or, we drive Confinement, River Styles, RCAT, and BRAT projects all off a valley bottom polygon that can (but does not have to) come from a VBET output. I don't want to duplicate the entire upstream project. I just want the project making the reference to have the layer(s) it needs to operate, plus the added context of which other Riverscapes Project each layer came from.

The Feature Request

Change

Upgrade the Riverscapes Project file to accept a new node of project references. This can refer only to other projects that exist in the warehouse. It still has the geometry or input file it needs locally (i.e. on disk), but brings across the Project Name, Type, Metadata tags, and stable URL to warehouse location.
Upgrade the RAVE add-in to recognize project references, and add a right-click context menu that:
- Shows the project reference and links to the warehouse
- Allows (eventually) opening the referenced project in RAVE by downloading it.

joewheaton commented 3 years ago

@philipbaileynar and I discussed this at length. @philipbaileynar is worried this is a much bigger request with fragility and tech debt. I think it is much simpler. The truth is somewhere in between. @philipbaileynar suggested I show what I mean by the XML project file.

Example - What we do Now:

I opened up a BRAT project (HUC 10190002) and focused on this concept for the VBET:

Scroll down to realizations

    <Realizations>
        <BRAT id="BRAT1" dateCreated="2021-01-31T19:13:03.747399" guid="5796d944-63f8-11eb-bc14-0a58a9feac02" productVersion="4.2.1">
            <Name>BRAT for HUC 10190002</Name>
            <Inputs>
....
                <Geopackage guid="503fe52e-a7f7-41a0-9114-15e6fba096cf" id="INPUTS">
                    <Name>Confinement</Name>
                    <Path>inputs/inputs.gpkg</Path>
                    <Layers>

Then down to the VBET layer in that inputs.gpkg. This is what it looks like now: a vector with two tags (<Name> and <Path>) and two attributes (a guid and an id).

                        <Vector guid="50495cab-a127-401a-9161-2d34b331741d" id="VALLEY_BOTTOM">
                            <Name>Valley Bottom</Name>
                            <Path>valley_bottom</Path>
                        </Vector>

Suggested Additions to Project XML

Now I am always a little shaky on whether to do things as new tags or attributes. What I want to know is:

- If (not required, only if it is the case) that Valley Bottom polygon was not a random shape but came from one of our Riverscapes Projects, which project it was (i.e. in this case that it came from https://data.riverscapes.xyz/#/Anabranch/fb02c25c-4bf8-4f81-bfe3-ec8553c11b8f)
- Which layer it was in that project (I think, but I might be wrong, that this is the path)

But what if we made a project reference something like the following:

Functionality I want

I just want to know what things came from a riverscapes project we also have in the warehouse. I want to be able to chase the bread crumbs back. There are three potential ways I imagine the user to do this (in order I care about):

Right Click on Layer in RAVE Project Explorer and see new 'Layer Metadata' command (accessed therein)
Inside Riverscape Project Reports under that layer's node.
From within the navigable structure of project in warehouse

I'm imagining the following:

You could make it prettier or fancier. All I want is ability to click on URL and go to warehouse. Here's a video

My Assumptions

Just a URL - I am assuming that the main way to make this reference is just the warehouse project URL.
Their breadcrumb problem - We don't do anything fancy on downloading project for them. We just point them to where they can get it (i.e. warehouse).
Duplication - I do not want any change to the core functionality of any of the tools or models in which we place these project references. RCAT, BRAT and Confinement all still need a valley bottom polygon to run, so that layer should be present in the project itself (even if that means duplicating it from some other place). I am simply asking for optional metadata to specify which project we got it from. These are optional tags or attributes for any layer in any project.
Reference Integrity - I do not think we should make any effort to test or maintain the integrity of the project reference. This is the problem of the person, user, developer, curator, to make sure they provide correct URL. If the project changes or disappears in warehouse, big deal. People are used to occasional broken links.

MattReimer commented 3 years ago

Ok, here's an idea we can build on. Even as I type this I'm not sure how I feel about it but it's a start.

Let's take a real example: a typical VBET project with a slope raster that needs referencing

<VBET id="VBET" dateCreated="2021-01-30T01:18:15.127300" guid="06f696ba-6299-11eb-9b59-0a58a9feac02" productVersion="0.3.2">
  <Name>VBET for HUC 16050302</Name>
  <Inputs>
    <Raster guid="4c94ec16-f607-4de3-b793-26dc4ccc9fd9" id="SLOPE_RASTER">
      <Name>Slope Raster</Name>
      <Path>inputs/slope.tif</Path>
      <MetaData>
        <Meta name="srcType">RSContext</Meta>
        <Meta name="srcGUID">fea64c39-f172-42fa-8dc6-12e22409f473</Meta>
        <Meta name="srcWarehouseGUID">953c4550-585b-493c-ae5b-5108f263ed67</Meta>
      </MetaData>
    </Raster>

You can see we're introducing some <Meta> keys like srcType that have specific meanings. If Rave finds these it will know what to do with them. If they aren't there then nothing happens.

Let's look at them in detail:

- srcType: the machine name of the project type this was derived from: RSContext, VBET, BRAT etc.
- srcWarehouseGUID: this is analogous to the URL you were asking for, @joewheaton. It's the guid at the end of the URL on the data warehouse website.
- srcGUID: the guid from the top of the actual RSContext project.rs.xml file. We include this mainly for debugging purposes, because srcWarehouseGUID on its own does not uniquely identify a project in time.

We can add other tags too like the version of RSContext that was used, the date it was run etc.

My preference would be to avoid storing the whole URL in the project file. Instead, Rave would build it when it loads the project by concatenating https://data.riverscapes.xyz/#/Anabranch/ and the value of srcWarehouseGUID.
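
As a rough illustration of that concatenation, here is a minimal Python sketch. It is not Rave's implementation (Rave is an ArcGIS add-in, not Python). The function name `source_project_url` and the constant `WAREHOUSE_BASE_URL` are made up for this example; the XML is the slope raster node from the VBET example above.

```python
import xml.etree.ElementTree as ET

# Base URL quoted in the comment above. The constant name is illustrative only.
WAREHOUSE_BASE_URL = "https://data.riverscapes.xyz/#/Anabranch/"

# Slope raster node copied from the VBET example above.
SAMPLE = """
<Raster guid="4c94ec16-f607-4de3-b793-26dc4ccc9fd9" id="SLOPE_RASTER">
  <Name>Slope Raster</Name>
  <Path>inputs/slope.tif</Path>
  <MetaData>
    <Meta name="srcType">RSContext</Meta>
    <Meta name="srcGUID">fea64c39-f172-42fa-8dc6-12e22409f473</Meta>
    <Meta name="srcWarehouseGUID">953c4550-585b-493c-ae5b-5108f263ed67</Meta>
  </MetaData>
</Raster>
"""


def source_project_url(layer_node):
    """Return the warehouse URL for a layer's source project.

    Returns None when the layer carries no srcWarehouseGUID, which matches
    the "if they aren't there then nothing happens" behaviour described above.
    """
    for meta in layer_node.findall("./MetaData/Meta"):
        if meta.get("name") == "srcWarehouseGUID" and meta.text:
            return WAREHOUSE_BASE_URL + meta.text.strip()
    return None


if __name__ == "__main__":
    print(source_project_url(ET.fromstring(SAMPLE.strip())))
```

Because only the GUID is stored, a change to the warehouse URL scheme would need a change in one place in the viewer rather than in every project file.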

Potential Implementation

1. Python `riverscapes-tools`

This isn't too bad and it relates to a ticket I had started. We don't want any of this meta lookup stuff in VBET. We want a meta tool for metadata: something that can understand the progression of these tools and fill in metadata intelligently when Cybercastor runs.

2. Cybercastor

Nothing to do here. Cybercastor is stupid and it will remain stupid, running whatever we tell it to and not caring a bit.

3. Rave

All Rave needs is the ability to recognize specific meta keys and render things like URLs intelligently.

Potential caveats

Cybercastor replaces projects. This means the srcWarehouseGUID is kept but the srcGUID changes.
There is a chance that a project's source project no longer exists in the system. Is finding a bunch of broken links a deal breaker? How do we mitigate that?

Resources, existing tickets

Improved Metadata for downstream projects #80: The ticket I wrote earlier trying to address my frustrations with data versioning.
Meta Data Type #10 This is simply adding a type attribute onto the <Meta> xml tag to help Rave figure out what to do with it.
Cybercastor Runner Script (in riverscapes-tools)

joewheaton commented 3 years ago

I really like this @MattReimer. The example makes enough sense that even I could do it. All the implementation stuff makes sense to me too.

Re the caveats: in the short and long term I don't care if we reference some broken links here and there, especially if they are from user-contributed projects. In the long term, the Cybercastor derivatives we sell should reference stable projects. Once the suite of projects built from production-grade tools settles into what we could consider data version >=1.x instead of 0.x beta, we should figure out a way to not replace projects.

Thanks for the great suggestions. @philipbaileynar what worries you?

philipbaileynar commented 3 years ago

I really like @MattReimer metadata tags, their names and how they are used. The challenge will be how on earth we get hold of the GUIDs in question. The VBET tool knows nothing about riverscapes context. It doesn't know the DEM it was passed as an argument is part of a riverscapes project. It doesn't know the XML node or GUID or anything. We have decoupled all our tools from upstream project XML.

Option 1 - Separate "Enricher" Software

@MattReimer you said:

We want a meta tool for meta data.

Are you suggesting a completely separate piece of code that runs after a riverscapes tool has finished? Its sole purpose would be to enrich the output project XML with the metadata tags you propose, taking the information from the input XML(s). My term for this is "project enricher". Here's the command line for a "BRAT project enricher".

python enricher <rs_context_project_xml_path> <vbet_project_xml_path> <brat_project_xml_path>

The enricher would know which datasets are used from the first two projects and take their layer metadata and inject into the latter project xml.

- Note that all our tools use different layer combinations from upstream tools and therefore need their own enricher logic, i.e. an enricher for BRAT, an enricher for VBET, an enricher for RVD etc.
- How does an enricher know the specific layers used? What if we run BRAT with VBET50 one day and VBET90 the next? How will it know which layer's metadata to use?
- Who will maintain the thousands of lines of logic that these enrichers accrue?
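
To make the injection step concrete, here is a minimal Python sketch of the part of an enricher that writes the tags. It deliberately sidesteps the hard question raised above: the caller has to say which layer to tag and supply the three values, which is exactly the knowledge an enricher would need per tool. The function name `inject_source_meta` is hypothetical, and the all-zero srcGUID in the usage snippet is a placeholder, not a real project guid.

```python
import xml.etree.ElementTree as ET


def inject_source_meta(project_root, layer_id, src_type, src_guid, src_warehouse_guid):
    """Add srcType / srcGUID / srcWarehouseGUID <Meta> tags to one layer.

    The layer is located by its id attribute. Returns True if the layer
    was found and updated, False otherwise.
    """
    layer = next((n for n in project_root.iter() if n.get("id") == layer_id), None)
    if layer is None:
        return False

    metadata = layer.find("MetaData")
    if metadata is None:
        metadata = ET.SubElement(layer, "MetaData")

    values = {
        "srcType": src_type,
        "srcGUID": src_guid,
        "srcWarehouseGUID": src_warehouse_guid,
    }
    for name, value in values.items():
        # Overwrite an existing key rather than adding a duplicate.
        meta = next((m for m in metadata.findall("Meta") if m.get("name") == name), None)
        if meta is None:
            meta = ET.SubElement(metadata, "Meta", {"name": name})
        meta.text = value
    return True


if __name__ == "__main__":
    # Valley bottom layer from the BRAT example earlier in this thread.
    brat = ET.fromstring(
        '<Layers><Vector guid="50495cab-a127-401a-9161-2d34b331741d" id="VALLEY_BOTTOM">'
        "<Name>Valley Bottom</Name><Path>valley_bottom</Path></Vector></Layers>"
    )
    inject_source_meta(
        brat,
        "VALLEY_BOTTOM",
        "VBET",
        "00000000-0000-0000-0000-000000000000",  # placeholder srcGUID
        "fb02c25c-4bf8-4f81-bfe3-ec8553c11b8f",  # guid from the warehouse URL quoted earlier
    )
    print(ET.tostring(brat, encoding="unicode"))
```

The writing side is small. The per-tool logic that decides which upstream layer maps to which downstream layer is where the maintenance cost would sit.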

Option 2 - Project XML files as arguments

We restructure all riverscapes tools to take upstream project XML paths as their arguments instead of the individual dataset paths, i.e. BRAT would take a riverscapes context and VBET project path instead of a DEM raster, flow line network and valley bottom etc.

This is weeks and weeks of unfunded work that will destabilize every tool. It also means that none of us can run the tools without first creating whatever upstream projects are needed as inputs. By contrast, today you can run VBET with any DEM; you don't need an entire riverscapes context project that contains a DEM.

Option 3 - Append Project XML Paths as arguments

Blending option 1 and 2... we could extend the command line arguments for all tools with optional arguments for each upstream project XML file. If the tool is supplied with these paths then it knows how to take each command line dataset argument and find its node in the upstream project XML (reverse path lookup), take the metadata for this node and inject it into the output project XML file.
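
The reverse path lookup could look roughly like the Python sketch below. It assumes that `<Path>` values are relative to the folder containing the project XML file, which is how the paths in the examples in this thread read (e.g. inputs/slope.tif) but is still an assumption. The function name `find_node_by_path` is hypothetical.

```python
import os
import xml.etree.ElementTree as ET


def find_node_by_path(project_xml_path, dataset_path):
    """Return the first node in an upstream project whose <Path> points at dataset_path.

    Assumes <Path> values are relative to the project XML's folder.
    Returns None when no node matches.
    """
    project_dir = os.path.dirname(os.path.abspath(project_xml_path))
    target = os.path.normcase(os.path.abspath(dataset_path))

    root = ET.parse(project_xml_path).getroot()
    for node in root.iter():
        path_el = node.find("Path")  # direct child only
        if path_el is None or not path_el.text:
            continue
        candidate = os.path.abspath(os.path.join(project_dir, path_el.text.strip()))
        if os.path.normcase(candidate) == target:
            return node
    return None
```

Once the node is found, its guid and MetaData can be copied into the output project. Note that this only matches file-level paths: a layer stored inside a GeoPackage (such as valley_bottom inside inputs/inputs.gpkg above) has a layer name as its `<Path>`, so it would need extra handling that this sketch does not attempt.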

Option 4 - MD5 Magic

What if there's some clever daemon running in the warehouse that is constantly checking MD5 hashes that uniquely identify each dataset? With some fancy logic it could find two projects sharing a dataset with the same MD5 hash and relate them to each other. This won't work, of course, now that we use GeoPackages and bury multiple datasets inside a single file.
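
For what the hashing idea amounts to at the file level, here is a minimal Python sketch. It is not an existing warehouse feature, and the function names and the input shape (a dict of project name to file paths) are made up for illustration.

```python
import hashlib
from collections import defaultdict


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_shared_datasets(files_by_project):
    """Map each MD5 hash that occurs in more than one project to those projects.

    files_by_project: dict of project name -> list of dataset file paths.
    """
    projects_by_hash = defaultdict(set)
    for project, paths in files_by_project.items():
        for path in paths:
            projects_by_hash[md5_of_file(path)].add(project)
    return {h: sorted(p) for h, p in projects_by_hash.items() if len(p) > 1}
```

This only relates projects that hold byte-identical files, which is why it falls over with GeoPackages: the same valley bottom layer stored in two different .gpkg files alongside other layers produces two different file hashes.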

Conclusions

Matt's tags capture the information we need.
All these options represent a lot of work.
All of these options introduce thousands of lines of code that will have bugs and need maintaining.
In the last 7 days we have regenerated pretty much all riverscapes projects in existence. Matt's srcWarehouseGUID will still be relevant, but essentially 100% of the srcGUIDs will be invalid and point to nothing.

MattReimer commented 3 years ago

I think the Cybercastor Runner Script (in riverscapes-tools) ticket actually handles all of this well.

The goal of that tool is to handle all the external context and we leave the tools alone to just do what they do.

https://github.com/Riverscapes/riverscapes-tools/issues/197

philipbaileynar commented 3 years ago

The cybercastor runner can inject sub tags inside individual layers within a project?

I get that it can take some command line meta and stuff it into the project node of the output project file.

I just need to be educated how this runner script is going to get the guids of the vbet50 layer from a vbet project and inject them into an output brat project.

philipbaileynar commented 3 years ago

@joewheaton has approved this work for the Feb release.

philipbaileynar commented 3 years ago

Here are the RAVE features that are funded:

RAVE will be enhanced to show these metadata “as is” by repurposing the project metadata form. The metadata will be shown raw, with no formatting whatsoever.
RAVE will be enhanced with a new right-click menu option for datasets that have an upstream project. The option opens a web browser at the URL for the relevant project.
a. Note users might not have access to the program in question, in which case they will be redirected to the warehouse home screen.
b. Note the linkage is only to the project and not the individual dataset!
c. Note that the upstream project might have been deleted and no longer exist, in which case the user will be redirected to the warehouse home screen.
d. Note that the upstream project might have been re-run in the interim, so the dataset in the latest version of the upstream project might be newer than the one in the project that the user is looking at in RAVE.
e. The right-click menu option in RAVE will be greyed out if there’s no upstream metadata.

philipbaileynar commented 3 years ago

Note this is related to #27

philipbaileynar commented 3 years ago

Version 2.1.0 now allows:

- Viewing all metadata stored inside any layer in a project.rs.xml, by reusing the project metadata form.
- If the layer metadata possesses the special tags that refer to a riverscapes project in a riverscapes data warehouse, the user can right-click on the layer and launch a web browser at the project details for the source project.
