Post records to EDG - Githubissues

torrin47 commented 6 years ago

From @torrin47 on August 31, 2018 23:5

Once edits are complete, authenticated users will then be able to post updates directly back to the EDG using the GeoPortal Server REST API (https://github.com/Esri/geoportal-server/wiki/How-to-Publish-Resources and https://github.com/Esri/geoportal-server/wiki/How-to-Manage-and-Edit-Resources#Edit_metadata_in_an_external_editor).

_Copied from original issue: USEPA/EPA_Environmental_DatasetGateway#63

aergul commented 5 years ago

@torrin47 I think we will need to improve the authentication plumbing in EDG first before implementing this requirement as well as #27. When using the REST API to edit a document, EDG is prompting with basic authentication for a different set of credentials instead of accepting geoplatform token or credentials:

I suspect you might have additional insight on this?

torrin47 commented 5 years ago

I wish I had additional insight, but it's pretty much a black box. That said, when I first log onto the EDG, I am then able to use that URL syntax to get and put records, with no login prompt. I can only assume some kind of session cookie is being set to allow me to perform these edits. I've just made your account an EDG admin, so you should be able to log on, upload test records, and try using the API to edit records - maybe that will make a difference? Jessica is an EDG Publisher but not an admin, for comparison purposes.

aergul commented 5 years ago

I was able to access the record as admin. Reviewing docs ( https://github.com/Esri/geoportal-server/wiki/How-to-Manage-and-Edit-Resources#how-to-manage-resources ) the behavior so far appears as described.

aergul commented 5 years ago

To post records to EDG, we need to have EDG accept and check the token provided by the editor. The PortalIdentityAdapter has a facility to validate tokens. However, it makes the implicit assumption that the token is for the app running it, in this case EDG. So far we've had the editor as its own app but we may want to simply have it be part of EDG and use its app id. That way, they will be unified authentication-wise, logging into one will have the user logged into both (I'm simplifying a bit as the two currently store the token differently). If this sounds like a plan @torrin47, I will provide the whitelisted referrers and redirect URIs to enter into the ArcGIS Online account that owns the EDG app id.

If there is good reason to keep them separate, we could probably work out a mechanism for that as well.

torrin47 commented 5 years ago

Makes perfect sense if it can work that way, nice find. The requested URLs have been whitelisted, the appID is s0brwjWwE7aFPPbF. Do you need the appSecret as well?

aergul commented 5 years ago

Perfect. I should not need the app secret.

aergul commented 5 years ago

Can confirm that login against EDG works.

It appears EDG stores token inside the Java session - which is not readily accessible from the editor. Access to it is not necessary for our purposes, would enable SSO-like behavior between the EDG and the editor. Maybe a future ticket.

Proceeding to build a REST API endpoint to accept editor provided token and validate it before taking further action.

aergul commented 5 years ago

The editor can now handshake with a new EDG endpoint that returns a JSON of user data after validating the token that the editor presented. [Note this will require a separate PR on EDG to be applied to EDG prod to work properly.]

At this time, we don't do much with this data except presenting a different icon (and tooltip text) depending on the user being an EDG admin, publisher, etc.

This is a good time to think what other changes we make to the UI if the user is able to publish records @torrin47 @jzichichi . For example:

Does the submit button save to the EDG instead of emailing EDG staff?
What if EDG is temporarily unavailable?
What if there is some kind of conflict? Say, another record with a different GUID but identical title.
What if the user is simply a gptRegisteredUser?
Other?

aergul commented 5 years ago

Also, since geoportal can only store records in XML format, we will need conversion between JSON and XML/RDF in both directions. Is there a converter you are aware of/prefer @torrin47 ?

torrin47 commented 5 years ago

Great news, thanks. Some thoughts:

Does the submit button save to the EDG instead of emailing EDG staff?

That was the idea - direct editing of records - though it gets complicated if someone wishes to edit a legacy record that was harvested to the EDG. If that record is going to switch to direct-editing, it needs to be removed from the harvest source. I guess the most user-friendly approach would be to handle the save/email split on behalf of the user. If a record can be directly edited with no conflicts, submit saves to the EDG. If it was a harvested resource, it gets emailed to EDG staff so we can manually remove that record from the harvest source to avoid conflicts. The user would get a message that the record will be posted soon.

What if EDG is temporarily unavailable?

In this case it would be reasonable to send a copy of the record to EDG staff along with the error message from the EDG. EDG staff would then take responsibility for addressing the error and posting the record.

What if there is some kind of conflict? Say, another record with a different GUID but identical title.

I really like the idea of an audit check on both GUID and title - alert the user to the existence of a potential conflict, provide a link to the existing record, and request confirmation from the user that they're sure they want to proceed.

What if the user is simply a gptRegisteredUser?

Registered users don't have any publishing privileges - in theory they're allowed to view restricted records, but we don't have any restricted records in the EDG anymore, so it's a meaningless role from the EDG perspective. I guess in theory anyone with an EPA LAN ID should be able to authenticate to the editor and author a record, but if they're not a publisher or admin the submit button would only email EDG staff, it would not attempt to post to EDG.

Also, since geoportal can only store records in XML format, we will need conversion between JSON and XML/RDF in both directions. Is there a converter you are aware of/prefer @torrin47 ?

GeoPortal already has a built-in converter it uses for harvesting DCAT JSON - I believe it's this class: https://github.com/Esri/geoportal-server/blob/d641163447dc0b71b7f06db3d0c5efe135ba0a5c/geoportal/src/com/esri/gpt/framework/dcat/DcatParser.java I'm not quite sure how/whether this conversion is exposed via the API, but hopefully we'd be able to leverage it and wouldn't have to reinvent the wheel. That said, Esri's conversion code falls short on several fronts - we have an enhancement request (https://github.com/Esri/geoportal-server/issues/257) open that Esri doesn't seem particularly motivated to address. We didn't create a ticket in our own backlog because we believed Esri should fix this in their code, but two years later maybe we should just fix it ourselves?

torrin47 commented 5 years ago

No native API for publishing JSON (only accepts XML). When someone wants to edit an existing record, the edits will need to be applied both to the source DCAT JSON (for harvesting to data.gov) and the EDG, and since the former can only be done manually, there's not much benefit to be gained from posting directly to EDG. So... internal EPA users will be able to select existing records for editing, but all edits will be submitted to EDG team for manual posting... And logged in EPA users won't be prompted for a sponsor, EDG team will review and accept records from any EPA user.

torrin47 commented 5 years ago

Following this discussion, intramural submissions will be parallel to extramural submissions - submitted for email via the same endpoint, with a slightly different email template per #98 . Will automatically include full name and email address derived from ArcGIS Online Login. Should also indicate location of edited file or proposed group ownership for new record.

aergul commented 5 years ago

@torrin47 can't recall what we decided would be in the submitted file: 1- Full file with all records in it - edited or not. This may be problematic with very large repos that may not fit in local storage. 2- Valid DCAT record (headers and all) with only the edited dataset record in it. 3- Just the edited dataset record. This would not be a valid file on its own but arguably easiest to paste into the repository.

Also how would the user indicate proposed group ownership?

torrin47 commented 5 years ago

@aergul the decision was option 2. At some point I think we'll want to explore option 1 when we get to batch editing, but for now it seems most straightforward to stick with single records. And although 3 might be slightly easier to cut and paste, we've already implemented 2 for extramural researchers and I think there's value in staying consistent.

For editing existing records, group ownership would be implicit in the file selected for editing - though we need to iron out how that information gets included in the submission (ticket #98). Less clear is how we'd solicit group information for brand new records. I guess we'd need some field parallel to the "sponsor" field that's only shown to internal/logged in users that lists all of the known stewardship groups and includes a free-text option if they don't see their organization listed and want to request a new json repository.

aergul commented 5 years ago

Is it safe to say we will apply #2 to locally edited repos as well? In other words, even if the loaded file had multiple datasets in it, the file that gets downloaded (via save) or gets submitted to EDG will have only the edited dataset.

torrin47 commented 5 years ago

Yes, that seems logical and consistent. We'll handle multiple-record scenarios down the road.

aergul commented 5 years ago

Validation now ignores elements disabled based on user's login status. Document submitted with user's name and email.

torrin47 commented 5 years ago

Nice, I need to get working on #98 so we can be sure that loop is closed!

jzichichi commented 5 years ago

@aergul @torrin47 - updates to validation appeared to work great here. I am assuming the submission portion is still in progress before testing as per @torrin47 comment above so I will wait until the loop is closed to test and provide feedback.

torrin47 commented 5 years ago

Ok, finally starting on #98. Not a heavy lift, but since we're revising the submission API, I want to make sure we're being consistent (and I'm not sure we were initially).

It looks like the handler is getting 3 values from the HTTPS POST form parameters:

        metadata = form.getvalue('metadata')
        publisher = form.getvalue('publisher')
        sponsorEmail = form.getvalue('sponsor')

and then several other parameters from the metadata JSON:

        agreementType = dcat.pop('epa_agreement_type')
        agreementNumber = dcat.pop('epa_agreement_no')
        dcat.pop('epa_contact')

I don't remember the exact logic here, I think it had to do with saving/restoring the agreement type/number and sponsor in their draft record even though we strip those elements before including in the EDG. It's not a big deal either way, but the OCD in me wants to keep things separate: only standard DCAT elements in the 'metadata' json, and handle any others as POST parameters. Is this overkill? If we go this route, I guess we'd not save agreement_type, epa_agreement_no, and epa_contact in the metadata, a user would have to re-enter that information if they left the editor and returned to it? Too harsh?

For internal users, we'd presumably have only 2 non-DCAT elements - some guarantor of logged in status (username might suffice?) and the target repository. For edits to existing records, this would simply be the URL that was used to load the record (i.e. https://edg.epa.gov/data/Public/OEI/metadata/OEI-OIC.json) For new records, if we request their top level AAship as mentioned above, I guess free-text would be fine - we're not doing anything automagically with the URL, it's just a helpful indicator of our team's next steps.

So how does that affect the submit handler? I think it'd mean we'd have 1 mandatory element: metadata: (full standards-based dcat.json) and 5 optional elements: (sponsorEmail, agreementType, agreementNumber) or (epaUserName, repoURL).

Thoughts?

aergul commented 5 years ago

Now posting using the new spec: (sponsorEmail, agreementType, agreementNumber) or (epaUserName, repoURL).

Haven't done anything for new EPA record AAship just yet... We could make it a dropdown with the same options from the master list used in loading from EDG? If that's not comprehensive, we could allow the user to enter one?

torrin47 commented 5 years ago

Awesome, submitHandler is updated and it all seems to be working smoothly. I think your suggestion is right - we can discuss more tomorrow.

torrin47 commented 5 years ago

Hadn't appropriately handled input elements. fixed now.

USEPA / EPA_Non-geo_Metadata_Editor

Post records to EDG #26