Support `physical` sections in the editor

jagoldstein commented 6 years ago

When uploading data objects with a package and submitting, no physical section is created in the EML. Typically, this section would provide file size, checksum, and the online distribution URL.

Examples: https://test.arcticdata.io/#view/urn:uuid:d71babe1-3fe3-430c-913d-56eef00124b6 https://test.arcticdata.io/#view/urn:uuid:6de80a8a-f674-4b5d-b1ff-e9bbcea52634

laurenwalker commented 6 years ago

This is a feature we plan to support in future versions.

jeanetteclark commented 6 years ago

We spend a lot of time in the datateam making sure the physical sections are correct. This is definitely something we emphasize heavily in our training and daily work with the team. Moving forward with this version of the editor, are we expected to add these physical sections in using R? Or would we just accept that some datasets would not have them? Having to add them would obviate some of the efficiencies we are happily anticipating by switching to curating datasets using the editor as opposed to R. I welcome feedback from @mbjones @csjx and @amoeba here.

amoeba commented 6 years ago

From an operational perspective, the only reason to have physical sections was to support the old web form (Registry script) because it didn't understand what a Resource Map was. These sections are no longer used by the Editor. @laurenwalker does the PROV editor integration work off the physical section or just the Entity?

From the perspective of authoring best-practices EML, the physical section is a strong requirement, as it's the glue that makes any assertions about what data the EML is documenting hold up (URL, checksum).

From an Editor release perspective, this is not a dealbreaker because the Editor will still work fine without the physical sections (operational) AFAIK. But I feel like we would then want to go back over Editor-2.0-created Packages and align the physical sections with the Entities described in the Resource Maps. This is an increase in work for the Data Team if we choose to do this.

Weighing the cost of delaying the Editor's release against a possible increase in work for the Data Team, I'd go with not delaying the Editor release for this feature but making sure we have a solid plan to deal with the fallout.

jeanetteclark commented 6 years ago

Thanks for the clarification Bryce. The points you make about best practices EML are why we emphasize this section so much.

Jesse and I would definitely like to have a plan for this going forward, and part of the reason why I made that comment was to make sure that our path forward is clear. We could probably write a helper function that would automatically add in physical sections for all the objects in a data package and update the EML, and incorporate that as part of our workflow in curating packages that come in.
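
As a rough illustration of what such a helper might generate for one object, here is a minimal sketch. It is written in Python rather than R purely for illustration; `build_physical` and its parameters are hypothetical names, the SHA-256 checksum and `unit="bytes"` are assumptions, and the output has not been validated against the EML schema.

```python
import hashlib
import xml.etree.ElementTree as ET


def build_physical(object_name, data, url, format_name="text/csv"):
    """Build an EML-style <physical> element for one data object.

    Hypothetical helper: covers the file name, size, checksum, and
    online distribution URL discussed in this issue.
    """
    physical = ET.Element("physical")
    ET.SubElement(physical, "objectName").text = object_name

    # File size in bytes (the unit label is an assumption).
    size = ET.SubElement(physical, "size", unit="bytes")
    size.text = str(len(data))

    # Checksum of the object's bytes (algorithm choice is an assumption).
    auth = ET.SubElement(physical, "authentication", method="SHA-256")
    auth.text = hashlib.sha256(data).hexdigest()

    # Minimal format description.
    data_format = ET.SubElement(physical, "dataFormat")
    external = ET.SubElement(data_format, "externallyDefinedFormat")
    ET.SubElement(external, "formatName").text = format_name

    # Online distribution URL for the object.
    distribution = ET.SubElement(physical, "distribution")
    online = ET.SubElement(distribution, "online")
    ET.SubElement(online, "url", function="download").text = url
    return physical


if __name__ == "__main__":
    element = build_physical(
        "example.csv", b"x,y\n1,2\n", "https://example.org/object/some-pid"
    )
    print(ET.tostring(element, encoding="unicode"))
```

A real helper would loop over every object in the package, attach each element to the matching entity, and then update the EML.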

laurenwalker commented 6 years ago

@amoeba - The Prov editor can work without the physical section, as long as there is an entity section in the EML.

gothub commented 6 years ago

@lauren what are the EML elements, in order of precedence, that are evaluated in order to link a package member (in the table listing) with an entity section in the metadata view, i.e. to enable the more info link?

laurenwalker commented 6 years ago

@gothub - This is the code that matches up the entity section with the DataONE object:

https://github.com/NCEAS/metacatui/blob/master/src/js/views/MetadataView.js#L1378-L1476

gothub commented 6 years ago

Ok, thanks. I'm thinking that it would be useful for the datateam to have a list of EML elements to include or check so that this connection between the entity section and the D1 object can be ensured/controlled (otherwise prov may not display). From my first-pass reading of the code, it looks like the order of precedence is:

1. 'online distribution' element in physical section
2. 'anchor id' matches object id (I'll have to dig further to fully understand this one)
3. D1 object filename matches entity id

Is this accurate?
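
To make the question concrete, here is a minimal sketch of that precedence as I currently read it. It is not taken from MetadataView.js: the dictionary keys and the function name `match_entity` are hypothetical, and the assumption that the online distribution URL ends with the object id is mine.

```python
def match_entity(obj, entities):
    """Return the first entity that matches obj, trying rules in order.

    obj      -- dict with "id" and "file_name" (hypothetical shape)
    entities -- list of dicts that may carry "online_url", "anchor_id",
                and "entity_id" (hypothetical shape)
    """
    rules = [
        # 1. Online distribution URL in the physical section
        #    (assumed here to end with the object id).
        lambda e: (e.get("online_url") or "").endswith(obj["id"]),
        # 2. Anchor id matches the object id.
        lambda e: e.get("anchor_id") == obj["id"],
        # 3. Object file name matches the entity id.
        lambda e: e.get("entity_id") == obj["file_name"],
    ]
    # A higher-precedence rule wins over a lower one, regardless of
    # the order in which the entities are listed.
    for rule in rules:
        for entity in entities:
            if rule(entity):
                return entity
    return None
```

If this reading is right, an entity with a matching physical section always wins, and the file name is only a last resort.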

laurenwalker commented 6 years ago

With the new editor, this shouldn't be an issue since it creates the entity sections automatically. @gothub - Does the data team regularly create EML manually? I thought they primarily used the old registry or the new editor.

gothub commented 6 years ago

Yep, good point - maybe this is a non-issue.

csjx commented 6 years ago

I actually think that while the new editor doesn't currently produce physical sections, it ought to and will in the future. While objectName and authentication are useful, those are now more universally supported in the SystemMetadata. That said, the physical/dataFormat tree is quite important for programmatically parsing fixed width, simple delimited, and complex delimited text files, as well as binary rasters. So, we plan on parsing delimited files like .tsv, .csv etc. in the editor, and providing the physical metadata needed to then load and parse these types of files into a preview-like display. So, we'll get there, and thanks for pointing this out @jagoldstein - these are all things Morpho does internally, and we want to match that on the MetacatUI side.
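
As a rough idea of the kind of detection involved, here is a small sketch in Python (the editor itself would do this in JavaScript). `describe_delimited`, the candidate delimiter set, and the returned keys are all illustrative assumptions, not planned MetacatUI behavior.

```python
import csv


def describe_delimited(sample, candidates=",\t;|"):
    """Guess the field delimiter of a text sample and count its columns.

    Hypothetical helper: the returned values are the sort of information
    a physical/dataFormat description of a delimited file would need.
    """
    # csv.Sniffer inspects the sample and picks the most consistent
    # delimiter among the candidates; it raises csv.Error if it cannot decide.
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    first_line = sample.splitlines()[0]
    return {
        "fieldDelimiter": dialect.delimiter,
        "numFields": len(first_line.split(dialect.delimiter)),
    }
```

Counting fields by splitting the first line ignores quoted delimiters, so this is only suitable for simple delimited files.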

amoeba commented 6 years ago

Not to derail this Issue, but @csjx I take strong issue with this:

> While objectName and authentication are useful, those are now more universally supported in the SystemMetadata.

Authoring DataONE System Metadata and EML are orthogonal concerns, where the former is to make DataONE happy and the latter is to document the dataset. Omitting the checksum and objectName in the EML because the System Metadata already contains them would be a mistake for long-term preservation of the scientific metadata. I wonder if we really feel different on this though.

csjx commented 6 years ago

@amoeba Ah, yes - point well taken. I agree that these things should be populated in the EML, but I guess I was just trying to make the point that the physical/dataFormat tree is quite important, and that we shouldn't forgo populating dataset/physical in MetacatUI. In fact, if we populate it at all, objectName is required. So yeah, complete EML descriptions are key, and repeating the basic information in SystemMetadata makes us more interoperable. Thanks for pointing this out. 😄

laurenwalker commented 6 years ago

At the ADC team meeting today, we decided that this is a med-high priority issue. We need to create physical sections and update them each time an object is updated (for now, that would only be the file name, since there is no "replace file" function in the editor yet).

laurenwalker commented 4 years ago

This came up again during ADC discussions, since the data team still routinely adds physical entity metadata to each submission. We should be able to easily add this metadata automatically, using the info we already know about files during upload and from the system metadata.

Generation of the physical sections should be configurable in the AppModel so some repos can turn that off if they choose to.

NCEAS / metacatui

Support `physical` sections in the editor #450