SpeciesFileGroup / taxonworks

Workbench for biodiversity informatics.
http://taxonworks.org

Task - DwC Importer - Macro for Data attributes #2072

Open mjy opened 3 years ago

mjy commented 3 years ago

@LocoDelAssembly we are considering this option.

As a curator I want to add data attributes as columns in a format with the header something like this:

Where composed like:

Given this do this:

LocoDelAssembly commented 3 years ago

Predicates need a definition (necessarily a somewhat long one) to pass validation and be created. If MicroHabitat has to be converted to Micro habitat, I think it would be best to use Micro habitat from the beginning; there isn't any problem if the CSV header has spaces. It is a problem, however, when trying to put this in meta.xml when submitting a real DwC-A and not a text file (spaces or not), since the term name has to be a valid URI.

Would https://dwc.tdwg.org/terms/#dwc:dynamicProperties be too far from what is being attempted here? (Consider that all macro attributes would be placed in a single column.)
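For illustration, a minimal sketch of what reading such a single column could look like, assuming the dynamicProperties cell holds a JSON object of key/value pairs (the encoding Darwin Core suggests as good practice). The function name and the sample row are hypothetical and are not part of the TaxonWorks importer.

```python
import json


def parse_dynamic_properties(cell: str) -> dict[str, str]:
    """Decode one dynamicProperties cell into name -> value pairs.

    Assumes the cell is empty or contains a single JSON object.
    """
    if not cell.strip():
        return {}
    data = json.loads(cell)
    if not isinstance(data, dict):
        raise ValueError("dynamicProperties is expected to be a JSON object")
    return {str(key): str(value) for key, value in data.items()}


# Hypothetical row: every macro attribute lives in the same column.
row = {
    "catalogNumber": "123",
    "dynamicProperties": '{"Micro habitat": "under a leaf"}',
}
print(parse_dynamic_properties(row["dynamicProperties"]))
```

The trade-off noted above still applies: the attributes are no longer visible as separate columns in the grid.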

mjy commented 3 years ago

Forgot about the definition. Good point. Therefore, new workflow: the curator must have Predicates created before import. Errors are raised if they are not detected. This should allow us to reference the Predicate more concisely, perhaps simply by name, or, to be safer, by some id/name column.

mjy commented 3 years ago

@LocoDelAssembly this becomes a priority for us, it's basically the last thing we need to "unlock" Brian's data.

Note the slight similarities to #1773. In this case there is no existing field; however, it might work that these fields are also in the "official" TaxonWorks extension, so the two might be linked.

"Accepted" proposal is:

The <ignored user label> is strictly an annotation on the dataset, the importer ignores it. It's used so that a human looking at the file can see what they intended to put into the data.

micro-habitat sensu brian:taxonworks:data_attribute:2312
stream
under a leaf

debpaul commented 3 years ago

@mjy I might suggest that users provide labels from their end without spaces (to avoid headaches). So "micro-habitat" preferred over micro habitat

mjy commented 3 years ago

Spaces technically won't matter, in that we will ignore everything before the ':', but yes, best practices should exclude them.
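A minimal sketch of how such a header could be split, assuming the `<ignored user label>:taxonworks:data_attribute:<id>` shape shown in the example above. The regular expression, the class, and the function name are hypothetical and are not taken from the importer.

```python
import re
from typing import NamedTuple, Optional

# Everything before ":taxonworks:data_attribute:<id>" is treated as the
# ignored, human-readable label (it may contain spaces or even colons).
HEADER_MACRO = re.compile(
    r"^(?P<label>.*):taxonworks:data_attribute:(?P<predicate_id>\d+)$"
)


class DataAttributeColumn(NamedTuple):
    label: str          # annotation only, ignored on import
    predicate_id: int   # id of an existing Predicate


def parse_header(header: str) -> Optional[DataAttributeColumn]:
    """Return the parsed macro, or None if the header is an ordinary column."""
    match = HEADER_MACRO.match(header.strip())
    if match is None:
        return None
    return DataAttributeColumn(match["label"], int(match["predicate_id"]))


print(parse_header("micro-habitat sensu brian:taxonworks:data_attribute:2312"))
print(parse_header("scientificName"))
```

Only the numeric id is used to find the Predicate, so a label with spaces parses the same way as one without.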

debpaul commented 3 years ago

@mjy how does the user of TW know the numeric value for the predicate?

mjy commented 3 years ago

They can currently find it through the data card for that predicate. New issue though: show it prominently in the Manage task.

LocoDelAssembly commented 3 years ago

The problem with using the column header as a macro is that there is actually no such thing as a column header when you are submitting a real DwC-A and not a TSV. The headers you see in the grid view are created by extracting the last "directory" from the path:

<field index="2" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
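A minimal sketch of that extraction, using only the Python standard library. The function name is hypothetical; it illustrates the idea of taking the last path segment and is not the importer's actual code.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def header_from_term(term: str) -> str:
    """Use the last path segment of a term URI as the displayed header."""
    path = urlparse(term).path
    return path.rstrip("/").rsplit("/", 1)[-1]


field = ET.fromstring(
    '<field index="2" term="http://rs.tdwg.org/dwc/terms/scientificName"/>'
)
print(field.get("index"), header_from_term(field.get("term")))
```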

The standard option would be to encode the data in dynamicProperties (but it would all be in a single column), or to accept TW-namespaced terms and make the importer remove most of the term when displaying it as a header. Something like the code below?

<field index="42" term="https://taxonworks.org/dwc/macro/micro%20habitat:taxonworks:predicate:9343"/>

mjy commented 3 years ago

Thanks for clarifying. Your representation is fine, though maybe we could refine the term to simplify it.

Thought: could the importer find the label from the Predicate itself, and use that as the column header? If not found, it would display "Unmapped" or "Requires Predicate" or some such. That would simplify the "IRI" term by removing the label from it.
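A minimal sketch of that lookup, assuming the term shape from the example above (`…/<optional label>:taxonworks:predicate:<id>`). The `PREDICATE_NAMES` dictionary is a hypothetical stand-in for the project's Predicates, and all names here are illustrative.

```python
import re
from urllib.parse import unquote, urlparse

# The label part is optional, so the simplified term without it also matches.
MACRO_TERM = re.compile(
    r"^(?:(?P<label>.*):)?taxonworks:predicate:(?P<predicate_id>\d+)$"
)

# Hypothetical stand-in for Predicates that already exist in the project.
PREDICATE_NAMES = {9343: "Micro habitat"}


def column_header(term: str, predicate_names: dict[int, str]) -> str:
    """Pick the grid header for a term, preferring the Predicate's own name."""
    last_segment = unquote(urlparse(term).path.rsplit("/", 1)[-1])
    match = MACRO_TERM.match(last_segment)
    if match is None:
        return "Unmapped"
    predicate_id = int(match["predicate_id"])
    return predicate_names.get(predicate_id, "Requires Predicate")


term = "https://taxonworks.org/dwc/macro/micro%20habitat:taxonworks:predicate:9343"
print(column_header(term, PREDICATE_NAMES))
```

Because the header comes from the Predicate, the label in the term becomes redundant and could be dropped.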

mjy commented 3 years ago

@LocoDelAssembly was thinking: perhaps we need 2 (maybe more ultimately) specific DwC extensions. TaxonWorks could define behaviour based on the extension type (which I assume can be encoded in the description).

The first extension would be only data that we want to map to DataAttributes. Each IRI would be a Predicate global_id, like gid://taxon-works/Predicate/29, maybe prefixed or postfixed with the class it is to be added to, unsure: gid://taxon-works/Predicate/29/CollectionObject. Or we could fork the extensions, and each extension would map to one class of data.

The second/other class of extensions would be specific to existing TaxonWorks fields.

In general, the idea is that we encode some of how we need to process the extension in the extension type.
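A minimal sketch of how the two IRI shapes mentioned above could be told apart, assuming the optional target class is appended as a trailing path segment. The class and function names are hypothetical; this is not how Rails GlobalID itself resolves records.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlparse


class PredicateTerm(NamedTuple):
    predicate_id: int
    target_class: Optional[str]  # e.g. "CollectionObject", if postfixed


def parse_predicate_gid(term: str) -> PredicateTerm:
    """Parse gid://taxon-works/Predicate/<id>[/<TargetClass>]."""
    parsed = urlparse(term)
    parts = [part for part in parsed.path.split("/") if part]
    valid = (
        parsed.scheme == "gid"
        and parsed.netloc == "taxon-works"
        and len(parts) in (2, 3)
        and parts[0] == "Predicate"
        and parts[1].isdecimal()
    )
    if not valid:
        raise ValueError(f"not a Predicate global id: {term!r}")
    target_class = parts[2] if len(parts) == 3 else None
    return PredicateTerm(int(parts[1]), target_class)


print(parse_predicate_gid("gid://taxon-works/Predicate/29"))
print(parse_predicate_gid("gid://taxon-works/Predicate/29/CollectionObject"))
```

With one extension per class of data, the trailing segment would be unnecessary, since the extension type would carry that information.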

LocoDelAssembly commented 3 years ago

Example file with mappings to CollectionObject and CollectingEvent data attributes (InternalAttribute): dwc.zip

mjy commented 3 years ago

@LocoDelAssembly Can you say a little more about that example? The column headers, that are not column headers, are supposed to be in the file in that way for processing by us?

LocoDelAssembly commented 3 years ago

Yes. The headers line is normally ignored by DwC-A processors (https://tools.gbif.org/dwca-reports/084-5217764988802235791.html), but TW reads it (if any) and looks for TW:DataAttribute:.* fields.
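A minimal sketch of that header scan for a tab-separated data file. Only the `TW:DataAttribute:` prefix comes from the comment above; the layout of the part after the prefix in the sample header (target class, then predicate name) is an assumption made for illustration, as are the function and variable names.

```python
import csv
import io
import re

DATA_ATTRIBUTE_FIELD = re.compile(r"^TW:DataAttribute:(?P<rest>.+)$")


def data_attribute_columns(tsv_text: str) -> dict[int, str]:
    """Map column index -> text after the TW:DataAttribute: prefix."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    headers = next(reader, [])  # the header line, if the file has one
    columns = {}
    for index, header in enumerate(headers):
        match = DATA_ATTRIBUTE_FIELD.match(header.strip())
        if match:
            columns[index] = match["rest"]
    return columns


# Hypothetical file contents; the header after the prefix is an assumed layout.
sample = (
    "catalogNumber\tTW:DataAttribute:CollectionObject:Micro habitat\n"
    "123\tunder a leaf\n"
)
print(data_attribute_columns(sample))
```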

This is how it looks: image

Imported example: image image image

I don't think we are ever going to be able to support this on a proper extension, as it should exist beforehand and requires a fixed set of terms; there is no wildcard trickery that could be used AFAIK. If fields are added right into the meta.xml core mappings using the URIs as terms, I haven't tested it but I guess that would work too; however, it won't pass GBIF's checks.

mjy commented 3 years ago

> I don't think we are ever going to be able to support this on a proper extension, as it should exist beforehand and requires a fixed set of terms,

Thanks for the clarification @LocoDelAssembly, technically we're there, but this feels very kludgy. Is the primary problem with using the extension that it will fail (all?) checks with GBIF? If so then maybe they need an option to skip checking extensions? If so I'm not terribly worried, but it would be clear that we should open an issue with them to allow this option as we would want to verify the core is legal, but ignore the extension.

I can imagine a second problem that the DwC handling gem doesn't allow the extension to parse? If so, same approach as above I feel, should be updated.

A third problem is perhaps formalizing our extension terms, but this could very much be done at least in some temporary place?

Ping @debpaul. The ability to write (and use) a not-yet-formalized extension is critical, as it acts as a means to test a proof of concept. The current "hack" (though technically very cool) doesn't feel right. What are we missing? Do we write it, just fail GBIF checks, and have to live with it?

LocoDelAssembly commented 3 years ago

The gem can parse extension data (the importer currently is just saving the data but not using it). However, I don't think it is possible to prepare a schema like https://rs.gbif.org/extension/dwc/resource_relation.xml for TW macros: there are no wildcards for qualname.

Extensions of our own (not for macros, but for fixed TW attributes) would also fail GBIF tests because it won't recognize them (unless they get registered), but the archive would be valid nonetheless.

dwc-archive also supports mapping right into meta.xml, but again will be flagged as unknown terms in GBIF's validator (like with ITIS's superfamily).

mjy commented 3 years ago

> dwc-archive also supports mapping right into meta.xml, but again will be flagged as unknown terms in GBIF's validator (like with ITIS's superfamily).

This feels like the right approach? What's the main downside to it? We would have to update the importer UI + parser to handle extensions?

Does dwc-archive let us get a name for each extension? I.e. from the name we can determine the action the importer should make?

mjy commented 3 months ago

Feels like this is outdated given what we have implemented, can we close? @LocoDelAssembly @debpaul

LocoDelAssembly commented 3 months ago

Looks like it can be closed to me.