GRSciColl - Collection descriptors example

ManonGros commented 5 months ago

As part of exploring how to implement roadmap item 2 (https://scientific-collections.gbif.org/road-map#2-support-structured-collection-descriptors), here is some actual data that would feed into the collection descriptors.

The fields that are mapped to a Darwin Core or Latimer Core term are prefixed with dwc: or ltc: (for example dwc:country). I didn't map everything but tried to get a bit of everything across the examples.
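For illustration, the prefix convention makes it easy to tell mapped columns from unmapped ones. The following is a minimal sketch, not part of any GBIF tooling: the function name `split_headers` and the sample header list are assumptions made up for this example.

```python
# Prefixes used in the mapped files described above.
KNOWN_PREFIXES = ("dwc:", "ltc:")


def split_headers(headers):
    """Separate column headers into mapped (prefixed) and unmapped ones."""
    mapped, unmapped = [], []
    for header in headers:
        # str.startswith accepts a tuple of candidate prefixes.
        if header.startswith(KNOWN_PREFIXES):
            mapped.append(header)
        else:
            unmapped.append(header)
    return mapped, unmapped


# Hypothetical header row mixing mapped and unmapped columns.
mapped, unmapped = split_headers(
    ["dwc:country", "ltc:objectClassificationName", "Num. Databased"]
)
```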

GRSciColl entry	source of the data	raw tables	data mapped to Latimer Core and Darwin Core	notes
https://scientific-collections.gbif.org/collection/151c91ba-a521-4735-8c11-5abf8db7fb67	Index Herbariorum	IH_raw_151c91ba-a521-4735-8c11-5abf8db7fb67.csv	IH_dwcltc_151c91ba-a521-4735-8c11-5abf8db7fb67.csv	I didn't map some of the quantitative values `Num. Databased` and `Num. Imaged`*
https://scientific-collections.gbif.org/collection/a717e77c-ea99-4d81-83ff-81931e753ffc	Index Herbariorum	IH_raw_a717e77c-ea99-4d81-83ff-81931e753ffc.csv	IH_dwcltc_a717e77c-ea99-4d81-83ff-81931e753ffc.csv
https://scientific-collections.gbif.org/collection/a717e77c-ea99-4d81-83ff-81931e753ffc	GBIF Dataset metadata	dataset_a717e77c-ea99-4d81-83ff-81931e753ffc.csv	dataset_dwcltc_a717e77c-ea99-4d81-83ff-81931e753ffc.csv
https://scientific-collections.gbif.org/collection/a717e77c-ea99-4d81-83ff-81931e753ffc	http://rnc.humboldt.org.co	rnc_raw_preservation_humbolt_a717e77c-ea99-4d81-83ff-81931e753ffc.csv	rnc_dwcltc_preservation_humbolt_a717e77c-ea99-4d81-83ff-81931e753ffc.csv	This is only one of the tables visible on the website; see the other examples below. I mapped only one field because I didn't know how to map the others. Note that the preservation type would have to be interpreted to be searchable.
https://scientific-collections.gbif.org/collection/a717e77c-ea99-4d81-83ff-81931e753ffc	http://rnc.humboldt.org.co	rnc_raw_types_humbolt_a717e77c-ea99-4d81-83ff-81931e753ffc.csv	rnc_dwcltc_types_humbolt_a717e77c-ea99-4d81-83ff-81931e753ffc.csv (see also this alternative mapping: rnc_ALTERNATIVE_dwcltc_types_humbolt_a717e77c-ea99-4d81-83ff-81931e753ffc.csv)	I made two different mappings: one is closer to the original form and the other is transformed a bit.
https://scientific-collections.gbif.org/collection/6eae4377-f8b4-41ac-a9c1-db5a81afde98	http://rnc.humboldt.org.co	rnc_raw_geography_6eae4377-f8b4-41ac-a9c1-db5a81afde98.csv	rnc_dwcltc_geography_6eae4377-f8b4-41ac-a9c1-db5a81afde98.csv
https://scientific-collections.gbif.org/collection/6eae4377-f8b4-41ac-a9c1-db5a81afde98	http://rnc.humboldt.org.co	rnc_raw_levels_6eae4377-f8b4-41ac-a9c1-db5a81afde98.csv	rnc_dwcltc_levels_6eae4377-f8b4-41ac-a9c1-db5a81afde98.csv	Lots of quantitative values were left unmapped *
https://scientific-collections.gbif.org/collection/6eae4377-f8b4-41ac-a9c1-db5a81afde98	http://rnc.humboldt.org.co	rnc_raw_types_6eae4377-f8b4-41ac-a9c1-db5a81afde98.csv	rnc_dwcltc_types_6eae4377-f8b4-41ac-a9c1-db5a81afde98.csv (alternative mapping: rnc_ALTERNATIVE_dwcltc_types_6eae4377-f8b4-41ac-a9c1-db5a81afde98.csv)	I made two possible mappings for that one too.
https://scientific-collections.gbif.org/collection/2a8835ad-4a2e-43df-b976-f924f76fe628	SwissCollNet	swisscollnet_raw_2a8835ad-4a2e-43df-b976-f924f76fe628.csv	swisscollnet_dwcltc_2a8835ad-4a2e-43df-b976-f924f76fe628.csv. I also tried an alternative mapping split into two tables: one holding the number of specimens (swisscollnet_ALTERNATIVE_dwcltc_part1_2a8835ad-4a2e-43df-b976-f924f76fe628.csv) and another with one line per collector (swisscollnet_ALTERNATIVE_dwcltc_part2_2a8835ad-4a2e-43df-b976-f924f76fe628.csv)	Lots of values (including quantitative ones) remain unmapped*
https://scientific-collections.gbif.org/collection/3c41e738-b94e-4ed6-a9ae-f57c7baaf521	SwissCollNet	swisscollnet_raw_3c41e738-b94e-4ed6-a9ae-f57c7baaf521.csv	swisscollnet_dwcltc_3c41e738-b94e-4ed6-a9ae-f57c7baaf521.csv	Lots of values (including quantitative ones) remain unmapped*

* I left the quantitative values unmapped because I wasn't sure how best to do it. The Latimer Core recommendation is to use https://tdwg.github.io/ltc/terms/#MeasurementOrFact_MeasurementOrFact with a definition for each measurement. That doesn't fit easily into these flat tables, and each source has its own metrics, so it would be quite difficult to combine everything (check the examples provided).
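To make the mismatch concrete, here is a rough sketch of what moving quantitative columns out of a flat row and into one record per measurement could look like. It is only an illustration under stated assumptions: the helper `flat_row_to_measurements`, the output keys `measurementType` and `measurementValue`, and the sample values are all invented for this example and are not taken from the linked files.

```python
def flat_row_to_measurements(row, quantitative_columns):
    """Turn the quantitative cells of one flat row into one record each.

    Each record pairs the source's own metric name with its raw value,
    which is why metrics from different sources do not line up easily.
    """
    measurements = []
    for column in quantitative_columns:
        raw = (row.get(column) or "").strip()
        if not raw:
            continue  # the cell is empty in this row, nothing to record
        measurements.append(
            {"measurementType": column, "measurementValue": raw}
        )
    return measurements


# Invented sample row: one metric filled in, the other left empty.
row = {"dwc:country": "Colombia", "Num. Databased": "1200", "Num. Imaged": ""}
records = flat_row_to_measurements(row, ["Num. Databased", "Num. Imaged"])
```

Every source would contribute its own `measurementType` labels here, so combining them would still require a shared definition per metric.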

As a side note, many fields that I mapped to ltc:objectClassificationName should/would use the DiSSCo topicCategory vocabulary (https://docs.google.com/document/d/19OPyOm9VF2qfI3M6RmJPvRfo8JlZ3tt0II05aGCyBHQ/edit).

MortenHofft commented 5 months ago

@ManonGros from our conversation earlier, this is what I understood.

A CSV can have one or more columns. There are no mandatory columns. CSVs can be sparsely populated (one row can be completely filled, and the next can have only a few of the columns filled).

Some of the columns will be multivalue fields (e.g. pipe-separated preparations).
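The rules above could be sketched as a small reader that accepts any set of columns, drops empty cells, and splits pipe-separated values. This is a minimal sketch under assumptions, not the registry's implementation: the function `read_descriptor_rows`, the choice of `dwc:preparations` as the multivalue column, and the sample data are all made up for illustration.

```python
import csv
import io

# Assumption for this sketch: which columns are treated as multivalue.
MULTIVALUE_COLUMNS = {"dwc:preparations"}


def read_descriptor_rows(text, multivalue_columns=MULTIVALUE_COLUMNS):
    """Parse CSV text into dicts, keeping only the cells that are filled."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = {}
        for column, value in raw.items():
            if column is None:
                continue  # extra cells beyond the header row are ignored
            value = (value or "").strip()
            if not value:
                continue  # sparse rows: empty cells are simply left out
            if column in multivalue_columns:
                parts = [part.strip() for part in value.split("|")]
                row[column] = [part for part in parts if part]
            else:
                row[column] = value
        rows.append(row)
    return rows


# Invented sample: the second row fills only one of the three columns.
SAMPLE = (
    "dwc:country,dwc:preparations,ltc:objectClassificationName\n"
    "Colombia,skin | skull,Mammals\n"
    ",,Birds\n"
)
rows = read_descriptor_rows(SAMPLE)
```

No column is required by this reader, so a file with a single column parses the same way as a fully populated one.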

ManonGros commented 5 months ago

Yes, that's exactly it.

For the multivalued fields, I would be in favour of supporting the ones that are already supported in the occurrence index (recordedBy), but not supporting multiple values for every field. Alternatively, we can say we don't support multiple values at all and force people to upload each value in a separate row.

Essentially, this is the difference between this mapping: swisscollnet_dwcltc_2a8835ad-4a2e-43df-b976-f924f76fe628.csv and this one swisscollnet_ALTERNATIVE_dwcltc_part2_2a8835ad-4a2e-43df-b976-f924f76fe628.csv.
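The two shapes can be related with a small transformation: one row holding a pipe-separated value becomes several rows holding one value each. The sketch below is illustrative only; the helper `explode_multivalue`, the pipe separator, and the collector names are assumptions and do not come from the SwissCollNet files.

```python
def explode_multivalue(row, column, separator="|"):
    """Return one copy of the row per value found in a multivalue cell."""
    value = row.get(column) or ""
    parts = [part.strip() for part in value.split(separator)]
    parts = [part for part in parts if part]
    if not parts:
        return [dict(row)]  # nothing to split, keep the row unchanged
    # Every other column is repeated on each of the resulting rows.
    return [{**row, column: part} for part in parts]


# Invented example row with two collectors in a single cell.
combined = {
    "dwc:recordedBy": "A. Collector | B. Collector",
    "ltc:objectClassificationName": "Insects",
}
separate_rows = explode_multivalue(combined, "dwc:recordedBy")
```

Note that repeating the other columns on every row is only safe for descriptive values; a count stored in the same row would be duplicated, which may be one reason to keep counts in a separate table.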

MortenHofft commented 5 months ago

Are scientific name and country multivalue? I ask because they are interpreted and add hierarchies, and that would make them different from the occurrence index.

ManonGros commented 5 months ago

country and scientificName aren't multivalue in the occurrence index and I don't think they should be in the collection descriptors either (it would be a headache).

gbif / registry

GRSciColl - Collection descriptors example #557