AtlasOfLivingAustralia / biocache-store

Occurrence processing, indexing and batch processing

DwC fields not being indexed #391

Closed (nickdos closed 3 years ago)

nickdos commented 4 years ago

See support ticket https://support.ehelp.edu.au/a/tickets/81984.

User flagged that some DwC fields do not appear in a download file but the fields can be seen on an individual record page.

EDIT: Outstanding tasks moved to #394

See https://biocache-ws.ala.org.au/ws/occurrences/search?q=data_resource_uid%3Adr342&facets=georeferenced_by,georeference_protocol,georeferenced_date,georeference_sources&pageSize=0

Only georeferenced_date shows values, and it is also the only one of these columns populated in CSV downloads. All the georef* fields are marked as indexed and stored: https://biocache.ala.org.au/fields?filter=georef*

Investigate why these fields are not being added to the SOLR index.
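
The facet check above can be scripted so it is repeatable for other data resources. This is a minimal sketch, not part of biocache-store: the helper names `facet_url` and `empty_facets` are hypothetical, and the response shape (a `facetResults` list whose entries carry `fieldName` and `fieldResult`) is an assumption about the search JSON that should be confirmed against a real response.

```python
from urllib.parse import urlencode

# Search endpoint taken from the URL quoted in this issue.
BASE = "https://biocache-ws.ala.org.au/ws/occurrences/search"


def facet_url(data_resource_uid, fields):
    """Build a facet-only search URL (pageSize=0) for one data resource."""
    params = {
        "q": f"data_resource_uid:{data_resource_uid}",
        "facets": ",".join(fields),
        "pageSize": 0,
    }
    # Keep commas unescaped so the URL matches the hand-written one above.
    return f"{BASE}?{urlencode(params, safe=',')}"


def empty_facets(response, fields):
    """Return the requested fields that have no facet values.

    ASSUMPTION: `response` is the parsed search JSON and contains
    "facetResults": [{"fieldName": ..., "fieldResult": [...]}, ...].
    A field with an empty or missing result list counts as empty.
    """
    populated = {
        facet.get("fieldName")
        for facet in response.get("facetResults", [])
        if facet.get("fieldResult")
    }
    return [name for name in fields if name not in populated]
```

Fetching the URL and parsing the body as JSON is left out so the sketch has no network dependency; any HTTP client can supply the `response` dictionary.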

charvolant commented 4 years ago

The raw fields get indexed. https://biocache-ws.ala.org.au/ws/occurrences/search?q=data_resource_uid%3Adr342&facets=raw_georeferenced_by,raw_georeference_protocol,raw_georeferenced_date,raw_georeference_sources&pageSize=0

Looking at the cassandra table, georeferencedBy_p is not being updated from georeferencedBy. However, georeferencedDate_p is.
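
The raw-versus-processed comparison described here can be expressed as a small per-row check. This is only a sketch under stated assumptions: the row is represented as a plain dictionary keyed by column name, the `_p` suffix convention is taken from the column names mentioned above, and `unprocessed_columns` is a hypothetical helper rather than anything in biocache-store.

```python
def _has_value(value):
    """Treat None and blank strings as empty."""
    return value is not None and str(value).strip() != ""


def unprocessed_columns(row, raw_names, suffix="_p"):
    """Return raw columns that hold a value while their processed
    counterpart (raw name + suffix) is missing or empty.

    ASSUMPTION: `row` is a dict of column name -> value for one record.
    """
    return [
        name
        for name in raw_names
        if _has_value(row.get(name)) and not _has_value(row.get(name + suffix))
    ]
```

Run over a sample of rows for a data resource, a non-empty result for a column such as georeferencedBy would point at the processing step rather than the indexing step.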

nickdos commented 4 years ago

@charvolant the user came back and said samplingProtocol is also not showing up - should I create a new issue or leave it here?

Mesibov commented 4 years ago

@nickdos wrote: "User flagged that some DwC fields do not appear in a download file but the fields can be seen on an individual record page."

From 2018 paper (https://doi.org/10.3897/zookeys.751.24791)

"identifiedBy: ...The original identifiedBy_raw data item appears on the ALA webpage as “Identified by” for the record but is missing from the standard (recommended) download."

"locality: ...The original locality_raw data item appears on the ALA webpage as “Locality” for the record but is missing from the standard (recommended) download."

These two were subsequently fixed, but was no automated check put in place to ensure that downloaded fields match the databased fields, or at least that a field populated in the database is not empty in the download? Was it left to users to spot instead?
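
An empty-versus-non-empty check of the kind asked about could look roughly like the following. It is a sketch only: it assumes the download is available as CSV text whose headers use the same field names as the store, and that a list of fields known to be populated in the store has been obtained separately. Both function names are hypothetical.

```python
import csv
import io


def populated_columns(csv_text):
    """Return the set of CSV columns with at least one non-blank value."""
    populated = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        for name, value in row.items():
            # `name` is None for surplus cells; `value` is None for short rows.
            if name is not None and value and value.strip():
                populated.add(name)
    return populated


def missing_from_download(csv_text, populated_in_store):
    """Fields populated in the store but empty or absent in the download.

    ASSUMPTION: download headers and store field names are identical.
    """
    return sorted(set(populated_in_store) - populated_columns(csv_text))
```

A check like this only detects columns that are entirely empty; it does not compare individual values between the download and the stored record.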

timhicks-ala commented 4 years ago

Additional fields to add if applicable:

These are related to iNaturalist and the community identification of a sighting. Neither of these is currently exported in any download, making it impossible to determine the community's confidence in a record's ID in any downloaded set of iNat data.

Issue raised in helpdesk ticket 84773 as I couldn't advise the user to specifically use those fields in a download to gauge accuracy of records.

ansell commented 4 years ago

https://github.com/AtlasOfLivingAustralia/biocache-service/issues/317 is still an issue even though it was closed at one point due to confusion about the nature of the bug.

The processed sampling protocol field is not consistently populated from the raw values, so downloads look odd and are missing values in the "samplingProtocol" column.

nickdos commented 3 years ago

Not yet appearing in prod SOLR. Keeping in QA.

nickdos commented 3 years ago

Facets now have values.