Open cha801p opened 1 month ago
DwC Terms were updated in the preingestion and this was tested using dr22687. Please refer to the ticket https://github.com/AtlasOfLivingAustralia/data-management/issues/1054 for more details.
In the DAG elastic_dataset_indexing, elastic-cleanup.sh is run. This should be the solution for the second issue (duplicates).
In the most recent DwCA events archive export, event, verbatim_event and verbatim_occurrence all have the http://rs.tdwg.org/dwc/terms/eventType field. However, event.txt has no value in that field. See dr22687, which was fully processed, exported and indexed in Elasticsearch with the more recent pipelines version in test.
Is there a data resource where the ingestion of the eventType fails?
Is it intentional that only event.txt (and the verbatim files) has the eventType field? Should occurrence.txt also have an eventType field? See the exported meta.xml for dr22687.
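One way to answer these questions for a given export is to read meta.xml and count the non-empty eventType values in each data file. The following is a minimal sketch, not the pipelines' own tooling: the function name `event_type_report` is hypothetical, and it assumes a standard Darwin Core Archive layout (meta.xml at the zip root, UTF-8 text files, no special field quoting).

```python
import csv
import io
import xml.etree.ElementTree as ET
import zipfile

# Namespace used by Darwin Core Archive meta.xml descriptors
NS = {"dwc": "http://rs.tdwg.org/dwc/text/"}
EVENT_TYPE = "http://rs.tdwg.org/dwc/terms/eventType"


def event_type_report(archive_path):
    """Return {file name: (declared, non_empty_rows)} for each data file.

    Hypothetical helper: 'declared' says whether meta.xml maps a column to
    eventType, 'non_empty_rows' counts rows with a value in that column.
    """
    report = {}
    with zipfile.ZipFile(archive_path) as zf:
        root = ET.fromstring(zf.read("meta.xml"))
        tables = root.findall("dwc:core", NS) + root.findall("dwc:extension", NS)
        for table in tables:
            location = table.find("dwc:files/dwc:location", NS).text.strip()
            # Column index of eventType, if this file declares the term
            index = next(
                (int(f.get("index")) for f in table.findall("dwc:field", NS)
                 if f.get("term") == EVENT_TYPE and f.get("index") is not None),
                None,
            )
            if index is None:
                report[location] = (False, 0)
                continue
            delimiter = table.get("fieldsTerminatedBy", ",")
            if delimiter == "\\t":  # meta.xml writes a tab as the two characters \t
                delimiter = "\t"
            skip = int(table.get("ignoreHeaderLines", "0"))
            with zf.open(location) as fh:
                text = io.TextIOWrapper(fh, encoding="utf-8", newline="")
                rows = csv.reader(text, delimiter=delimiter)
                filled = sum(
                    1 for n, row in enumerate(rows)
                    if n >= skip and len(row) > index and row[index].strip()
                )
            report[location] = (True, filled)
    return report
```

For the case described above, the entry for event.txt would be `(True, 0)` (term declared, column empty), and occurrence.txt would be `(False, 0)` if meta.xml does not declare the term for it.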
There is a problem running the DAG sh to delete the index before updating. At first glance it appears to be an Elasticsearch network.host permission issue.
We’ve identified several issues with the handling of the eventType on events.test:
[ ] The ingestion pipelines now recognise the term <field index="x" term="http://rs.tdwg.org/dwc/terms/eventType"/> within the meta.xml file. This leads to the eventType being captured on events.test.
[ ] To ensure data integrity, we must manually delete the existing Elasticsearch data using curl commands from the databox. If we skip this step, reingesting data does not completely overwrite the previous data. This can leave duplicated information in the index, causing inconsistencies.
[ ] After successful data ingestion and indexing, the meta.xml file generated in the ala-databox-avro/dwca-exports/drXXXX.zip folder contains the term <field index="x" term="http://rs.tdwg.org/dwc/terms/eventType"/>, while the eventType column in the DwCA exports is coming up empty. This means that there are still inconsistencies in the reading and processing of the term.
As eventType has been added to the DwC standard (https://dwc.tdwg.org/list/#dwc_eventType), this change is required in the pipelines code.
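The manual cleanup in the second item could be scripted along the following lines. This is only a sketch of Elasticsearch's `_delete_by_query` endpoint, not the curl commands actually used from the databox: the host, the index name `event`, and the field name `datasetKey` are assumptions and need to be replaced with the real values for events.test.

```python
import json
import urllib.request


def build_delete_request(es_url, index, field, value):
    """Build a POST to <es_url>/<index>/_delete_by_query for one data resource.

    All four arguments are placeholders supplied by the caller; nothing here
    is taken from the actual events.test configuration.
    """
    body = {"query": {"term": {field: value}}}
    url = f"{es_url.rstrip('/')}/{index}/_delete_by_query?refresh=true"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def delete_dataset_docs(es_url, index, field, value):
    """Send the request and return the number of deleted documents."""
    request = build_delete_request(es_url, index, field, value)
    with urllib.request.urlopen(request) as response:
        return json.load(response).get("deleted")


if __name__ == "__main__":
    # Hypothetical values: adjust host, index and field before running.
    print(delete_dataset_docs("http://localhost:9200", "event", "datasetKey", "dr22687"))
```

Running the delete before reingesting a data resource should leave no stale documents for that resource in the index, which is the duplication problem described above.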