Open rosemaryjoconnor opened 1 year ago
iNaturalist is a good example: https://collections.ala.org.au/public/show/dr1411
Last checked date; Most recent data published on
What is the best way to represent this on collectory?
Dates should probably be updated in the collectory using a DAG when the index is run.
Explore load dates in biocache. Suggest an approach to fix.
We need to define these:
Last checked: Date the fetcher was run in pre-ingestion. What if there is no fetcher?
Data currency: Discuss: Load_dataset date vs Index publish date
@sadeghim @patkyn what do you think?
The first seems to be a pre-ingestion job. The second is perhaps a pipelines-airflow job?
My impression is that both are very close, as we use pre-ingestion for all loads now. Updating collectory can happen in pre-ingestion for all DRs, no matter whether there is a fetcher for them or not. Since we trigger the load right after pre-ingestion, the difference will be around minutes to hours. In that sense, I think we need a clear definition of these fields to differentiate them; otherwise they both have almost the same value and one of them is redundant.
After discussion:
These fields in the collectory should be populated by a DAG (preingestion):
- Last Checked: the last date that a process checked that data was available, whether or not there is a new dataset to load. Date updated via pre-ingestion.
- Data Currency: the last date that data was received by ALA, meaning there was a new dataset to ingest into the ALA. The date can be set via the full-index DAG.
- Timestamp of the last dataset: we talked about this and don't need to add it to collectory, as it needs implementation in the code.
Change collectory to also add these biocache queries:
- DR first loaded: min(first load date) - the date the data resource was first loaded into the ALA (DwCA to Verbatim AVRO)
- DR last loaded: max(last load date) - the date the Darwin Core archive was last loaded into the ALA (DwCA to Verbatim AVRO)
- DR last processed: max(last processed) (Pipelines)
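The distinction agreed above between Last Checked and Data Currency could be sketched roughly as follows. This is only an illustration of the rule, not code from preingestion or pipelines-airflow: `ResourceDates` and `record_check` are hypothetical names, and in practice the Data Currency date may be set by the full-index DAG rather than at check time.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ResourceDates:
    """Hypothetical holder for the two collectory dates discussed above."""
    last_checked: Optional[datetime] = None
    data_currency: Optional[datetime] = None


def record_check(dates: ResourceDates, checked_at: datetime,
                 new_dataset_found: bool) -> ResourceDates:
    """Return the dates after one check for new data (hypothetical helper)."""
    # Last Checked moves forward on every check, even when nothing new arrived.
    # Data Currency only moves when a new dataset was actually received.
    return ResourceDates(
        last_checked=checked_at,
        data_currency=checked_at if new_dataset_found else dates.data_currency,
    )


if __name__ == "__main__":
    first = record_check(ResourceDates(),
                         datetime(2023, 5, 20, tzinfo=timezone.utc), True)
    second = record_check(first,
                          datetime(2023, 6, 1, tzinfo=timezone.utc), False)
    print(second)
```

With this rule the two fields diverge whenever a check finds nothing new, which is what keeps one of them from being redundant.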
Tasks:
- DR first loaded: min(first load date)
- DR last loaded: max(raw_lastModifiedTime) - date verbatim AVRO written
- DR last processed: max(lastModifiedTime) - date interpreted AVRO written
See https://github.com/AtlasOfLivingAustralia/biocache-service/issues/846 for detail.
- Last checked: lastModifiedTime (this could be obtained from https://biocache.ala.org.au/ws/occurrences/{UUID})
- Data currency: when the full SOLR index happens (this could be obtained from prod DwCA imports from S3)
1. Last Checked:
2. Data Currency:
Needs to update collectory (ask Systems)
Also, ask to add these three fields to the collectory: DR first loaded, DR last loaded, DR last processed.
Field Name | Human Definition | Query |
---|---|---|
Last Checked | the last date that a process checked that data was available whether there is a new dataset to load or not | 1. timestamp on DwCA in dwca-imports/; 2. Preingestion to set this date on Biocollect/OBIS |
Data Currency | the date a dataset was delivered and ready to load | DwCA timestamp in dwca-imports/ |
First Loaded | the first date this data resource was loaded into biocache from a dwca | biocache min(first load date) |
Last Loaded | the last date this data resource was loaded into biocache from a dwca (ie verbatim AVRO stage) | biocache max(raw_lastModifiedTime) or max(lastLoadDate) |
Last Processed | the last date this data resource was reprocessed in biocache | max(lastProcessedDate) |
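The three biocache-derived fields in the table are simple min/max aggregates over a data resource's occurrence records. A minimal sketch of that aggregation is below. It works on plain dictionaries rather than calling biocache, and the key `first_load_date` is a placeholder for whatever the "first load date" field is actually called; `lastLoadDate` and `lastProcessedDate` are taken from the table.

```python
from datetime import datetime
from typing import Callable, Dict, Iterable, List, Mapping, Optional


def _pick(values: Iterable[Optional[datetime]],
          chooser: Callable[[List[datetime]], datetime]) -> Optional[datetime]:
    """Apply min or max to the non-missing values, or return None if there are none."""
    present = [v for v in values if v is not None]
    return chooser(present) if present else None


def summarise_load_dates(
    records: Iterable[Mapping[str, datetime]],
) -> Dict[str, Optional[datetime]]:
    """Compute First Loaded, Last Loaded and Last Processed for one data resource."""
    rows = list(records)
    return {
        # "first_load_date" is a hypothetical key name for this sketch.
        "first_loaded": _pick((r.get("first_load_date") for r in rows), min),
        "last_loaded": _pick((r.get("lastLoadDate") for r in rows), max),
        "last_processed": _pick((r.get("lastProcessedDate") for r in rows), max),
    }


if __name__ == "__main__":
    sample = [
        {"first_load_date": datetime(2020, 1, 5),
         "lastLoadDate": datetime(2023, 5, 20),
         "lastProcessedDate": datetime(2023, 5, 21)},
        {"first_load_date": datetime(2019, 3, 1),
         "lastLoadDate": datetime(2022, 11, 2)},
    ]
    print(summarise_load_dates(sample))
```

Records that lack a field are skipped rather than treated as an error, so a partially populated resource still yields whatever dates are available.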
Update: 14 December, 2023 (8 PM)
Issue: Dates and history not reflecting upload dates
Actions Taken:
[x] Conducted a thorough review of the Preingestion code and identified a specific segment requiring modification.
[x] The relevant code snippet resides at line 175 in the file dataresource.py within the Preingestion repository, accessible here: [https://github.com/AtlasOfLivingAustralia/preingestion/blob/755eaeac357870e4c1fd620c7207c091ce5637e9/collectory/dataresource.py#L175].
The modification involves updating the collectory date using the following change in the code: `def update_lastchecked(self, date=datetime.datetime.utcnow())`
The specific code in preingest_dr.py is as follows: `dr_obj.update_lastchecked(date=datetime.datetime.now())`
[x] This adjustment ensures that the collectory date reflects the current Australian time
[x] This has been tested successfully on local
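Two Python details are worth keeping in mind for this change. A default such as `date=datetime.datetime.utcnow()` is evaluated once, when the function is defined, so a long-running process would keep reusing the same timestamp. And `datetime.datetime.now()` without a timezone returns the host's local time, which matches Australian time only if the host is configured that way. The sketch below shows one way to avoid both. `DataResource` here is a hypothetical stand-in, not the real class in dataresource.py, and the `tz` parameter is an assumption added for illustration.

```python
import datetime


class DataResource:
    """Hypothetical stand-in for the preingestion data resource class."""

    def __init__(self, tz=datetime.timezone.utc):
        # Timezone used when no explicit date is passed (assumed parameter).
        self.tz = tz
        self.last_checked = None

    def update_lastchecked(self, date=None):
        # Resolve "now" at call time; a default written in the signature
        # would be computed only once, when the function is defined.
        if date is None:
            date = datetime.datetime.now(tz=self.tz)
        self.last_checked = date
        return date


if __name__ == "__main__":
    dr = DataResource()
    print(dr.update_lastchecked())
```

If the collectory date should be in an Australian timezone regardless of the host, one option is to construct the object with `zoneinfo.ZoneInfo("Australia/Sydney")` as `tz`; which zone is appropriate is an assumption to confirm.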
Next Step: Need to ask Mahmoud/Peggy where to add code for Data Currency
Status: Work Pending
Dates and History in collectory are not updating to reflect the status of data loads.
Collectory example dr343: Data updated/loaded on 2023-05-20
Collectory:
Biocache occurrence record: