AtlasOfLivingAustralia / data-management

Data management issue tracking

Collectory: Dates and history not reflecting upload dates #908

Open rosemaryjoconnor opened 1 year ago

rosemaryjoconnor commented 1 year ago

Dates and History in collectory are not updating to reflect the status of data loads.

Collectory example dr343: Data updated/loaded on 2023-05-20

Collectory:

Biocache occurrence record:

peggynewman commented 1 year ago

iNaturalist is a good example: https://collections.ala.org.au/public/show/dr1411

Last checked date / Most recent data published on

Image

What is the best way to represent this on collectory?

Dates should probably be updated in the collectory using a DAG when the index is run.

Explore load dates in biocache. Suggest an approach to fix.

peggynewman commented 1 year ago

We need to define these:

Last checked: Date the fetcher was run in pre-ingestion. What if there is no fetcher?

Data currency: Discuss: Load_dataset date vs Index publish date

@sadeghim @patkyn what do you think?

The first seems to be a pre-ingestion job. The second is perhaps a pipelines-airflow job?

sadeghim commented 1 year ago

My impression is that both are very close, as we use pre-ingestion for all loads now. Updating the collectory can happen in pre-ingestion for all DRs, no matter whether there is a fetcher for them or not. Since we trigger the load right after pre-ingestion, the difference will be around minutes to hours. In that sense, I think we need a clear definition of these fields to differentiate them; otherwise they both have almost the same value and one of them is redundant.

peggynewman commented 1 year ago

After discussion:

These fields in the collectory should be populated by a DAG (pre-ingestion):

Last Checked - the last date that a process checked that data was available, whether or not there is a new dataset to load. Date updated via pre-ingestion.

Data Currency - the last date that data was received by ALA, meaning there was a new dataset to ingest in the ALA. The date can be set via the full-index DAG.

Timestamp of the last dataset: (talked about this and don't need to add it to collectory as it needs implementation in the code)
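A minimal sketch of how the two definitions differ. `ResourceDates` and `record_check` are hypothetical names for illustration, not collectory or pre-ingestion code, and the sketch assumes the check time is also the time the new dataset was received. Last Checked moves on every check, while Data Currency moves only when a new dataset was found.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ResourceDates:
    """Hypothetical holder for the two collectory dates discussed above."""
    last_checked: Optional[datetime] = None
    data_currency: Optional[datetime] = None


def record_check(dates: ResourceDates, checked_at: datetime,
                 new_dataset_found: bool) -> ResourceDates:
    """Return the updated dates after one pre-ingestion check."""
    # Last Checked always advances: a process looked for data.
    updated = replace(dates, last_checked=checked_at)
    # Data Currency advances only when there was something new to ingest.
    if new_dataset_found:
        updated = replace(updated, data_currency=checked_at)
    return updated


# First check finds a new dataset; the second check finds nothing new.
first = record_check(ResourceDates(), datetime(2023, 5, 20, tzinfo=timezone.utc), True)
second = record_check(first, datetime(2023, 6, 1, tzinfo=timezone.utc), False)
print(second.last_checked.date(), second.data_currency.date())
```

With these definitions the two fields only diverge for resources that are checked more often than they change, which is the case raised for iNaturalist above.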

Change collectory to also add these biocache queries:

DR first loaded: min(first load date) - the date the data resource was first loaded into the ALA (DwCA to Verbatim AVRO)

DR last loaded: max(last load date) - the date the Darwin Core archive was last loaded into the ALA (DwCA to Verbatim AVRO)

DR last processed: max(last processed) (Pipelines)

sadeghim commented 1 year ago

Tasks:

peggynewman commented 10 months ago

DR first loaded: min(first load date)

DR last loaded: max(raw_lastModifiedTime) - date verbatim AVRO written

DR last processed: max(lastModifiedTime) - date interpreted AVRO written

See https://github.com/AtlasOfLivingAustralia/biocache-service/issues/846 for detail.
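As a rough illustration of the three aggregates, the sketch below computes them over an in-memory sample of records for one data resource. The sample values are made up, `firstLoadedDate` is an assumed field name for "first load date", and `summarise_load_dates` is a hypothetical helper. In practice these would presumably be min/max queries against the biocache index rather than Python loops.

```python
from datetime import datetime

# Made-up per-record timestamps for a single data resource.
# "firstLoadedDate" is an assumed field name; the other two are named above.
records = [
    {"firstLoadedDate": "2021-03-01T00:00:00Z",
     "raw_lastModifiedTime": "2023-05-20T02:10:00Z",
     "lastModifiedTime": "2023-05-20T04:30:00Z"},
    {"firstLoadedDate": "2022-07-15T00:00:00Z",
     "raw_lastModifiedTime": "2023-05-19T23:55:00Z",
     "lastModifiedTime": "2023-06-02T01:00:00Z"},
]


def parse_utc(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp with a trailing 'Z' into an aware datetime."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def summarise_load_dates(rows):
    """Compute DR first loaded, last loaded and last processed for one resource."""
    return {
        "first_loaded": min(parse_utc(r["firstLoadedDate"]) for r in rows),
        "last_loaded": max(parse_utc(r["raw_lastModifiedTime"]) for r in rows),
        "last_processed": max(parse_utc(r["lastModifiedTime"]) for r in rows),
    }


summary = summarise_load_dates(records)
for name, value in summary.items():
    print(name, value.isoformat())
```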

cha801p commented 10 months ago

Last checked - lastModifiedTime (this could be obtained from https://biocache.ala.org.au/ws/occurrences/{UUID})

Data currency - when the full SOLR index runs (this could be obtained from prod DWC imports from s3)

  1. Run a separate DAG or add a step to FULL-index to obtain the above dates
  2. Pass these dates to collectory
  3. Populate these dates on collectory
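
Step 2 above could be sketched as building a small JSON body for the collectory update. This is only an illustration: the key names `lastChecked` and `dataCurrency`, the timestamp format, and the helper `build_collectory_update` are all assumptions, and the HTTP request itself is left out because the endpoint and authentication are not described in this thread.

```python
import json
from datetime import datetime, timezone


def build_collectory_update(last_checked: datetime, data_currency: datetime) -> str:
    """Serialise the two dates as a JSON body for a collectory update.

    The key names and the timestamp format are assumptions, not a confirmed API.
    """
    def fmt(value: datetime) -> str:
        # Refuse naive datetimes so local time is never mistaken for UTC.
        if value.tzinfo is None:
            raise ValueError("expected a timezone-aware datetime")
        return value.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

    return json.dumps({
        "lastChecked": fmt(last_checked),
        "dataCurrency": fmt(data_currency),
    })


body = build_collectory_update(
    last_checked=datetime(2023, 6, 1, 8, 0, tzinfo=timezone.utc),
    data_currency=datetime(2023, 5, 20, 2, 10, tzinfo=timezone.utc),
)
print(body)
```
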
cha801p commented 10 months ago

1. Last Checked:

2. Data Currency:

peggynewman commented 10 months ago

| Field Name | Human Definition | Query |
| --- | --- | --- |
| Last Checked | the last date that a process checked that data was available, whether there is a new dataset to load or not | 1. timestamp on DwCA in dwca-imports/; 2. Preingestion to set this date on Biocollect/OBIS |
| Data Currency | the date a dataset was delivered and ready to load | DwCA timestamp in dwca-imports/ |
| First Loaded | the first date this data resource was loaded into biocache from a dwca | biocache min(first load date) |
| Last Loaded | the last date this data resource was loaded into biocache from a dwca (ie verbatim AVRO stage) | biocache max(raw_lastModifiedTime) or max(lastLoadDate) |
| Last Processed | the last date this data resource was reprocessed in biocache | max(lastProcessedDate) |

cha801p commented 9 months ago

Update: 14 December, 2023 (8 PM)

Issue: Dates and history not reflecting upload dates

Actions Taken:

The modification updates the collectory date through the following method signature: def update_lastchecked(self, date=datetime.datetime.utcnow())

The specific code in preingest_dr.py is as follows: _dr_obj.update_lastchecked(date=datetime.datetime.now())
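
One thing to watch in the signature above: in Python a default such as `date=datetime.datetime.utcnow()` is evaluated once, when the function is defined, so any call that omits `date` would reuse that fixed timestamp. The call in preingest_dr.py passes `date` explicitly, so it would not hit this, but it passes `datetime.datetime.now()`, which is naive local time, while the default is UTC. A minimal sketch of a safer pattern, using a hypothetical `DataResource` class as a stand-in for the real object:

```python
import datetime
from typing import Optional


class DataResource:
    """Hypothetical stand-in for the object that owns update_lastchecked."""

    def __init__(self) -> None:
        self.last_checked: Optional[datetime.datetime] = None

    def update_lastchecked(self, date: Optional[datetime.datetime] = None) -> None:
        # Resolve the default per call; a datetime default in the signature
        # would be frozen at definition time.
        if date is None:
            date = datetime.datetime.now(datetime.timezone.utc)
        self.last_checked = date


dr = DataResource()
dr.update_lastchecked()  # uses the current UTC time at call time
print(dr.last_checked.isoformat())
```

Using a timezone-aware UTC value in both the default and the caller keeps the stored date consistent regardless of where the job runs.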

Next Step: Need to ask Mahmoud/Peggy where to add code for Data Currency

Status: Work Pending