AtlasOfLivingAustralia / data-management

Data management issue tracking

BioCollect datasets are being added without licence information and with duplication #940

Open djtfmartin opened 1 year ago

djtfmartin commented 1 year ago

The Airflow preingestion job that adds BioCollect datasets is not setting the licence in the collectory. The last BioCollect harvest added 211 new datasets, none of which have licences. Example of one recently added: https://collections.ala.org.au/public/show/dr22260

In addition, duplicates are being added:

See recent work on #934 to clean up these.
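For anyone wanting to check a batch of data resources for both problems, here is a rough sketch, not the actual preingestion code. It assumes each data resource is a dict with `uid`, `name` and `licenseType` keys; those field names are an assumption for illustration, and `audit_data_resources` is a hypothetical helper.

```python
from collections import defaultdict


def audit_data_resources(resources):
    """Return (uids with no licence, duplicate-name groups).

    Assumes each item is a dict with hypothetical keys
    'uid', 'name' and 'licenseType'.
    """
    missing_licence = []
    by_name = defaultdict(list)
    for dr in resources:
        # Treat a missing, None or blank licence as "no licence".
        licence = (dr.get("licenseType") or "").strip()
        if not licence:
            missing_licence.append(dr["uid"])
        # Normalise case and whitespace so near-identical names group together.
        key = " ".join((dr.get("name") or "").lower().split())
        by_name[key].append(dr["uid"])
    duplicates = {name: uids for name, uids in by_name.items() if len(uids) > 1}
    return missing_licence, duplicates
```

Grouping on a normalised name is only one possible definition of a duplicate; matching on the originating BioCollect project would likely be more reliable if that identifier is available.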

cc @peggynewman

djtfmartin commented 11 months ago

Is anyone looking into this one?

patkyn commented 11 months ago

Thanks @djtfmartin for flagging these. I'll check with the BioCollect team about the duplicate data resources being created. The data resources are currently created automatically by BioCollect when a project is set up.

The preingestion job currently only harvests the data resources returned by this API: https://ecodata.ala.org.au/ws//record/listHarvestDataResource?max=200&offset=0&sort=asc
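The `max` and `offset` parameters in that URL suggest offset-based paging. A minimal sketch of walking all pages follows, assuming each call returns a plain list of items; the real response shape is not shown in this thread. `harvest_all` and `fetch_page` are hypothetical names, and `fetch_page` stands in for whatever actually performs the HTTP request.

```python
def harvest_all(fetch_page, page_size=200):
    """Collect every item by walking offset-based pages.

    fetch_page(offset, limit) is a caller-supplied (hypothetical) function
    that returns one page as a list, e.g. by calling the endpoint with
    max=limit and offset=offset.
    """
    items = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        items.extend(page)
        # A short (or empty) page means there is nothing left to fetch.
        if len(page) < page_size:
            break
        offset += page_size
    return items
```

Passing the fetch function in keeps the paging logic testable without network access.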

@temi

djtfmartin commented 11 months ago

More duplicates added in the last run:

peggynewman commented 11 months ago

Looks like licenses are provided at record level. @temi these duplicates from BioCollect are still appearing in the collectory. Is this related to https://github.com/AtlasOfLivingAustralia/biocollect/issues/1509 ?

temi commented 11 months ago

Thanks for letting me know @djtfmartin @peggynewman. I thought this was fixed, but unfortunately a silent error is causing this issue. I have a fix for it and will deploy it soon.

temi commented 11 months ago

@peggynewman @pbrenton @patkyn FYI, I have updated the data resource id on 127 projects. This will cause a big release of records from BioCollect on the next ingest. There are around 162,460 at the moment.