MattBlissett closed this issue 6 years ago.
It looks like there are approximately 1575 datasets purporting to be published as DwC-A where no DwC-A is available; these would need to be sourced from HBase data.
This quick analysis covers only undeleted occurrence datasets for which a DwC-A endpoint exists.
Procedure used for this:
From the registry:
SELECT DISTINCT d.key
FROM dataset d
JOIN dataset_endpoint de ON de.dataset_key = d.key
JOIN endpoint e ON de.endpoint_key = e.key
WHERE d.deleted IS NULL AND d.type = 'OCCURRENCE'
  AND e.type = 'DWC_ARCHIVE';
This gives 4916 DwC-A occurrence datasets in total.
Getting directory listings from prodcrawler1-vh.gbif.org:/mnt/auto/crawler/dwca (mounted as Hive table ds_dirnew) and prodcrawler1-vh.gbif.org:/mnt/auto/crawler/dwca (mounted as Hive table ds_dir), then issuing:
SELECT count(DISTINCT t1.datasetkey)
FROM tim.ds_all t1
LEFT JOIN tim.ds_dir t2 ON t1.datasetkey = t2.datasetkey
LEFT JOIN tim.ds_dirnew t3 ON t1.datasetkey = t3.datasetkey
WHERE t2.datasetkey IS NULL OR t3.datasetkey IS NULL;
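The join logic above can be sketched with plain set operations. This is a hypothetical illustration with made-up keys, not the actual Hive data: a dataset counts as "missing" when its key is absent from either directory listing.

```python
# Hypothetical miniature of the LEFT JOIN query above.
registered = {"a", "b", "c", "d"}   # tim.ds_all: all registered dataset keys
on_disk_old = {"a", "b"}            # tim.ds_dir: keys present in the old directory
on_disk_new = {"a", "c"}            # tim.ds_dirnew: keys present in the new directory

# A NULL on either side of the LEFT JOIN means the key was absent there.
missing = {k for k in registered
           if k not in on_disk_old or k not in on_disk_new}

print(len(missing))  # 3 — "b" lacks a new copy, "c" lacks an old one, "d" lacks both
```

Note this counts datasets missing from *either* listing; only keys present in both directories are considered safe.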
Thanks for raising this issue @MattBlissett.
As discussed with @ahahn-gbif and @timrobertson100 while you were away, the plan is to adopt the orphans in two phases:
This wiki page provides a detailed list of all orphan datasets broken down first by phase and then by Node.
As per this issue, we'll check whether the original DwC-A can be salvaged from our disk. This will also allow us to rescue checklists, which cannot be scraped from the occurrence store in the same way as occurrence datasets can.
Where a dataset cannot be salvaged from disk, the rescue script I wrote can be used to scrape the dataset from GBIF.org.
Ultimately, all orphan datasets will be adopted by hosting them in a separate IPT - one for each Node. This will facilitate both their curation and eventual transfer, in cases where the Node prefers to host the datasets themselves. Each Node's list of orphan datasets is available from the above wiki page and can also be retrieved here.
Sounds fine, I caught a few words from Tim just before I left.
Would that mean 25+20 new IPTs? That would be a 10× increase in what we host at present, which sounds like a bit of a hassle regarding user accounts, upgrades and RAM usage.
Quick discussion with Tim on the reasons behind this -- it should be OK, but I'd like to look at using Jetty rather than Tomcat to improve reliability and administration.
For DwC-As, we may still have the original archive that we downloaded.
For example, for the first dataset in the list (99fab784-1bd0-4e41-9039-a9f0f41b63f1) we have the downloaded archive on disk. Sometimes the .dwca file will have been overwritten (e.g. with a 404 page), but we could still have the unpacked archive.
For XML-based crawls, we may have the fragments on disk, and also in HBase: https://api.gbif.org/v1/occurrence/1227962623/fragment
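The fragment for a given occurrence record can be retrieved over the public API at the path shown above. A minimal sketch of building that URL (the helper name is hypothetical; fetching is left to the caller, since, as noted below, not every record has a fragment):

```python
# Base of the public GBIF API, per the example URL above.
API_BASE = "https://api.gbif.org/v1"

def fragment_url(occurrence_key: int) -> str:
    """Build the fragment endpoint URL for one occurrence record."""
    return f"{API_BASE}/occurrence/{occurrence_key}/fragment"

print(fragment_url(1227962623))
# https://api.gbif.org/v1/occurrence/1227962623/fragment
```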
In either case, there may be data we don't put in a download (especially extensions) which should be preserved.
We don't always have this data. I don't know when fragments started to be stored, but the oldest records don't have them.