gbif / watchdog

Project functioning as a watchful guardian of content in the GBIF network, especially against datasets going offline.

Use original crawl data where possible #25

Closed MattBlissett closed 6 years ago

MattBlissett commented 6 years ago

For DwC-As, we may still have the original archive that we downloaded.

For example, for the first dataset in the list, 99fab784-1bd0-4e41-9039-a9f0f41b63f1, we have:

```
$ unzip -l 99fab784-1bd0-4e41-9039-a9f0f41b63f1.dwca
Archive:  99fab784-1bd0-4e41-9039-a9f0f41b63f1.dwca
  Length      Date    Time    Name
---------  ---------- -----   ----
     5555  07-18-2016 15:42   eml.xml
     1254  07-18-2016 15:42   meta.xml
     2711  07-18-2016 15:42   occurrence.txt
---------                     -------
     9520                     3 files
```

Sometimes the .dwca file will have been overwritten (e.g. with a 404 page), but we could have the unpacked archive.
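
A quick way to spot such cases is to check whether each stored file is still a readable archive. This is a minimal sketch, assuming the storage layout shown above (one `<datasetKey>.dwca` file per dataset under /mnt/auto/crawler/dwca):

```python
import zipfile
from pathlib import Path

CRAWL_DIR = Path("/mnt/auto/crawler/dwca")  # crawler DwC-A storage, as mounted on prodcrawler1-vh

def classify(dwca: Path) -> str:
    """Report whether a stored .dwca still looks like a usable Darwin Core Archive."""
    if not zipfile.is_zipfile(dwca):
        # e.g. the file was overwritten with an HTML 404 page
        return "not a zip"
    try:
        with zipfile.ZipFile(dwca) as zf:
            names = zf.namelist()
    except zipfile.BadZipFile:
        return "corrupt zip"
    # meta.xml is optional in DwC-A, but no meta.xml and no data file is suspicious
    if "meta.xml" not in names and "occurrence.txt" not in names:
        return "zip without meta.xml or occurrence.txt"
    return "ok"

for dwca in sorted(CRAWL_DIR.glob("*.dwca")):
    print(dwca.stem, classify(dwca))
```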

For XML-based crawls, we may have the fragments on disk, and also in HBase: https://api.gbif.org/v1/occurrence/1227962623/fragment
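
As an illustration (not part of any existing tooling), pulling a stored fragment back out via that public API endpoint could look like this:

```python
import urllib.error
import urllib.request

def fetch_fragment(occurrence_key):
    """Return the raw crawled fragment for one occurrence, or None if none is stored."""
    url = f"https://api.gbif.org/v1/occurrence/{occurrence_key}/fragment"
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # the oldest records were crawled before fragments were stored
        raise

print(fetch_fragment(1227962623))
```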

In either case, there may be data we don't put in a download (especially extensions) which should be preserved.

We don't always have this data. I don't know when fragments started to be stored, but the oldest records don't have them.

timrobertson100 commented 6 years ago

Looks like there could be approximately 1575 datasets purporting to be published as DwC-A for which no DwC-A is available on disk; these would need to be sourced from the HBase data.

This quick analysis covers only undeleted occurrence datasets for which a DwC-A endpoint exists.

Procedure used for this:

From the registry:

```sql
SELECT DISTINCT d.key
FROM dataset d
JOIN dataset_endpoint de ON de.dataset_key = d.key
JOIN endpoint e ON de.endpoint_key = e.key
WHERE d.deleted IS NULL AND d.type = 'OCCURRENCE'
AND e.type = 'DWC_ARCHIVE'
```

gives 4916 DwC-A occurrence datasets in total.

Getting directory listings from prodcrawler1-vh.gbif.org:/mnt/auto/crawler/dwca (mounted as Hive table ds_dirNew) and prodcrawler1-vh.gbif.org:/mnt/auto/crawler/dwca (mounted as Hive table ds_dir), and issuing:

```sql
SELECT count(DISTINCT t1.datasetkey)
FROM tim.ds_all t1
LEFT JOIN tim.ds_dir t2 ON t1.datasetkey = t2.datasetkey
LEFT JOIN tim.ds_dirnew t3 ON t1.datasetkey = t3.datasetkey
WHERE (t2.datasetkey IS NULL OR t3.datasetkey IS NULL)
```
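
For reference, a rough equivalent of the same check without Hive, assuming the registry query result has been exported to a text file of dataset keys and the crawler directories are mounted locally (file locations are assumptions, and this counts a dataset as missing only when it appears in neither listing):

```python
from pathlib import Path

# Assumed inputs: one dataset key per line from the registry query above,
# plus the crawler DwC-A directory listings mounted locally.
REGISTRY_KEYS = Path("dwca_occurrence_datasets.txt")  # hypothetical export of the 4916 keys
CRAWL_DIRS = [Path("/mnt/auto/crawler/dwca")]         # add the second listing's path here

registry = {line.strip() for line in REGISTRY_KEYS.read_text().splitlines() if line.strip()}

on_disk = set()
for crawl_dir in CRAWL_DIRS:
    # archives are named <datasetKey>.dwca, as in the example above
    on_disk.update(p.stem for p in crawl_dir.glob("*.dwca"))

missing = registry - on_disk
print(f"{len(missing)} of {len(registry)} DwC-A datasets have no archive on disk")
```
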
kbraak commented 6 years ago

Thanks for raising this issue @MattBlissett.

As discussed with @ahahn-gbif and @timrobertson100 while you were away, the plan is to adopt the orphans in two phases:

  1. In phase 1 in 2017, we'll adopt all datasets belonging to Nodes that never replied during the campaign.
  2. In phase 2 in 2018, we'll adopt the rest of the orphan datasets, which belong to Nodes that did reply during the campaign but need more time to investigate.

This wiki page provides a detailed list of all orphan datasets broken down first by phase and then by Node.

As per this issue, we'll check whether the original DwC-A can be salvaged from our disk. This will also allow us to rescue checklists, which cannot be scraped from the occurrence store in the same way as occurrence datasets can.

Where a dataset cannot be salvaged from disk, the rescue script I wrote can be used to scrape the dataset from GBIF.org.
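
That script isn't reproduced here, but the idea is paging through the public occurrence search API for a single dataset key; a minimal sketch under that assumption (the real script may select different fields and write a proper archive):

```python
import json
import urllib.request

def scrape_dataset(dataset_key, page_size=300):
    """Yield interpreted occurrence records for a dataset from the GBIF search API."""
    offset = 0
    while True:
        url = ("https://api.gbif.org/v1/occurrence/search"
               f"?datasetKey={dataset_key}&limit={page_size}&offset={offset}")
        with urllib.request.urlopen(url) as resp:
            page = json.load(resp)
        for record in page["results"]:
            yield record
        if page["endOfRecords"]:
            break
        offset += page_size  # note: the search API caps deep paging; very large datasets need the download API

for rec in scrape_dataset("99fab784-1bd0-4e41-9039-a9f0f41b63f1"):
    print(rec.get("occurrenceID"), rec.get("scientificName"))
```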

Ultimately, all orphan datasets will be adopted by hosting them in a separate IPT, one for each Node. This will facilitate both their curation and their eventual transfer, in cases where the Node prefers to host the datasets themselves. Each Node's list of orphan datasets is available from the above wiki page and can also be retrieved here.

MattBlissett commented 6 years ago

Sounds fine; I caught a few words from Tim just before I left.

Would that mean 25+20 new IPTs? That would be a 10× increase over what we host at present, which sounds like a bit of a hassle in terms of user accounts, upgrades and RAM usage.

MattBlissett commented 6 years ago

Quick discussion with Tim on the reasons behind this: it should be OK, but I'd like to look at using Jetty rather than Tomcat to improve reliability and administration.