gbif / crawler

The crawling pieces - ws, cli, coordinator
Apache License 2.0

BioCASE dataset has same crawl ID spread across different dates #28

Closed jlegind closed 4 years ago

jlegind commented 5 years ago

The German dataset Flora von Deutschland (Phanerogamen) https://www.gbif.org/dataset/e6fab7b3-c733-40b9-8df3-2a03e49532c1 has the crawlID 87 attached to 12 different crawl dates from 2016 to 2019:

[Screenshot: flora]

I don't know what's going on. Help appreciated.

HiveQL:

SELECT to_date(from_unixtime(cast(fragmentcreated/1000 as INT))), crawlid, count(*)
FROM occurrence_hdfs
WHERE datasetkey = 'e6fab7b3-c733-40b9-8df3-2a03e49532c1'
GROUP BY to_date(from_unixtime(cast(fragmentcreated/1000 as INT))), crawlid

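
For readers less familiar with HiveQL, here is a minimal Python sketch of what the query above computes: it converts `fragmentcreated` (epoch milliseconds) to a calendar date and counts fragments per (date, crawl ID) pair. The function names and the sample rows are hypothetical and only for illustration; they are not real data from the dataset. The sketch also assumes UTC, whereas Hive's `from_unixtime` uses the session time zone, so dates near midnight may differ.

```python
from collections import Counter
from datetime import datetime, timezone


def crawl_dates(rows):
    """Count fragments per (UTC date, crawl ID), mirroring the GROUP BY above."""
    counts = Counter()
    for fragment_created_ms, crawl_id in rows:
        # fragmentcreated is in milliseconds; integer-divide to get seconds.
        seconds = fragment_created_ms // 1000
        day = datetime.fromtimestamp(seconds, tz=timezone.utc).date()
        counts[(day.isoformat(), crawl_id)] += 1
    return counts


def dates_per_crawl(counts):
    """Collect the distinct dates seen for each crawl ID."""
    result = {}
    for day, crawl_id in counts:
        result.setdefault(crawl_id, set()).add(day)
    return result


# Hypothetical sample rows: (fragmentcreated in ms, crawlid)
sample = [
    (1464739200000, 87),  # 2016-06-01 00:00:00 UTC
    (1464742800000, 87),  # 2016-06-01 01:00:00 UTC
    (1546300800000, 87),  # 2019-01-01 00:00:00 UTC
]

if __name__ == "__main__":
    for (day, crawl_id), n in sorted(crawl_dates(sample).items()):
        print(day, crawl_id, n)
```

A crawl ID that maps to more than one date in `dates_per_crawl` is the pattern reported in this issue.
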
jlegind commented 5 years ago

Here is another example, with the dataset "Collections and observation data National Museum of Natural History Luxembourg":

SELECT to_date(from_unixtime(cast(fragmentcreated/1000 as INT))), crawlid, count(*)
FROM occurrence_hdfs
WHERE datasetkey = '962f59bc-f762-11e1-a439-00145eb45e9a'
GROUP BY to_date(from_unixtime(cast(fragmentcreated/1000 as INT))), crawlid

MattBlissett commented 4 years ago

With pipelines, this is no longer a relevant issue.