gbif / occurrence

Occurrence store, download, search
Apache License 2.0

Processed eventDate and dateIdentified incomprehensible #339

Closed: dshorthouse closed this issue 9 months ago

dshorthouse commented 9 months ago

I am aware that heroics were recently applied to handling eventDate and dateIdentified, but am unsure if the present issue for Bionomia is an unaccommodated outcome or some other issue.

The hive script for a Bionomia download is at https://github.com/gbif/occurrence/blob/dev/occurrence-download/src/main/resources/download-workflow/bionomia/hive-scripts/execute-bionomia-query.q#L39. Note the highlighted toLocalISO8601 local temp function, evidently being pulled from 'org.gbif.occurrence.hive.udf.ToISO8601UDF', and applied to eventDate and dateIdentified.

The dates that appear in a Bionomia download through the above are completely off the mark. For instance, the 1850 raw eventDate on https://www.gbif.org/occurrence/2435994636 comes through processed as 1969-11-18T04:06:14.4.

Have I missed something critical @MattBlissett?
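An arithmetic note on the example above: 1969-11-18T04:06:14.4 is exactly the value obtained when the epoch-seconds count for 1850-01-01T00:00:00 UTC is read as epoch milliseconds. The thread does not state the cause, so treat this only as a reading consistent with the reported number. The sketch below checks the arithmetic; the class and method names are hypothetical and are not taken from `ToISO8601UDF`.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

// Hypothetical illustration; not code from the GBIF repository.
final class EpochUnitCheck {

    // Seconds since 1970-01-01T00:00:00Z for midnight on 1 January of the given year (UTC).
    static long epochSecondsAtYearStart(int year) {
        return LocalDateTime.of(year, 1, 1, 0, 0).toEpochSecond(ZoneOffset.UTC);
    }

    // Deliberately treats an epoch-seconds value as if it were epoch milliseconds.
    static LocalDateTime misreadSecondsAsMillis(long epochSeconds) {
        return LocalDateTime.ofInstant(Instant.ofEpochMilli(epochSeconds), ZoneOffset.UTC);
    }

    public static void main(String[] args) {
        long seconds = epochSecondsAtYearStart(1850); // -3786825600
        // Java prints the fractional second as ".400", i.e. the ".4" seen in the download.
        System.out.println(seconds + " -> " + misreadSecondsAsMillis(seconds));
    }
}
```

Dividing -3786825600 by 1000 gives about -43.83 days, which lands on 18 November 1969 at 04:06:14.4.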

MattBlissett commented 9 months ago

Apologies, I can see that isn't correct.

Would you like the new eventDate value supporting ranges (i.e. 1850) or the previous behaviour of the lower bound of the range (1850-01-01T00:00:00)?

dshorthouse commented 9 months ago

Thanks for taking a look. The previous behaviour, using the lower bound of the range, is probably the safest bet.
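To make the "lower bound of the range" option concrete, here is a minimal sketch that maps a partial date to its earliest covered date-time. It assumes the value is a bare year, a year-month, or a full date; it does not handle explicit interval syntax, and `LowerBound` is a hypothetical helper, not GBIF's interpretation code.

```java
import java.time.LocalDate;
import java.time.Year;
import java.time.YearMonth;
import java.time.format.DateTimeFormatter;

// Hypothetical helper; GBIF's own interpretation code is not shown in this thread.
final class LowerBound {

    // Returns the earliest date-time covered by a year ("1850"), a year-month ("1850-05")
    // or a full date ("1850-05-17"), formatted as a local ISO 8601 date-time.
    static String lowerBound(String eventDate) {
        LocalDate first;
        switch (eventDate.length()) {
            case 4:
                first = Year.parse(eventDate).atDay(1);      // 1 January of that year
                break;
            case 7:
                first = YearMonth.parse(eventDate).atDay(1); // first day of that month
                break;
            default:
                first = LocalDate.parse(eventDate);          // already a single day
        }
        return first.atStartOfDay().format(DateTimeFormatter.ISO_LOCAL_DATE_TIME);
    }

    public static void main(String[] args) {
        System.out.println(lowerBound("1850")); // prints 1850-01-01T00:00:00
    }
}
```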

MattBlissett commented 9 months ago

The fix for this is now deployed, thanks for reporting it.

dshorthouse commented 9 months ago

Awesome! Thanks for this.

dshorthouse commented 9 months ago

Apologies for another hiccup. A recent download from GBIF, https://doi.org/10.15468/dl.bv7t8t, has produced values like '19634-02-24 05:00:00' in the processed eventDate generated here: https://github.com/gbif/occurrence/blob/dev/occurrence-download/src/main/resources/download-workflow/bionomia/hive-scripts/execute-bionomia-query.q#L39. While technically a correctly formatted date, it causes grief downstream.

MattBlissett commented 9 months ago

I see these for dateIdentified, but not eventDate. If you have an example with eventDate > 2024, I'd like the gbifID.

Changing this in the interpretation isn't a quick fix, and I'm not sure what the correct fix is — the records do have the appropriate issue flag.

dshorthouse commented 9 months ago

Thanks, @MattBlissett. I knew I should have grabbed a gbifID and my apologies for not having done so. I've implemented an inelegant work-around at my end in my Scala/Spark SQL script:

withColumn("eventDate_processed", when(to_timestamp($"eventDate_processed").lt(current_timestamp()), to_date(to_timestamp($"eventDate_processed"), "yyyy-MM-dd")).otherwise(null))

...but the next time I work through a download in a few weeks, I'll revert to what I had so I can get you a few gbifIDs.
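The same idea as the work-around above, dropping any processed date that lies in the future, can be sketched outside Spark. The version below is plain Java rather than the Spark job from the comment, and it additionally rejects anything that does not start with a four-digit year, which covers values such as '19634-02-24 05:00:00'. `PlausibleDate` and its method are hypothetical names.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical downstream guard; plain Java, not the Spark job from the comment above.
final class PlausibleDate {

    // A four-digit-year date, optionally followed by a time part after 'T' or a space.
    private static final Pattern DATE_PREFIX =
        Pattern.compile("(\\d{4}-\\d{2}-\\d{2})(?:[T ].*)?");

    // Keeps the date part only if it is a real calendar date no later than 'today'.
    static Optional<LocalDate> plausibleDate(String processed, LocalDate today) {
        if (processed == null) {
            return Optional.empty();
        }
        Matcher m = DATE_PREFIX.matcher(processed);
        if (!m.matches()) {
            return Optional.empty(); // e.g. "19634-02-24 05:00:00"
        }
        try {
            LocalDate date = LocalDate.parse(m.group(1));
            return date.isAfter(today) ? Optional.empty() : Optional.of(date);
        } catch (DateTimeParseException e) {
            return Optional.empty(); // e.g. "1850-02-30"
        }
    }
}
```

Passing `today` in explicitly, instead of calling `LocalDate.now()` inside the method, keeps the check repeatable when a download is reprocessed later.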

MattBlissett commented 9 months ago

I had an even more inelegant shell script:

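# Print the gbifID and eventDate of records whose eventDate looks unusual.
for i in occurrence.avro/00*; do
  echo "$i"
  java -jar ~/avro-tools-1.10.2.jar tojson "$i" 2> /dev/null | jq -c '. | {id: .gbifID, eventDate: .eventDate.string}' | \
  # Drop ordinary-looking timestamps and nulls, leaving only the odd values.
  grep -v '"[12][567890]..-[01].-[0123].T..:..:.."' | grep -v 'null'
done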

I found some with dateIdentified (https://www.gbif.org/occurrence/2997498045, 1978874621, 872723035) so there's no need to do that, but I didn't notice any with eventDate.