Closed dshorthouse closed 9 months ago
Apologies, I can see that isn't correct.
Would you like the new eventDate value supporting ranges (i.e. 1850) or the previous behaviour of the lower bound of the range (1850-01-01T00:00:00)?
Thanks for taking a look. The previous behaviour with the lower bound range is probably the safest bet.
The fix for this is now deployed, thanks for reporting it.
Awesome! Thanks for this.
Apologies for another hiccup. A recent download from GBIF, https://doi.org/10.15468/dl.bv7t8t, has produced values like '19634-02-24 05:00:00' in the processed eventDate found here: https://github.com/gbif/occurrence/blob/dev/occurrence-download/src/main/resources/download-workflow/bionomia/hive-scripts/execute-bionomia-query.q#L39. While technically a correctly formatted date, it causes grief downstream.
I see these for dateIdentified, but not eventDate. If you have an example with eventDate > 2024, I'd like the gbifID.
Changing this in the interpretation isn't a quick fix, and I'm not sure what the correct fix is — the records do have the appropriate issue flag.
Thanks, @MattBlissett. I knew I should have grabbed a gbifID, and my apologies for not having done so. I've implemented an inelegant work-around at my end in my Scala/Spark SQL script:
withColumn("eventDate_processed", when(to_timestamp($"eventDate_processed").lt(current_timestamp()), to_date(to_timestamp($"eventDate_processed"), "yyyy-MM-dd")).otherwise(null))
...but the next time I work through a download in a few weeks, I'll revert to what I had so I can get you a few gbifIDs.
I had an even more inelegant shell script:
for i in occurrence.avro/00*; do
  echo "$i"
  java -jar ~/avro-tools-1.10.2.jar tojson "$i" 2> /dev/null | jq -c '. | {id: .gbifID, eventDate: .eventDate.string}' | \
    grep -v '"[12][567890]..-[01].-[0123].T..:..:.."' | grep -v 'null'
done
I found some with dateIdentified (https://www.gbif.org/occurrence/2997498045, 1978874621, 872723035) so there's no need to do that, but I didn't notice any with eventDate.
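To make it easier to see what the two grep filters in the shell loop above keep and drop, here is a minimal sketch that wraps them in a function and runs them on three made-up lines. The function name suspect_dates, the ids, and the sample values are all hypothetical, and the exact shape of eventDate in the Avro output is an assumption.

```shell
# Same two grep filters as in the loop above, wrapped in a function
# (hypothetical name) that reads JSON lines on stdin and prints only
# those whose eventDate does not look like a plausible ISO date-time.
suspect_dates() {
  grep -v '"[12][567890]..-[01].-[0123].T..:..:.."' | grep -v 'null'
}

# Made-up sample lines in the shape produced by the jq step (assumption).
sample='{"id":"1","eventDate":"1850-01-01T00:00:00"}
{"id":"2","eventDate":"19634-02-24T05:00:00"}
{"id":"3","eventDate":null}'

# Only the line with the five-digit year survives both filters.
printf '%s\n' "$sample" | suspect_dates
```

The first filter drops anything with a four-digit year starting 10, 15 to 19, 20, or 25 to 29 in the expected layout, and the second drops records with no date at all, so whatever is left is worth a closer look.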
I am aware that heroics were recently applied to handling eventDate and dateIdentified, but am unsure if the present issue for Bionomia is an unaccommodated outcome or some other issue.
The Hive script for a Bionomia download is at https://github.com/gbif/occurrence/blob/dev/occurrence-download/src/main/resources/download-workflow/bionomia/hive-scripts/execute-bionomia-query.q#L39. Note the highlighted toLocalISO8601 local temp function, evidently being pulled from 'org.gbif.occurrence.hive.udf.ToISO8601UDF', and applied to eventDate and dateIdentified.
The dates that appear in a Bionomia download through the above are completely off the mark. For instance, the 1850 raw eventDate on https://www.gbif.org/occurrence/2435994636 comes through processed as 1969-11-18T04:06:14.4.
Have I missed something critical, @MattBlissett?
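One arithmetic check that may help narrow this down: the sketch below works out the epoch-seconds value for 1850-01-01T00:00:00Z and then shows where that same number lands if it is read as milliseconds instead. The seconds-versus-milliseconds mix-up is purely a hypothesis for illustration; nothing here inspects the Hive script or the UDF, and the variable and function names are made up.

```shell
# Epoch seconds for 1850-01-01T00:00:00Z, from first principles.
# 1850..1969 is 120 years with 29 leap days (1852, 1856, ..., 1968, skipping 1900).
days=$(( 120 * 365 + 29 ))
epoch_seconds=$(( 0 - days * 86400 ))

# Hypothetical mix-up: treat that same number as milliseconds before the epoch.
ms_before_epoch=$(( 0 - epoch_seconds ))
days_back=$(( ms_before_epoch / 86400000 + 1 ))          # calendar days before 1970-01-01
ms_of_day=$(( 86400000 - ms_before_epoch % 86400000 ))   # time of day on that date, in ms

# Format the time of day as HH:MM:SS.mmm.
misread_time() {
  printf '%02d:%02d:%02d.%03d\n' \
    $(( ms_of_day / 3600000 )) \
    $(( ms_of_day / 60000 % 60 )) \
    $(( ms_of_day / 1000 % 60 )) \
    $(( ms_of_day % 1000 ))
}

echo "epoch seconds: $epoch_seconds"
echo "misread as ms: $days_back days before 1970-01-01 at $(misread_time) UTC"
```

This prints -3786825600 for the epoch seconds, and 44 days before 1970-01-01 at 04:06:14.400 UTC for the misread value. Forty-four days before 1970-01-01 is 1969-11-18, which coincides with the processed value reported above. That is consistent with a unit mismatch somewhere in the pipeline, but it does not show where, or whether, one actually occurs.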