Open FuhuXia opened 1 year ago
This continues to be an issue for all NOAA sources.
Another negative impact of this issue on our system is that it bloats our DB size.
This is also an issue for most (if not all) of the census-gov harvest sources. The census sources hosted on https://meta.geo.census.gov/ use an unknown type of web server that displays no timestamp for WAF file items.
WAF harvesting relies on the file timestamp for preliminary content change detection. If the file timestamp is newer than what is in the DB, the harvester processes the XML and reads `<metd>` (FGDC) or `<gmd:dateStamp>` (ISO) for further content change detection. If a large source refreshes all of its files' timestamps on a regular basis, such as this one, or if the WAF server is of an unknown type and its files have no timestamp, such as this one, it creates a lot of unnecessary workload.

Other than making harvest jobs run longer, it is still not clear how severely this affects harvesting in other ways. Might it cause jobs to get stuck?
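To make the two-stage check concrete, here is a minimal sketch of the logic as described above. It is not the harvester's actual code: the function names `needs_fetch` and `metadata_date` are hypothetical, and treating a missing timestamp as "must process" is an assumption based on the behavior reported in this issue.

```python
from datetime import datetime
from typing import Optional
import xml.etree.ElementTree as ET

# Standard ISO 19139 gmd namespace.
GMD = "http://www.isotc211.org/2005/gmd"


def needs_fetch(waf_modified: Optional[datetime],
                db_modified: Optional[datetime]) -> bool:
    """Stage 1 (hypothetical helper): cheap check on the WAF listing timestamp.

    Assumption: when either timestamp is missing, the file cannot be ruled
    out and has to be downloaded and parsed.
    """
    if waf_modified is None or db_modified is None:
        return True
    return waf_modified > db_modified


def metadata_date(xml_text: str) -> Optional[str]:
    """Stage 2 (hypothetical helper): read the metadata date from the XML.

    Returns the text of <metd> for FGDC, or of <gmd:dateStamp> for ISO,
    or None if neither element is present.
    """
    root = ET.fromstring(xml_text)

    metd = root.find(".//metd")
    if metd is not None and metd.text:
        return metd.text.strip()

    stamp = root.find(f".//{{{GMD}}}dateStamp")
    if stamp is not None:
        # The date sits in a child element such as gco:Date or gco:DateTime.
        text = "".join(stamp.itertext()).strip()
        return text or None

    return None
```

With this shape, a source that refreshes every timestamp, or exposes none, makes `needs_fetch` return `True` for every file, so every document reaches the more expensive stage 2 even though nothing changed.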
Looking at the fetch log, a job runs for 16 hours, constantly repeating the message
Document with GUID ### unchanged, skipping...
How to reproduce
See fetch logs for source https://catalog.data.gov/harvest/ngdc-paleo
Expected behavior
Take seconds to process unchanged WAF
Actual behavior
Take hours
Sketch
Ask the WAF source maintainers not to update file timestamps when the content has not changed. Or improve change detection on our side, similar to what we do for datajson sources.
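One possible shape for the improvement on our side is to compare a hash of the file content against a stored hash. This is only a sketch. It assumes that "similar to datajson" means hash-based comparison, and the names `content_hash` and `is_changed` are hypothetical, not existing harvester functions.

```python
import hashlib
from typing import Optional


def content_hash(xml_bytes: bytes) -> str:
    """Return a SHA-256 hex digest of the raw document bytes."""
    return hashlib.sha256(xml_bytes).hexdigest()


def is_changed(xml_bytes: bytes, stored_hash: Optional[str]) -> bool:
    """Hypothetical check: treat the document as changed only when its hash
    differs from the one saved at the last harvest (or none was saved)."""
    return stored_hash is None or content_hash(xml_bytes) != stored_hash
```

This still requires downloading each file, so it would cut XML processing and DB writes rather than fetch time.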