GSA / data.gov

Main repository for the data.gov service
https://data.gov

WAF file change detection #4425

Open FuhuXia opened 1 year ago

FuhuXia commented 1 year ago

WAF harvesting relies on the file timestamp for preliminary content change detection. If the file timestamp is newer than what is in the DB, the harvester processes the XML and reads <metd> (FGDC) or <gmd:dateStamp> (ISO) for further content change detection. If a large source refreshes all files' timestamps on a regular basis, such as this one, or if the WAF server is of an unknown type and its files show no timestamp, such as this one, this creates a lot of unnecessary workload.
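To make the two-stage check described above concrete, here is a minimal sketch in Python. It is not the harvester's actual code: the function names (`should_parse`, `metadata_date`, `content_changed`) are hypothetical, and treating a missing timestamp as "always parse" is an assumption based on the behavior reported in this issue.

```python
from datetime import datetime
from typing import Optional
import xml.etree.ElementTree as ET

# Standard ISO 19139 gmd namespace.
GMD = "http://www.isotc211.org/2005/gmd"


def should_parse(file_mtime: Optional[datetime],
                 stored_mtime: Optional[datetime]) -> bool:
    """Stage 1: cheap timestamp check (hypothetical helper)."""
    # Assumption: with no timestamp on either side, the file must be parsed.
    if file_mtime is None or stored_mtime is None:
        return True
    return file_mtime > stored_mtime


def metadata_date(xml_text: str) -> Optional[str]:
    """Return the text of FGDC <metd> or ISO <gmd:dateStamp>, if present."""
    root = ET.fromstring(xml_text)
    metd = root.find(".//metd")
    if metd is not None and metd.text:
        return metd.text.strip()
    stamp = root.find(f".//{{{GMD}}}dateStamp")
    if stamp is not None:
        # dateStamp wraps a gco:Date or gco:DateTime child; join its text.
        text = "".join(stamp.itertext()).strip()
        return text or None
    return None


def content_changed(xml_text: str, stored_date: Optional[str]) -> bool:
    """Stage 2: compare the metadata date inside the document."""
    return metadata_date(xml_text) != stored_date
```

When every timestamp is refreshed or missing, `should_parse` returns `True` for every file, so stage 2 runs on the whole source even though nothing changed.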

Other than making harvest jobs run longer, it is still not clear how severely this affects harvesting in other ways. Maybe it gets jobs stuck?

Looking at the fetch log, a job runs for 16 hours with the constant error message Document with GUID ### unchanged, skipping...


How to reproduce

See fetch logs for source https://catalog.data.gov/harvest/ngdc-paleo

Expected behavior

Takes seconds to process an unchanged WAF

Actual behavior

Takes hours

Sketch

Ask the WAF source maintainer not to update file timestamps when the content has not changed, or improve change detection on our side, similar to the datajson source change detection.
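One possible shape for the "improve on our side" option is to compare a hash of each file's content instead of its timestamp. The sketch below assumes that this is the kind of detection meant by the datajson comparison, which is not confirmed in this thread. The names `content_hash`, `has_changed`, and `stored_hashes` are hypothetical, and a plain dict stands in for wherever the hashes would be stored.

```python
import hashlib
from typing import Dict


def content_hash(document: bytes) -> str:
    """Hex SHA-256 digest of the raw document bytes."""
    return hashlib.sha256(document).hexdigest()


def has_changed(url: str, document: bytes,
                stored_hashes: Dict[str, str]) -> bool:
    """Return True and record the new hash only when the content differs."""
    new_hash = content_hash(document)
    if stored_hashes.get(url) == new_hash:
        # Same bytes as last run: the timestamp is irrelevant, skip it.
        return False
    stored_hashes[url] = new_hash
    return True
```

This still needs one download per file, so it would not make an unchanged WAF finish in seconds by itself, but it would avoid XML parsing for files whose bytes are identical to the previous run.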

FuhuXia commented 11 months ago

This continues to be an issue for all NOAA sources.

Another negative impact of this issue on our system is that it bloats our DB.

FuhuXia commented 11 months ago

This is also an issue for most (if not all) census-gov harvest sources. Census sources hosted on https://meta.geo.census.gov/ use an unknown type of web server that displays no timestamp for WAF file items.