GSA / data.gov

Main repository for the data.gov service
https://data.gov

NOAA harvest job stuck because of large file size #4965

Open · rshewitt opened this issue 2 weeks ago

rshewitt commented 2 weeks ago

noaa-nesdis-ncei-accessions has some datasets that cause an out-of-memory error in catalog-fetch (i.e. the log message is "Killed"). Related to 1487. Here's a dataset that managed to be created after increasing catalog-fetch memory, but because of its size the server responds with a 500 in the UI.

How to reproduce

  1. Harvest the noaa-nesdis-ncei-accessions source.

Expected behavior

The job completes without timing out.

Actual behavior

The job gets stuck and times out after the 72-hour limit.

Sketch

[Notes or a checklist reflecting our understanding of the selected approach]

FuhuXia commented 2 weeks ago

For this particular case, the stuck job is directly related to the enormous number of tags (keywords) in some XML records, 34,866 to be exact for the sampled one. The large file size is also caused by all those tags. If we set a maximum number of tags allowed, we can reject this kind of nonsense record instead of letting the job get stuck.

Rejecting records based on file size may be too broad.
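A minimal sketch of the kind of guard described above, assuming the harvester exposes each record's keywords as a Python list before the record is written to the catalog. The names `MAX_TAGS`, `has_too_many_tags`, and `reject_oversized_records`, and the limit value, are illustrative, not part of the existing catalog-fetch code:

```python
# Illustrative sketch only: MAX_TAGS and the helper names are hypothetical,
# not the actual catalog-fetch implementation.
MAX_TAGS = 3000  # deliberately high ceiling; the sampled record had 34,866 tags


def has_too_many_tags(record: dict, limit: int = MAX_TAGS) -> bool:
    """Return True if a harvested record carries more tags/keywords than allowed."""
    return len(record.get("tags", [])) > limit


def reject_oversized_records(records: list[dict], limit: int = MAX_TAGS) -> list[dict]:
    """Drop records that exceed the tag limit instead of letting them exhaust memory."""
    kept = []
    for record in records:
        if has_too_many_tags(record, limit):
            # Skipping the record keeps the harvest job moving rather than
            # letting one pathological dataset stall or OOM the whole run.
            print(f"rejecting {record.get('identifier', '<unknown>')}: "
                  f"{len(record['tags'])} tags exceeds limit of {limit}")
            continue
        kept.append(record)
    return kept
```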


btylerburton commented 2 weeks ago

To Fuhu's point, we should set a reasonably high limit for each field, publicize it somewhere, and then hard-fail any dataset that exceeds that limit. In H2.0 we can even throw custom errors to highlight this.
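One way the per-field limit plus a custom error could look, sketched under the assumption that validation runs on a plain dict per dataset; the `FieldLimitExceeded` class, the field names, and the specific ceilings are made up for illustration and would need to be agreed on if the limits are publicized:

```python
# Hypothetical per-field ceilings for list-valued fields.
FIELD_LIMITS = {
    "tags": 3000,
    "resources": 1500,
}


class FieldLimitExceeded(Exception):
    """Custom error surfacing which field blew past its limit and by how much."""

    def __init__(self, field: str, count: int, limit: int):
        super().__init__(f"field '{field}' has {count} entries, limit is {limit}")
        self.field, self.count, self.limit = field, count, limit


def validate_field_limits(dataset: dict, limits: dict = FIELD_LIMITS) -> None:
    """Hard-fail a dataset as soon as any list-valued field exceeds its ceiling."""
    for field, limit in limits.items():
        count = len(dataset.get(field, []))
        if count > limit:
            raise FieldLimitExceeded(field, count, limit)
```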

FuhuXia commented 2 weeks ago

If we set the limit ridiculously high, say 3,000, maybe we can get away without publicizing it, because it will be really rare for any record to reach it. And when one does, people will know why the dataset failed to harvest, because the record is absurd. Who would create a dataset with 1,500 resources or 3,000 keywords?
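For scale, a quick usage sketch against the numbers mentioned in this thread (34,866 keywords in the sampled record versus a hypothetical ceiling of 3,000), reusing the illustrative validate_field_limits() above, shows how rarely a legitimate record would trip the limit:

```python
# Usage sketch with made-up sample records; only the tag counts come from the thread.
sampled = {"title": "sampled NOAA accession", "tags": ["kw"] * 34866}
typical = {"title": "ordinary dataset", "tags": ["ocean", "temperature", "noaa"]}

try:
    validate_field_limits(sampled)
except FieldLimitExceeded as err:
    print(err)  # field 'tags' has 34866 entries, limit is 3000

validate_field_limits(typical)  # passes silently; nowhere near the limit
```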