rshewitt opened this issue 2 weeks ago
For this particular case, the stuck job is directly related to the huge number of tags (keywords) in some XML records, 34,866 to be exact for the sampled one. The large file size is also a result of the number of tags. If we set a maximum limit on the number of tags allowed, we can reject these nonsensical records instead of letting the job get stuck.
Rejecting records based on file size may be too broad.
To Fuhu's point, we should set a reasonably high limit for each field, publicize it somewhere, and then hard fail the datasets when they exceed that limit. In H2.0 we can even throw custom errors to highlight this.
If we set the limit ridiculously high, say 3,000, we may be able to get away without publicizing it, because it will be very rare for any record to reach it. And when one does, it will be obvious why the dataset failed to harvest. Who would create a dataset with 1,500 resources or 3,000 keywords?
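A minimal sketch of what such a per-field limit check could look like; the constants, the RecordLimitExceeded error, and the validate_record helper are hypothetical and not part of the current harvester.

```python
# Hypothetical per-field limit check; names and thresholds are illustrative.
MAX_KEYWORDS = 3000
MAX_RESOURCES = 1500


class RecordLimitExceeded(Exception):
    """Raised when a harvested record exceeds a per-field limit."""


def validate_record(record: dict) -> None:
    """Hard-fail a record whose fields exceed the configured limits."""
    keywords = record.get("keywords", [])
    resources = record.get("resources", [])
    if len(keywords) > MAX_KEYWORDS:
        raise RecordLimitExceeded(
            f"record has {len(keywords)} keywords, limit is {MAX_KEYWORDS}"
        )
    if len(resources) > MAX_RESOURCES:
        raise RecordLimitExceeded(
            f"record has {len(resources)} resources, limit is {MAX_RESOURCES}"
        )
```

In H2.0 a custom error like this could surface the reason directly in the job log instead of a silent hang.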
noaa-nesdis-ncei-accessions has some datasets which cause an out-of-memory error in catalog-fetch (i.e. the log message is "Killed"). Related to 1487. Here's a dataset which managed to be created after increasing catalog-fetch memory, but because of its size the server responds with a 500 in the UI.
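If the limit check runs before the record is fully parsed, a streaming pass can count tags without holding the whole document in memory, which would also avoid the out-of-memory kills seen in catalog-fetch. A rough sketch under assumptions: the element local name ("keyword") and the count_keywords helper are illustrative, not taken from the existing code.

```python
import xml.etree.ElementTree as ET

# Illustrative cutoff, matching the limit discussed above.
MAX_KEYWORDS = 3000


def count_keywords(xml_path: str, limit: int = MAX_KEYWORDS) -> int:
    """Stream through the record and stop counting once the limit is exceeded."""
    count = 0
    for _, elem in ET.iterparse(xml_path, events=("end",)):
        # Compare the local name so namespaced tags like gmd:keyword also match.
        if elem.tag.rsplit("}", 1)[-1] == "keyword":
            count += 1
            if count > limit:
                break
        elem.clear()  # free memory for elements already processed
    return count
```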
How to reproduce
Run a catalog-fetch job against the noaa-nesdis-ncei-accessions source, which contains records with tens of thousands of keywords.
Expected behavior
The job completes without timing out.
Actual behavior
The job gets stuck and times out after the 72-hour limit.
Sketch
- [ ] Set a reasonably high limit for each field (e.g. 3,000 keywords, 1,500 resources).
- [ ] Hard-fail records that exceed the limit instead of letting the job hang.
- [ ] In H2.0, throw a custom error so the failure reason is visible.
- [ ] Decide whether the limits need to be publicized.