Open matkoniecz opened 4 months ago
currently the .zip has a single "output" folder. I propose to also include an output_metadata folder
There is a field in the output JSON called something like dataset_properties
that has information about the spider. I could see adding some of this data to that output.
I'm still of the opinion that a data consumer shouldn't care about this though. Maybe the right thing to do is to offer a "latest" endpoint that gives out the most recent successful output for any particular spider?
I'm still of the opinion that a data consumer shouldn't care about this though.
maybe I got too invested in improving ATP, but it would be nice to be notified that a spider started to fail and investigate whether I can help with it - and listing all failed spiders is not useful here, as many failures are caused by proxy issues
it would be nice to be notified that spider started to fail
Yea, this would be great! I used to have the weekly run create/reopen a GitHub ticket when a spider failed for whatever reason. That got really noisy. Can you think of a better place to put this kind of thing?
when a spider failed for whatever reason
that is definitely too noisy due to how many spiders fail on proxy problems
Can you think of a better place to put this kind of thing?
Some dashboard on the website that would allow sorting/filtering by failure reason?
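To make the sort/filter idea a bit more concrete, here is a minimal sketch that groups failed spiders by a coarse failure reason, so proxy-related failures can be separated from the rest. Everything in it is an assumption for illustration: the key names are modelled on Scrapy-style stats (`item_scraped_count`, `downloader/response_status_count/<code>`, `log_count/ERROR`), I have not checked that the per-spider stats files contain them, and the list of "proxy-like" status codes and the reason labels are my own guesses.

```python
from collections import defaultdict

# Assumption: HTTP status codes that usually point at proxy/blocking problems.
PROXY_STATUS_CODES = ("403", "407", "429")


def failure_reason(stats):
    """Return a coarse failure label, or None if the spider produced items."""
    if stats.get("item_scraped_count", 0) > 0:
        return None
    for code in PROXY_STATUS_CODES:
        # Assumed Scrapy-style key; may not match the real stats files.
        if stats.get(f"downloader/response_status_count/{code}", 0) > 0:
            return "proxy_or_blocked"
    if stats.get("log_count/ERROR", 0) > 0:
        return "spider_error"
    return "no_items"


def group_failures(stats_by_spider):
    """Map each failure reason to a sorted list of spider names."""
    groups = defaultdict(list)
    for spider, stats in stats_by_spider.items():
        reason = failure_reason(stats)
        if reason is not None:
            groups[reason].append(spider)
    return {reason: sorted(names) for reason, names in groups.items()}
```

A dashboard (or a notification job) could then show only the `spider_error` and `no_items` groups and hide `proxy_or_blocked` by default.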
https://alltheplaces-data.openaddresses.io/runs/2024-07-06-13-31-59/output.zip seems to not include data such as https://alltheplaces-data.openaddresses.io/runs/2024-07-06-13-31-59/stats/daylight_donuts_us.json
maybe it would be nice to include it so it is easier to find, and people wishing to use it do not need to fetch it manually for each spider? Or is it used rarely enough that having interested people fetch it themselves is the proper solution?
(note, I was looking for such info and I was confused enough to create #8791)
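For reference, this is roughly what "fetch it manually for each spider" looks like today. It is only a sketch: the URL pattern is inferred from the single stats link above, and the helper names `stats_url` and `fetch_stats` are made up for this example.

```python
import json
from urllib.request import urlopen

# Base of the run URLs quoted above.
BASE_URL = "https://alltheplaces-data.openaddresses.io/runs"


def stats_url(run_id, spider):
    """Build the per-spider stats URL (pattern inferred from one example link)."""
    return f"{BASE_URL}/{run_id}/stats/{spider}.json"


def fetch_stats(run_id, spiders):
    """Download and parse the stats JSON for each spider, one request per spider."""
    results = {}
    for spider in spiders:
        with urlopen(stats_url(run_id, spider), timeout=30) as response:
            results[spider] = json.load(response)
    return results
```

For example, `fetch_stats("2024-07-06-13-31-59", ["daylight_donuts_us"])` would make one HTTP request per listed spider, which is the overhead that bundling the stats into the .zip would avoid.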