alltheplaces / alltheplaces

A set of spiders and scrapers to extract location information from places that post their location on the internet.
https://www.alltheplaces.xyz

include processing metadata in bulk download? #8845

Open matkoniecz opened 1 month ago

matkoniecz commented 1 month ago

https://alltheplaces-data.openaddresses.io/runs/2024-07-06-13-31-59/output.zip seems to not include data such as https://alltheplaces-data.openaddresses.io/runs/2024-07-06-13-31-59/stats/daylight_donuts_us.json

Maybe it would be nice to include it so it is easier to find and people wishing to use it do not need to fetch it manually for each spider? Or is its use rare enough that fetching it separately is the proper solution for those interested?

(Note: I was looking for this information and was confused enough to create #8791.)
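
For reference, this is roughly what a consumer has to do today: one request per spider of interest, using the stats URL pattern shown above. A minimal sketch, assuming the run id and spider name are known; the helper name and values are illustrative only.

```python
import requests

# Sketch of fetching the per-spider processing stats JSON manually,
# since it is not included in the bulk output.zip. The run id and
# spider name below are examples taken from the URLs above.
RUN_ID = "2024-07-06-13-31-59"
BASE = f"https://alltheplaces-data.openaddresses.io/runs/{RUN_ID}"

def fetch_spider_stats(spider_name: str) -> dict:
    """Download the processing stats JSON for one spider."""
    url = f"{BASE}/stats/{spider_name}.json"
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()

stats = fetch_spider_stats("daylight_donuts_us")
print(stats)
```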

matkoniecz commented 1 month ago

Currently the .zip has a single "output" folder. I propose also including an "output_metadata" folder.

iandees commented 1 month ago

There is a field in the output JSON called something like dataset_properties that has information about the spider. I could see adding some of this data to that output.

I'm still of the opinion that a data consumer shouldn't care about this though. Maybe the right thing to do is to offer a "latest" endpoint that gives out the most recent successful output for any particular spider?
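
If that data were attached to the output itself, a consumer could read it straight from the bulk download. A minimal sketch, assuming the field is named "dataset_properties" as mentioned above and that the zip's "output" folder contains one .geojson file per spider; the real key name, file layout, and contents may differ.

```python
import json
from pathlib import Path

# Sketch of pulling spider-level metadata out of the bulk download
# instead of fetching per-spider stats separately. The key name
# "dataset_properties" and the *.geojson layout are assumptions for
# illustration, not the confirmed output format.
def spider_metadata(output_dir: str) -> dict:
    metadata = {}
    for geojson_path in Path(output_dir).glob("*.geojson"):
        with geojson_path.open() as f:
            data = json.load(f)
        # Keep whatever spider-level metadata is attached to the file.
        metadata[geojson_path.stem] = data.get("dataset_properties", {})
    return metadata

print(spider_metadata("output"))
```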

matkoniecz commented 1 month ago

> I'm still of the opinion that a data consumer shouldn't care about this though.

Maybe I got too invested in improving ATP, but it would be nice to be notified when a spider starts to fail so I can investigate whether I can help with it. Listing all failed spiders is not useful here, as many failures are caused by proxy issues.

iandees commented 1 month ago

> it would be nice to be notified when a spider starts to fail

Yea, this would be great! I used to have the weekly run create/reopen a GitHub ticket when a spider failed for whatever reason. That got really noisy. Can you think of a better place to put this kind of thing?

matkoniecz commented 1 month ago

> when a spider failed for whatever reason

That is definitely too noisy, given how many spiders fail on proxy problems.

> Can you think of a better place to put this kind of thing?

Some dashboard on the website that would allow sorting/filtering by failure reason?