CityofSantaMonica / mds-provider

Python tools for working with MDS Provider data
https://github.com/openmobilityfoundation/mobility-data-specification
MIT License

raise error if not 200 status code #82

Closed hunterowens closed 4 years ago

hunterowens commented 4 years ago

Currently, if get_status_changes or get_trips runs and the request returns a non-200 status code, the __describe() method runs but no Python error is actually raised, meaning the code continues to run.

This PR wraps the request block in a try/except that runs __describe and then raises an error, which should stop program execution.
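
For context, a minimal sketch of the pattern this PR introduces (not the actual client internals; `_describe` and `_request_page` are illustrative names standing in for the private methods):

```python
import requests

def _describe(response):
    """Log a short summary of a failed request (stand-in for __describe)."""
    print(f"Requested {response.url}, Response Code: {response.status_code}")

def _request_page(session, url, params=None):
    """Request one page of Provider data, raising on any non-200 response."""
    response = session.get(url, params=params)
    if response.status_code != 200:
        _describe(response)
        # Raise instead of silently continuing, so callers can stop or handle it.
        raise requests.HTTPError(
            f"Request failed with status {response.status_code}", response=response
        )
    return response.json()
```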

thekaveman commented 4 years ago

I have mixed feelings about this. I see plenty of cases where a given page/request will time out with a 504 or similar, but then the next page/request succeeds. When doing a large backfill, it's helpful to be able to pass over these errors, especially when the next backfill request partially overlaps the current one, such that the timeout/miss is handled "eventually". Here's a recent example from my logs:

```
Requesting trips from Lime
Time range: 2019-09-08T12:00:00-07:00 to 2019-09-09T12:00:00-07:00
Requested https://data.lime.bike/api/partners/v1/mds/santa_monica/trips?min_end_time=1567969200000&max_end_time=1568055600000, Response Code: 504

# more error output...

Validating trips @ 0.3.2
0 records, 0 passed, 0 failed
Skipping data load
trips complete
Requesting trips from Lime
Time range: 2019-09-08T00:00:00-07:00 to 2019-09-09T00:00:00-07:00
Got payload with 1000 trips
Got payload with 1000 trips
Got payload with 522 trips
Validating trips @ 0.3.2
2522 records, 2522 passed, 0 failed
Loading trips
No POSTGRES_HOST_PORT environment variable set, defaulting to: 5432
trips complete
```

The request for 2019-09-08T12:00:00-07:00 to 2019-09-09T12:00:00-07:00 failed, but the backfill kept going and the request for 2019-09-08T00:00:00-07:00 to 2019-09-09T00:00:00-07:00 succeeded.

For the backfill case, a try/except at the ETL level would allow for restarting/retrying/continuing. But if we raise in the middle of paging, there isn't really a clean way to continue with subsequent pages.
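
For example, a backfill script built on top of this library could handle the raise at the ETL level and keep going. A sketch only, with hypothetical `get_trips` and `load` callables, not code from mds-provider-services:

```python
import requests

def run_backfill(client, windows, get_trips, load):
    """Try each (start, end) window, skipping over failed requests.

    client, get_trips, and load are hypothetical stand-ins for whatever the
    calling ETL code uses; only the error-handling pattern matters here.
    """
    failed = []
    for start, end in windows:
        try:
            load(get_trips(client, start, end))
        except requests.HTTPError as err:
            # Record the failure and continue; an overlapping later window
            # may cover the missed data "eventually".
            failed.append((start, end, err))
    return failed
```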

ian-r-rose commented 4 years ago

Perhaps using a library like this one could allow retrying API requests, while still letting non-ephemeral failures propagate up to the system?
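
The library linked above isn't preserved in this thread; as one possible shape (assuming something like tenacity), a retry decorator with exponential backoff could wrap the page request and only re-raise after the final attempt:

```python
import requests
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@retry(
    retry=retry_if_exception_type(requests.HTTPError),
    stop=stop_after_attempt(5),                   # give up after 5 attempts
    wait=wait_exponential(multiplier=1, max=60),  # back off exponentially between tries
    reraise=True,                                 # propagate the original error at the end
)
def get_page(session, url):
    response = session.get(url)
    response.raise_for_status()
    return response.json()
```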

thekaveman commented 4 years ago

Ah... let me back up, because I wasn't thinking clearly in my earlier comment. Obviously, if a given page request throws an error we don't get a response back to build the next page query from, so that case is moot. It isn't so much the retry/backoff aspect that concerns me here - that could be handled by a user of this library with standard try/except semantics, or with a helper library like the one @ian-r-rose linked.

I think the core issue is, if we raise during any one page request, all previously successful pages are thrown away. So if pages 1-9 return data and then page 10 times out, we lose everything previously gathered. (Thinking especially of the Bird case, where potentially many pages each represent only a few minutes of the overall request window.)
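
One way around that (a sketch only, not the current client API) is to yield each page as it arrives, so pages the caller has already consumed aren't lost when a later page fails:

```python
import requests

def iter_pages(session, first_url):
    """Yield each page's payload; a failure only affects pages not yet yielded."""
    url = first_url
    while url:
        response = session.get(url)
        response.raise_for_status()
        payload = response.json()
        yield payload
        # MDS 0.3.x paging: the next page link lives under links.next, if present
        url = payload.get("links", {}).get("next")
```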

This is probably more an argument in favor of something like proposed/discussed in #13 and CityofSantaMonica/mds-provider-services#61 than anything else. On that note, @hunterowens would you mind rebasing on the latest master? The client underwent some significant refactors since 5f1c51e (the commit this branch is based on).

ian-r-rose commented 4 years ago

> I think the core issue is, if we raise during any one page request, all previously successful pages are thrown away. So if pages 1-9 return data and then page 10 times out, we lose everything previously gathered. (Thinking especially of the Bird case, where potentially many pages each represent only a few minutes of the overall request window.)

I wonder if, in practice, using something like exponential backoff would be effective in mitigating failures like what you describe @thekaveman. We would then only raise if the last attempt (maybe 10th try, could be configurable) failed. We would still lose prior pages, but maybe not very often, and a true failure-after-ten-retries would be something worth looking at.
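
Hand-rolled, that idea might look something like this (illustrative only), with a configurable number of attempts and only the final failure propagating:

```python
import time
import requests

def get_with_backoff(session, url, max_tries=10, base_delay=1.0):
    for attempt in range(1, max_tries + 1):
        try:
            response = session.get(url)
            response.raise_for_status()
            return response.json()
        except requests.RequestException:
            if attempt == max_tries:
                raise  # a true failure-after-ten-retries is worth looking at
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
```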

> This is probably more an argument in favor of something like proposed/discussed in #13 and CityofSantaMonica/mds-provider-services#61 than anything else. On that note, @hunterowens would you mind rebasing on the latest master? The client underwent some significant refactors since 5f1c51e (the commit this branch is based on).

I don't think it would be a problem to move to an asynchronous execution model, though it probably wouldn't help too much, since each page link is derived from the previous one, so the amount of concurrency is pretty low (unless you start to parallelize across different providers or different time windows).
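
For the cross-provider (or cross-window) case, even a thread pool would do; `fetch_trips` here is a hypothetical per-provider callable, not part of this library:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(providers, fetch_trips, start, end, max_workers=4):
    """Fetch trips for several providers concurrently; pages within each
    provider's request still chain serially."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_trips, p, start, end): p for p in providers}
        return {provider: future.result() for future, provider in futures.items()}
```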

thekaveman commented 4 years ago

These changes would need to be rebased on the latest master. My points above may be less relevant now given the MDS 0.4.x changes: the static-file-backed status_changes and trips endpoints no longer support paging.

Suggest opening a new PR if changes are still desired.