Materials-Consortia / optimade-python-tools

Tools for implementing and consuming OPTIMADE APIs in Python
https://www.optimade.org/optimade-python-tools/
MIT License

Client: graceful way to pause/continue and deal with runtime errors? #1900

Open mkhorton opened 11 months ago

mkhorton commented 11 months ago

Hi there! Thanks for the great work on the OPTIMADE client; it's really pleasant to use and looks very well designed. I appreciate it especially, having written the (much poorer) interface in pymatgen, which perhaps we can one day deprecate :)

While using the interface, I frequently encounter timeout errors with some databases. These might look like:

returned: [RuntimeError: 502 -

or

returned: ['RuntimeError: 500 -

This raises a few questions, and I'm not sure of the best resolution:

  1. If using --output-file, the output file will be empty when an error is encountered, even if many structures were successfully retrieved. I understand using a callback is probably the best option here.
  2. I can't see a good way to automatically skip the first N pages, if they have already been successfully retrieved.
  3. I'm not sure if these two return codes will trigger the automatic retry; it looks like this will only happen for a TimeoutError, and not a RuntimeError?
  4. The automatic retry does not seem to have a backoff period. Some databases request waiting for 30 s, for example. Perhaps we could take inspiration from the retry library or similar and have an automatically increasing sleep time after each retry? (A rough sketch of the idea follows this list.)
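
Purely for illustration, here is a rough sketch of the kind of staggered, user-side retry I mean; none of this exists in the client today, the `wait_and_retry` helper is made up, and it assumes that a failed attempt surfaces as an exception that can be caught (which may not match how the client reports errors internally).

```python
# Hypothetical sketch of an increasing sleep between retries, in the spirit of
# the `retry` library; this is not a feature of OptimadeClient.
import time

from optimade.client import OptimadeClient


def wait_and_retry(query: str, attempts: int = 5, base_delay: float = 2.0):
    """Retry a query with an exponentially increasing pause between attempts."""
    client = OptimadeClient(max_results_per_provider=100)
    for attempt in range(attempts):
        try:
            return client.get(query)
        except Exception:
            # Back off 2 s, 4 s, 8 s, ... before trying again.
            time.sleep(base_delay * 2**attempt)
    raise RuntimeError(f"Query still failing after {attempts} attempts")


results = wait_and_retry('elements HAS "Si"')
```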

Apologies if there is an existing issue for this. I did have a look but couldn't find one. If I'm mis-using the library and this is already supported somehow, I'd be glad to know! Thanks again!

ml-evs commented 11 months ago

Hi @mkhorton, glad you're finding it useful -- and all good questions!

To lump the first two questions together, callbacks are really the only way of doing this right now; my most robust workflow uses callbacks to rewrite the next link of a query so that it skips to the last result already found in the database for a given provider. This is austerely documented here and here, but I can try to dig out the specific code I've been using if that's helpful (and maybe add it as a selectable callback). The usage of --output-file was initially meant to just replicate a stdio redirect, but it could definitely be improved/specialised (e.g., writing a JSONLines file per database queried, so that continuation only needs to read the last line -- this could also be achieved with a callback, but it's probably asking a lot of a user to write this).
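
As a concrete illustration of that JSONLines idea (not something the client does today), a callback along these lines could append every page of results to one `.jsonl` file per provider; the `append_jsonl` helper and the file-naming scheme are made up for the example, while the `(url, results)` callback signature matches the one used by the client's callbacks.

```python
# Illustrative sketch: one JSONLines file per provider, appended to by a callback.
import json
from pathlib import Path

from optimade.client import OptimadeClient


def append_jsonl(url, results):
    """Append each returned entry to a JSONL file named after the provider prefix."""
    prefix = results["meta"].get("provider", {}).get("prefix", "unknown")
    with Path(f"{prefix}.jsonl").open("a") as handle:
        for entry in results["data"]:
            handle.write(json.dumps(entry) + "\n")
    # Returning None leaves the query's pagination untouched.
    return None


client = OptimadeClient(callbacks=[append_jsonl])
client.get(response_fields=["chemical_formula_reduced"])
```

Resuming would then only require reading the last line of each file to find the last retrieved entry, as described above.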

Just found the code for the restartable mongo callback, and it's not pretty (note that in this case there is no filter; we could potentially consider having baked-in features for making per-filter databases/output files):

```python
if __name__ == "__main__":
    import pymongo as pm
    from optimade.client import OptimadeClient
    from httpx import URL
    import urllib.parse

    client = pm.MongoClient("mongodb://localhost:27017/optimade_example")
    collection = client.optimade_example.structures
    collection.create_index("immutable_id", unique=True)
    collection.create_index("prefix")

    def insert_into_mongo(url, results):
        """Inserts data into a MongoDB collection."""
        prefix = results["meta"].get("provider", {}).get("prefix", None)
        url = URL(url)
        next_url = None
        duplicates = False
        # start = time.monotonic_ns()
        for entry in results["data"]:
            formula = entry.pop("attributes")["chemical_formula_reduced"]
            entry["chemical_formula_reduced"] = formula
            entry["prefix"] = prefix
            entry["immutable_id"] = f"{url.scheme}://{url.host}{url.path}/{entry['id']}"
            try:
                collection.insert_one(entry)
            except pm.errors.DuplicateKeyError:
                duplicates = True

        if duplicates:
            number_of_results_for_prefix = collection.count_documents(
                {"prefix": prefix}
            )
            suggested_page_offset = number_of_results_for_prefix - 1
            _next_url = results.get("links", {}).get("next")
            if isinstance(_next_url, dict):
                _next_url = _next_url.get("href")

            # If we have already reset the page offset once, don't do it again
            page_offset = urllib.parse.parse_qs(
                urllib.parse.urlparse(_next_url).query
            ).get("page_offset", [None])[0]
            if page_offset is None:
                return
            page_offset = int(page_offset)

            if _next_url and page_offset < 110:
                # Change the page offset to the suggested value using urllib.parse
                next_url = str(
                    URL(_next_url).copy_set_param("page_offset", suggested_page_offset)
                )

        if next_url:
            print(
                f"Overwriting next_url to {next_url}, existing results {suggested_page_offset + 1}"
            )
            return {"next": next_url, "advance_results": number_of_results_for_prefix}

        # (time.monotonic_ns() - start) / 1e9
        # print(f"Callback ran in {elapsed:.2f} s")
        return None

    download_structures = False

    client = OptimadeClient(
        max_results_per_provider=-1,
        # include_providers=["mpds", "omdb", "aflow"],
        callbacks=[insert_into_mongo],
    )
    all_formulae = client.get(response_fields=["chemical_formula_reduced"])
```

For the final two questions:

  1. There's no good way to retry in that scenario currently; do you think the errors you are running into are really solved with this though? Most of the runtime errors I see with the client are missing database features or buggy implementations, rather than server flakiness. Bear in mind that TCP already performs implicit packet retransmission, and the default timeout should allow for several retransmission windows. This doesn't fix your next query though...
  2. This can definitely be added; I've just been holding off for #1247, #1418 and #1419 to solve this properly, as OPTIMADE 1.2 now has a field by which servers can request a backoff time in-band. If you think a simple staggered retry would solve your problems then we can definitely add it. I think I had it in an earlier version, but I always found that the databases that needed the retry were broken anyway, so the retry would never succeed and it just made all queries significantly slower. (A sketch of what a staggered retry could look like follows below.)
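
As a rough sketch only (not the client's current retry logic), a staggered retry that prefers a server-suggested delay when one is present might look something like the following; the exact location of the in-band field (assumed here to be `meta.request_delay`) and the chosen delays are assumptions for the example.

```python
# Rough sketch only: retry a raw HTTP request with a staggered backoff,
# preferring an in-band delay if the server provides one (assumed here to
# live at meta.request_delay in the response body).
import time

import httpx


def get_with_backoff(url: str, attempts: int = 5, base_delay: float = 1.0) -> dict:
    delay = base_delay
    for _ in range(attempts):
        response = httpx.get(url, timeout=60)
        if response.status_code < 500:
            return response.json()
        suggested = None
        try:
            suggested = response.json().get("meta", {}).get("request_delay")
        except ValueError:
            pass  # error body was not JSON
        time.sleep(suggested or delay)
        delay *= 2  # simple staggered backoff: 1 s, 2 s, 4 s, ...
    raise RuntimeError(f"{url} still returning 5xx after {attempts} attempts")
```

In practice this logic would live inside the client rather than wrap raw `httpx` calls, but the shape would be similar.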

mkhorton commented 11 months ago

Thanks @ml-evs, regarding:

This is austerely documented here and here, but I can try to dig out the specific code I've been using if that's helpful (and maybe add it as a selectable callback).

If you do have code on hand, that'd be super helpful! Otherwise I can muddle through.

do you think the errors you are running into are really solved with this though?

I think it's a mix. Some providers just need some extra time/backoff period; other providers have genuine issues. I think a good test may be to use some filter that returns a large number of documents, and then arbitrarily try a page with a very high page number. If it works, great; if it doesn't, it probably suggests an underlying server issue.
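
For what it's worth, a quick way to run that kind of probe by hand, outside the client and with a placeholder base URL, is a single deep-offset request:

```python
# Hypothetical probe of deep pagination against one provider; the base URL is a placeholder.
import httpx

base_url = "https://example.org/optimade/v1/structures"
params = {"filter": 'elements HAS "O"', "page_limit": 1, "page_offset": 100000}
response = httpx.get(base_url, params=params, timeout=60)
# A 200 suggests deep pagination works; a 5xx here hints at an underlying server issue.
print(response.status_code)
```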

If you think a simple staggered retry would solve your problems then we can definitely add it.

Difficult to say; I'd be in favor of adding it for politeness regardless, since some people might not add that request_delay field to their response. But if you've already tried it... One point of confusion for me is the flow for when a RuntimeError is encountered; is this handled the same as a TimeoutError (i.e., it will re-attempt 5 times), or does it fail immediately?

ml-evs commented 11 months ago

If you do have code on hand, that'd be super helpful! Otherwise I can muddle through.

Sorry, I hid it somewhat in my comment above; you should be able to expand the sentence "Just found the code..." to see the snippet.

I think it's a mix. Some providers just need some extra time/backoff period; other providers have genuine issues. I think a good test may be to use some filter that returns a large number of documents, and then arbitrarily try a page with a very high page number. If it works, great; if it doesn't, it probably suggests an underlying server issue.

Large queries are still a bit of an issue; until recently we still had a whole COLLSCAN going on in the reference implementation, as we needed to get the number of returned entries, but we have now made this optional (and obey a MongoDB internal timeout). Mostly we have been getting away with this by just running sufficiently small databases with enough memory to make this access fast, as I really don't want to have to mess around with cursor pagination and such in MongoDB (though the OPTIMADE spec is designed such that this should be possible). I know that the Elasticsearch-based implementations also struggle with more than 10,000 results by default unless you implement the Scroll API, which I don't have the bandwidth or expertise to do in optimade-python-tools (see #1291). We can definitely try to be more robust to this though.
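
As an aside, here is a sketch of the kind of bounded count mentioned above (with placeholder connection details, not the exact code in the reference implementation): give the count command a server-side time limit and fall back to an unknown count if it is exceeded.

```python
# Illustrative only: cap the time spent counting results so a slow COLLSCAN
# cannot stall the whole query; fall back to an unknown count on timeout.
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017/optimade_example")  # placeholder URI
collection = client.optimade_example.structures

try:
    # maxTimeMS asks the server to abort the count if it takes longer than 1 s.
    n_results = collection.count_documents({"elements": "Si"}, maxTimeMS=1000)
except pymongo.errors.ExecutionTimeout:
    n_results = None  # report the count as unknown rather than block the query
```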

Difficult to say; I'd be in favor of adding it for politeness regardless, since some people might not add that request_delay field to their response. But if you've already tried it... One point of confusion for me is the flow for when a RuntimeError is encountered; is this handled the same as a TimeoutError (i.e., it will re-attempt 5 times), or does it fail immediately?

I'll remind myself of our current approach and consider adding this; it should be straightforward.