mkhorton opened this issue 11 months ago
Hi @mkhorton, glad you're finding it useful -- and all good questions!
To lump the first two questions together, callbacks are the only way of really doing this right now; my most robust workflow uses callbacks to rewrite the `next` link of a query such that it skips to the last result it finds in the database for a given provider. This is austerely documented here and here, but I can try to dig out the specific code I've been using if that's helpful (and maybe add it as a selectable callback).

The usage of `--output-file` was initially meant to just replicate doing a stdio redirect, but could definitely be improved/specialised (e.g., writing a JSONLines file per database queried, then only needing to read the last line to do continuation -- this could also be achieved with a callback, but it's probably asking a lot of a user to write this).
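As a rough illustration of the JSONLines idea (a minimal sketch only, assuming the documented `callback(url, results)` signature and the `callbacks` argument of `OptimadeClient`; the per-provider filename and continuation scheme are illustrative, not the exact code I've been using):

```python
import json
from urllib.parse import urlparse

from optimade.client import OptimadeClient


def write_jsonl(url, results):
    """Append each page of results to a JSONLines file named after the provider.

    To continue an interrupted download, only the last line of the relevant
    file needs to be read back to work out where the previous query stopped.
    """
    provider = urlparse(str(url)).netloc.replace(":", "_")
    with open(f"{provider}.jsonl", "a") as handle:
        for entry in results.get("data", []):
            handle.write(json.dumps(entry) + "\n")


client = OptimadeClient(callbacks=[write_jsonl])
client.get('elements HAS "Ag"')
```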
For the final two questions: do you think the errors you are running into are really solved with this though? If you think a simple staggered retry would solve your problems then we can definitely add it.
Thanks @ml-evs, regarding:
> This is austerely documented here and here, but I can try to dig out the specific code I've been using if that's helpful (and maybe add it as a selectable callback).
If you do have code on hand, that'd be super helpful! Otherwise I can muddle through.
> do you think the errors you are running into are really solved with this though?
I think it's a mix. Some providers just need some extra time/backoff period, while other providers have genuine issues. I think a good test may be to use some filter that returns a large number of documents, and then arbitrarily try a page with a very high page number. If it works, great; if it doesn't, it probably suggests some underlying server issue.
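For example, something along these lines (hypothetical base URL and filter; `page_limit` and `page_offset` are standard OPTIMADE pagination parameters, though providers are free to paginate differently):

```python
import requests

# Hypothetical provider base URL and filter; adjust to the database under test.
BASE_URL = "https://example.org/optimade/v1"
params = {
    "filter": 'elements HAS "O"',   # a filter expected to match many entries
    "page_limit": 10,
    "page_offset": 100_000,         # deliberately deep into the result set
}

response = requests.get(f"{BASE_URL}/structures", params=params, timeout=60)
print(response.status_code)
# A 200 with (possibly empty) "data" suggests deep pagination works;
# a 5xx or a timeout here points to an underlying server-side issue.
```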
> If you think a simple staggered retry would solve your problems then we can definitely add it.
Difficult to say; I'd be in favor of adding it for politeness regardless, since some people might not add that `request_delay` field to their response. But if you've already tried it... One point of confusion for me is the flow for when a `RuntimeError` is encountered; is this handled the same as a `TimeoutError` (i.e., it will re-attempt 5 times), or does it fail immediately?
> If you do have code on hand, that'd be super helpful! Otherwise I can muddle through.
Sorry, I hid it somewhat in my comment above; you should be able to expand the sentence "Just found the code..." to see the snippet.
> I think it's a mix. Some providers just need some extra time/backoff period, while other providers have genuine issues. I think a good test may be to use some filter that returns a large number of documents, and then arbitrarily try a page with a very high page number. If it works, great; if it doesn't, it probably suggests some underlying server issue.
Large queries are still a bit of an issue; until recently we still had a whole COLSCAN going on in the reference implementation, as we needed to get the number of returned entries, but we have now made this optional (and obey a MongoDB internal timeout). Mostly we have been getting away with this by just running sufficiently small databases with enough memory to make this access fast, as I really don't want to have to mess around with cursor pagination and such in MongoDB (the OPTIMADE spec is designed such that this should be possible, though). I know that the Elasticsearch-based implementations also struggle with more than 10,000 results by default unless you implement the Scroll API, which I do not have the bandwidth or expertise to do in optimade-python-tools (see #1291). We can definitely try to be more robust to this, though.
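For illustration, the kind of optional, time-bounded count described above might look roughly like this with pymongo (hypothetical collection and time limit; not the reference implementation's actual code):

```python
from pymongo import MongoClient
from pymongo.errors import ExecutionTimeout

# Hypothetical connection and collection; not the reference server's setup.
collection = MongoClient()["optimade"]["structures"]
mongo_filter = {"elements": {"$all": ["Ag"]}}

try:
    # Bound the count so a full collection scan cannot stall the response;
    # maxTimeMS is enforced server-side by MongoDB.
    total = collection.count_documents(mongo_filter, maxTimeMS=1000)
except ExecutionTimeout:
    # Fall back to returning results without an exact total
    # (data_returned is optional in an OPTIMADE response).
    total = None

print(total)
```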
> Difficult to say; I'd be in favor of adding it for politeness regardless, since some people might not add that `request_delay` field to their response. But if you've already tried it... One point of confusion for me is the flow for when a `RuntimeError` is encountered; is this handled the same as a `TimeoutError` (i.e., it will re-attempt 5 times), or does it fail immediately?
I'll remind myself of our current approach and consider adding this; it should be straightforward.
Hi there! Thanks for the great work on the OPTIMADE client; it's really pleasant to use and looks very well designed. I appreciate it all the more having written the (rather poorer) interface in pymatgen, which one day perhaps we can deprecate :)
While using the interface, I frequently encounter timeout errors with some databases. These might look like:
or
This raises a few questions, and I'm not sure of the best resolution:

- When using `--output-file`, the output file will be empty when an error is encountered, even if many structures were successfully retrieved. I understand using a callback is probably the best option here.
- Why is this a `TimeoutError`, and not a `RuntimeError`?
- Would it make sense to use the `retry` library or similar, and have an automatically increasing sleep time after each retry? (See the sketch at the end of this post.)

Apologies if there is an existing issue for this. I did have a look but couldn't find one. If I'm mis-using the library and this is already supported somehow, I'd be glad to know! Thanks again!
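For the last point, the kind of thing I had in mind with the `retry` package would be roughly the following (a sketch only; the `fetch_page` helper and its parameters are hypothetical and not part of the client):

```python
# Illustration of the "increasing sleep time" idea using the third-party
# `retry` package (https://pypi.org/project/retry/).
import requests
from retry import retry


@retry((requests.Timeout, requests.ConnectionError), tries=5, delay=1, backoff=2)
def fetch_page(url: str, params: dict) -> dict:
    """Fetch one page of results, waiting 1, 2, 4, 8 s between attempts."""
    response = requests.get(url, params=params, timeout=60)
    response.raise_for_status()
    return response.json()
```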