astropy / astroquery

Functions and classes to access online data resources. Maintainers: @keflavich and @bsipocz and @ceb8
http://astroquery.readthedocs.org/en/latest/
BSD 3-Clause "New" or "Revised" License
702 stars 396 forks source link

API discussion: keeping duplicated methods `xyz`/`xyz_async` vs having an async kwarg? #2598

Open bsipocz opened 1 year ago

bsipocz commented 1 year ago

While I was reviewing https://github.com/astropy/astroquery/pull/2597, I run into this quasi-dilemma, but I raise it as a separate question for @keflavich and @andamian, as well as @ceb8.

We have a lot of historical baggage around our API, and I would like to revisit some of it to see whether they still make sense, or we should slowly deprecate them out. At some point we need to switch versioning anyway, so one big breath and breaking API doesn't feel off the table.

So, the question for async and sync jobs: we traditionally have method_async and method, auto-generated with async_to_sync. But some modules don't follow this, and some modules manually have duplicated methods. So I wonder, now as we move on towards using proper VO tools, and pyvo, whether we want to revisit our API preferences, and would rather go with this kwarg-driven approach? Or the method duplication is better? (And frankly, it's baked into the API too much already that if it's the same, we could just stick with it).

I very much interested in the take from Adrian, whether it made any sense to you while redoing e.g. alma, or it was something to go along with?

andamian commented 1 year ago

alma has only sync access but cadc has both. The magic of async_to_sync is more trouble than it's worth it. I just realized some mistakes I've made in cadc now that I'm specifically looking at this.

The sync case is simple so I'm not going to discuss it. For the async one, the main difference is that the control returns to the client while the request is being processed but if followed back (and successful) it can return same/similar result. To me, this looks very similar to the stream attribute in requests functions: by default query_xyz should return the results (whether behind the scene is sync or async) but if the user wants more control and service supports async requests, it could return from the method before the request is completed and let user take over (poll the service and access results when done).

With async requests, there's also the question of the request ID. In IVOA, services return a job with an ID. With that job ID, the user can check the status of the job and access the results when job is completed. astroquery might need to recommend a function that takes a job ID and returns job details data associated with it that can be executed any time. In cadc, we expose the pyvo.Job class which works with VO services but probably not with others.

To summarize, if backwards compatibility is not an issue I would get rid of async_to_sync and add the optional async parameter to methods that use async services. Mind you, my experience is mainly with VO services (which BTW could have a default implementation to achieve that.) and I don't know if it works with others.

keflavich commented 1 year ago

Using a kwarg instead of the async-to-sync magic makes sense to me, so if we want to start overhauling services, we can do that.

My vague sense is that the _async methods have not been widely used outside of development and debugging.

We should do a little research on this.

https://github.com/search?q=query_region_async&type=Code turns up almost all hits to astroquery tests in various forks.

query_region is harder to search for, but https://github.com/search?q=simbad.query_region&type=Code turns up several user use-cases, while https://github.com/search?p=2&q=simbad.query_region_async&type=Code is just astroquery forks.

So if we want to make a change, let's spell out the proposed API here and discuss further. I'm hearing: remove _async methods, and instead add async=True/False. For most services, the return will still be a requests response object. Is that correct?

andamian commented 1 year ago

Yes. Missing async or async=False will still return a request.Response. The question is what async=True should return?

For VO services, as I've said, it could return a UWS job. The pattern is well documented.

Do we have any idea/examples of async non-VO services? I just assume that they return a URL where the data is being staged. The URL is inaccessible until data becomes available. At least this is the pattern that ALMA used to have if I remember correctly. If that's the case, the simplest is to assume that this is what the return should be - a response with a redirect URL that the client needs to check for when the results become available. We'll need to figure out if that's possible to achieve with VO services too. With these services we might want to return more details about the jobs for clients that could make use of that info.

keflavich commented 1 year ago

No, if we're talking about a unified method, async=False returns the parsed table, not a response object.

The original idea for async was to enable asynchronous/parallel downloads from normal HTTP web services, not from VO. We only later encountered services that actually had separate query & staging processes, and those were not handled through the async mechanism. besancon is one example, iirc.

ceb8 commented 1 year ago

@keflavich My experience (primarily with MAST, but I've been thinking about the construction while working on the new NEOCC service too) matches with yours in that the async versions have been primarily useful for debugging for modules that don't have a truly asynchronous option.

bsipocz commented 1 year ago

@andamian:

Yet another occasion when I'm running into the async vs sync question:

andamian commented 1 year ago

Not an easy choice, especially with TAP. sync operations are those that return fast (2min is probably a good rule of thumb). More than that and the connection might time out. The async mode with that in mind.

For TAP, the duration of the query is not deterministic, so async might be a safer bet. Besides increased the increased complexity of dealing with jobs, there are other drawbacks: the response time (results are only ready when the job completes and data is staged) and possible restrictions on the size of the staged data.

The client can also implement a hybrid mode, where the API is sync, but underneath uses the service async endpoint to start jobs and periodically check their status until they complete. In this mode they do not have to deal with the jobs.

Does this help at all?

bsipocz commented 1 year ago

OK, gotcha, this helps a lot, thank you. So, based on the archive preference, we may have different defaults whether we default to sync or async (e.g. we may want to do async for irsa). However, I think there is a fundamental agreement that we should move away from duplicating methods (let alone when they are not even real async ones), in favor of a kwarg.