Open bsipocz opened 1 year ago
alma
has only sync access but cadc
has both. The magic of async_to_sync
is more trouble than it's worth it. I just realized some mistakes I've made in cadc
now that I'm specifically looking at this.
The sync
case is simple so I'm not going to discuss it. For the async
one, the main difference is that the control returns to the client while the request is being processed but if followed back (and successful) it can return same/similar result. To me, this looks very similar to the stream
attribute in requests
functions: by default query_xyz
should return the results (whether behind the scene is sync
or async
) but if the user wants more control and service supports async
requests, it could return from the method before the request is completed and let user take over (poll the service and access results when done).
With async
requests, there's also the question of the request ID. In IVOA, services return a job with an ID. With that job ID, the user can check the status of the job and access the results when job is completed. astroquery
might need to recommend a function that takes a job ID and returns job details data associated with it that can be executed any time. In cadc
, we expose the pyvo.Job
class which works with VO services but probably not with others.
To summarize, if backwards compatibility is not an issue I would get rid of async_to_sync
and add the optional async
parameter to methods that use async services. Mind you, my experience is mainly with VO services (which BTW could have a default implementation to achieve that.) and I don't know if it works with others.
Using a kwarg instead of the async-to-sync magic makes sense to me, so if we want to start overhauling services, we can do that.
My vague sense is that the _async
methods have not been widely used outside of development and debugging.
We should do a little research on this.
https://github.com/search?q=query_region_async&type=Code turns up almost all hits to astroquery tests in various forks.
query_region is harder to search for, but https://github.com/search?q=simbad.query_region&type=Code turns up several user use-cases, while https://github.com/search?p=2&q=simbad.query_region_async&type=Code is just astroquery forks.
So if we want to make a change, let's spell out the proposed API here and discuss further. I'm hearing: remove _async
methods, and instead add async=True/False
. For most services, the return will still be a requests
response object. Is that correct?
Yes. Missing async
or async=False
will still return a request.Response
. The question is what async=True
should return?
For VO services, as I've said, it could return a UWS job. The pattern is well documented.
Do we have any idea/examples of async non-VO services? I just assume that they return a URL where the data is being staged. The URL is inaccessible until data becomes available. At least this is the pattern that ALMA used to have if I remember correctly. If that's the case, the simplest is to assume that this is what the return should be - a response with a redirect URL that the client needs to check for when the results become available. We'll need to figure out if that's possible to achieve with VO services too. With these services we might want to return more details about the jobs for clients that could make use of that info.
No, if we're talking about a unified method, async=False
returns the parsed table, not a response
object.
The original idea for async
was to enable asynchronous/parallel downloads from normal HTTP web services, not from VO. We only later encountered services that actually had separate query & staging processes, and those were not handled through the async
mechanism. besancon
is one example, iirc.
@keflavich My experience (primarily with MAST, but I've been thinking about the construction while working on the new NEOCC service too) matches with yours in that the async versions have been primarily useful for debugging for modules that don't have a truly asynchronous option.
@andamian:
Yet another occasion when I'm running into the async vs sync question:
Not an easy choice, especially with TAP. sync
operations are those that return fast (2min is probably a good rule of thumb). More than that and the connection might time out. The async mode with that in mind.
For TAP, the duration of the query is not deterministic, so async
might be a safer bet. Besides increased the increased complexity of dealing with jobs, there are other drawbacks: the response time (results are only ready when the job completes and data is staged) and possible restrictions on the size of the staged data.
The client can also implement a hybrid mode, where the API is sync, but underneath uses the service async endpoint to start jobs and periodically check their status until they complete. In this mode they do not have to deal with the jobs.
Does this help at all?
OK, gotcha, this helps a lot, thank you. So, based on the archive preference, we may have different defaults whether we default to sync or async (e.g. we may want to do async for irsa). However, I think there is a fundamental agreement that we should move away from duplicating methods (let alone when they are not even real async ones), in favor of a kwarg.
While I was reviewing https://github.com/astropy/astroquery/pull/2597, I run into this quasi-dilemma, but I raise it as a separate question for @keflavich and @andamian, as well as @ceb8.
We have a lot of historical baggage around our API, and I would like to revisit some of it to see whether they still make sense, or we should slowly deprecate them out. At some point we need to switch versioning anyway, so one big breath and breaking API doesn't feel off the table.
So, the question for async and sync jobs: we traditionally have
method_async
andmethod
, auto-generated withasync_to_sync
. But some modules don't follow this, and some modules manually have duplicated methods. So I wonder, now as we move on towards using proper VO tools, and pyvo, whether we want to revisit our API preferences, and would rather go with this kwarg-driven approach? Or the method duplication is better? (And frankly, it's baked into the API too much already that if it's the same, we could just stick with it).I very much interested in the take from Adrian, whether it made any sense to you while redoing e.g.
alma
, or it was something to go along with?