astropy / astroquery

Functions and classes to access online data resources. Maintainers: @keflavich and @bsipocz and @ceb8
http://astroquery.readthedocs.org/en/latest/
BSD 3-Clause "New" or "Revised" License

vectorize astroquery.esa.hsa & HSA.download_data #3004

Open jkrick opened 4 months ago

jkrick commented 4 months ago

I would like to query the Herschel archive for ~thousands of spectra based on position (maybe a million one day??). Right now I have skycoords in a table for my sample, but in order to do HSA.query_hsa_tap() I have to do a for loop over them all. It would be nice if table_upload were supported, or any method of vectorizing that query.

Secondly, to download the data, it would be nice if HSA.download_data were vectorized to handle input of multiple observation_ids.

I see this is a more specific version of #682. But don't worry, one day I'll ask for vectorizing the other archives too.

keflavich commented 4 months ago

@jkrick does the HSA archive allow multi-position queries? If it does, it is possible to support this - though we'd need help implementing it.

bsipocz commented 4 months ago

cc @jespinosaar

jespinosaar commented 4 months ago

Dear @jkrick , many thanks for your feedback.

I have been checking the options available and, indeed, table_upload is not currently supported. In the Archive UI (https://archives.esac.esa.int/hsa/whsa/) you can see we can upload a list of targets, but in the end what we are doing is simply resolving them, extracting their coordinates and then generating a query with several OR clauses, one for each pair of coordinates.
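To illustrate the kind of query described above, here is a minimal sketch that turns a list of coordinates into ADQL strings with one cone-search clause per position, joined by OR. The table name `hsa.v_active_observation`, the column names `ra`, `dec` and `observation_id`, and the helper name `build_multi_cone_queries` are assumptions made for illustration; this is not necessarily the query the Archive UI generates. Positions are split into chunks so that no single query grows without bound.

```python
def build_multi_cone_queries(positions, radius_deg, chunk_size=50,
                             table="hsa.v_active_observation",
                             ra_col="ra", dec_col="dec"):
    """Return a list of ADQL strings, each covering up to chunk_size positions.

    positions  : sequence of (ra, dec) pairs in degrees (ICRS)
    radius_deg : cone-search radius in degrees
    table, ra_col, dec_col : assumed names, adjust to the real schema
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")

    positions = list(positions)
    queries = []
    for start in range(0, len(positions), chunk_size):
        chunk = positions[start:start + chunk_size]
        # One CONTAINS(...) clause per (ra, dec) pair, joined with OR.
        clauses = [
            f"1=CONTAINS(POINT('ICRS', {ra_col}, {dec_col}), "
            f"CIRCLE('ICRS', {ra:.6f}, {dec:.6f}, {radius_deg:.6f}))"
            for ra, dec in chunk
        ]
        queries.append(
            f"SELECT observation_id FROM {table} WHERE " + " OR ".join(clauses)
        )
    return queries
```

Each returned string could then be passed to `HSA.query_hsa_tap()` in turn. A suitable `chunk_size` depends on the server's query-length limits, which are not stated in this thread.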

On the other hand, I really like the idea of vectorizing the methods included in the different modules, but please also bear in mind the limits that the server and the DB place on the length of the queries and the contents of the requests (thinking of millions of different elements). For that amount of data, I think it is easier to run a for loop over the rows of a table, so that the results are retrieved one by one. You can control what is done between iterations, and small requests are easier and faster for the server to handle.
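A client-side loop of this kind might look like the following sketch. The helper `query_one_by_one` and its `query_one` argument are hypothetical: `query_one` stands for any callable that runs a single-position query, for example a small function that builds an ADQL string and passes it to `HSA.query_hsa_tap()`. Failures are recorded per source so that one bad request does not abort the whole run, and an optional pause keeps the load on the server low.

```python
import time


def query_one_by_one(positions, query_one, pause_s=0.0):
    """Run one query per position and collect results and failures.

    positions : sequence of (ra, dec) pairs in degrees
    query_one : hypothetical callable taking (ra, dec) and returning a result
    pause_s   : seconds to wait between requests (0 disables the pause)
    """
    results = {}   # index in positions -> query result
    failures = {}  # index in positions -> exception raised for that source
    for index, (ra, dec) in enumerate(positions):
        try:
            results[index] = query_one(ra, dec)
        except Exception as exc:  # keep going; the caller decides what to retry
            failures[index] = exc
        if pause_s:
            time.sleep(pause_s)
    return results, failures
```

Anything that should happen between iterations, such as writing partial results to disk or logging progress, can be added inside the loop body.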

Please let me know if you have any further questions.

keflavich commented 4 weeks ago

Closing as answered.

In brief: HSA doesn't presently support vectorized queries. The preferred approach is to loop over individual sources; any other approach to vectorization just results in a loop on the server anyway.

bsipocz commented 4 weeks ago

@keflavich - I would still prefer to have an API on our side that accepts vectorized inputs and either does the looping itself or makes the appropriate batch API calls when they are available, all transparently to the user (although it will of course be slow when it has to loop over single objects).
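One possible shape for such a wrapper, as a minimal sketch: `vectorize`, `single_call` and `batch_call` are hypothetical names and none of this is existing astroquery API. The wrapper accepts either a single observation ID or a sequence of them, uses a batch call when one is supplied, and otherwise falls back to looping over the single-item call.

```python
def vectorize(single_call, batch_call=None):
    """Wrap a single-item function so it also accepts a sequence of items.

    single_call : callable handling one item (e.g. one observation ID)
    batch_call  : optional callable handling a list of items in one request;
                  used instead of looping when the service supports it
    """
    def wrapper(items):
        # Strings are iterable, so treat them explicitly as single items.
        is_scalar = isinstance(items, (str, bytes)) or not hasattr(items, "__iter__")
        sequence = [items] if is_scalar else list(items)

        if batch_call is not None:
            outputs = list(batch_call(sequence))
        else:
            # Fallback: one call per item, which is slow for large inputs.
            outputs = [single_call(item) for item in sequence]

        # Return the same shape the caller passed in.
        return outputs[0] if is_scalar else outputs

    return wrapper
```

For example, `download_many = vectorize(download_one)` would accept a list of observation IDs, where `download_one` is assumed to be a small user-defined function that calls `HSA.download_data` for a single ID.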