astropy / astroquery

Functions and classes to access online data resources. Maintainers: @keflavich and @bsipocz and @ceb8
http://astroquery.readthedocs.org/en/latest/
BSD 3-Clause "New" or "Revised" License

Slow Simbad.query_objects & IRSA.query_region searches #3025

Open ericasaw opened 3 months ago

ericasaw commented 3 months ago

Hi! I have what might be considered an unusual use case for astroquery: I cross-match ~20,000 objects against different catalogs using the Simbad, IRSA, and XMatch queries for an instrument archive. I wrote the code that does this cross-matching several years ago and have used it to update the archive I manage ever since. I recently updated my environment, moving astroquery to the newest 0.4.7 release, but my code no longer behaves as it did with astroquery 0.4.3, and I was wondering if something changed.

In particular:

I do feel like the XMatch function has sped up significantly since astroquery 0.4.3, which I love! I was just wondering whether any of the changes made there could have affected the Simbad and IRSA search functions.

bsipocz commented 3 months ago

I would suggest separating these into two different issues, one for simbad and one for irsa. If possible, please include code examples too: they help with debugging and benchmarking, and they let us spot whether something is being used in an unintended way (so we can improve the docs to point out what not to do).

For irsa, I can say that we completely switched out the backend. Not much has changed in the method's code, but a lot could have happened on the server side in the past 3 years. Example code would also help us narrow the problem down to a useful suggestion (e.g. new methods have been added since then).

ManonMarchand commented 2 months ago

On the SIMBAD part

If I assume that you want the list of identifiers, the main identifier, and the positions for your 2MASS objects, then the proper way to do your query for now is with a TAP query (in the next astroquery version, this will be used behind the scenes by query_objects).

Let's first generate a sample of 10k 2MASS identifiers:

# let's get 10000 random 2MASS objects
from astroquery.simbad import Simbad
query = """SELECT TOP 10000 id from ident
WHERE id like '2MASS%'
"""
random_2MASS = Simbad.query_tap(query)
print(random_2MASS)
           id          
-----------------------
2MASS J00000002+7417074
2MASS J00000007-0529397
2MASS J00000007-3044366
2MASS J00000009-5455467
2MASS J00000011+0522500
2MASS J00000014+6055141
2MASS J00000015-2913020
2MASS J00000016+3208474
2MASS J00000019-1924498
2MASS J00000021+0105203
2MASS J00000022-3008557
2MASS J00000023-5709445
2MASS J00000024-5742487
2MASS J00000025+5210402
2MASS J00000025-7541166
2MASS J00000026-3441523
.
.
.

You can skip this part, since you already have your own list. You should, however, have an astropy table with a single column containing your sample (any extra columns add upload time when the table is sent to SIMBAD).

We will now write the TAP query:

query = """SELECT main_id, ra, dec, ids 
FROM random_2MASS 
JOIN ident ON ident.id = random_2MASS.id
JOIN basic ON basic.oid = ident.oidref 
JOIN ids ON basic.oid = ids.oidref 
"""

result = Simbad.query_tap(query, random_2MASS=random_2MASS)
<Table length=10000>
        main_id          ...
                         ...
         object          ...
------------------------ ...
        UCAC4 822-000001 ...
               HD 224701 ...
              CTLGD 2509 ...
   GES J00000009-5455467 ...
        UCAC4 477-000001 ...
        UCAC4 755-000001 ...
              CTLGD 9869 ...
   ATO J000.0007+32.1464 ...
        UCAC4 353-000001 ...
               HD 224700 ...
              CTLGD 5514 ...
        UCAC4 165-000001 ...
        UCAC4 162-000001 ...
         TYC 3258-1994-1 ...
        UCAC4 072-000001 ...
          TYC 6992-893-1 ...
   ATO J000.0011+31.2017 ...
.
.
.

It took 5.2 seconds on my machine.

Query explanation

We select

You could choose more columns from Simbad.list_columns().

The random_2MASS is our astropy table that we sent to SIMBAD's servers. It has to be joined to the tables containing the columns we want:

See this help page for more explanation.

Another possible speed-up is to make sure you use the SIMBAD mirror closest to you (there is one in Europe and one in the USA).

On Xmatch

@fxpineau : you have a happy user :slightly_smiling_face:

ericasaw commented 2 months ago

@ManonMarchand Thank you for the SIMBAD example! I've never used the tap search function before since query_objects has always worked for me up until now so this is super helpful :-)

@bsipocz Here is an example for the IRSA behavior I'm noticing (particularly for the name matching using IRSA.query_region where it still seems to be using a coordinate match rather than searching using the 2MASS identifier)

These are a few example 2MASS identifiers I have noticed the behavior for: 2MASS J21065473+3844265, 2MASS J21065341+3844529, 2MASS J11052903+4331357, 2MASS J05420897+1229252, 2MASS J23055131-3551130

If you run the following code:

from astroquery.ipac.irsa import Irsa
import astropy.units as u

#this is just one of the example names
result = Irsa.query_region('2MASS J21065473+3844265', catalog="fp_psc", radius=5 * u.arcsec)

result turns up as an astropy table with no entries.

If instead you expand the radius to 10 arcseconds using the same code above, the appropriate object is found. Perhaps I am making the same mistake here as I was with SIMBAD as @ManonMarchand pointed out and instead I should be using a TAP query?

As for the time, I used Irsa.query_region to look up 16,055 objects one by one in a loop (the 16,055 are not unique; some objects are repeated multiple times), which took 13 hours to run. Granted, a few other things happen in the loop (saving the results table to a dictionary and printing a progress report), so that is likely an exaggerated run time, but the querying still takes much longer than in astroquery 0.4.3.
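Because the list contains repeats, one low-effort saving is to send each distinct name only once and map the results back onto the full list afterwards. The sketch below uses only the standard library; `query_unique` and `fake_query` are hypothetical names introduced here, and `fake_query` stands in for the real network call.

```python
from typing import Callable, Dict, Iterable, TypeVar

T = TypeVar("T")


def query_unique(names: Iterable[str], query_one: Callable[[str], T]) -> Dict[str, T]:
    """Call query_one once per distinct name and return {name: result}."""
    results: Dict[str, T] = {}
    # dict.fromkeys drops duplicates while keeping first-seen order
    for name in dict.fromkeys(names):
        results[name] = query_one(name)
    return results


# Hypothetical stand-in for the real query; it only records how often it is called.
calls = []


def fake_query(name: str) -> str:
    calls.append(name)
    return f"result for {name}"


names = [
    "2MASS J21065473+3844265",
    "2MASS J21065341+3844529",
    "2MASS J21065473+3844265",  # repeated on purpose
]
cache = query_unique(names, fake_query)
# Map the cached results back onto the full, repeated list
per_row = [cache[name] for name in names]
```

With the three names above, `fake_query` is called twice instead of three times, while `per_row` still has one entry per input row.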

The loop looks like this:

import astropy.units as u
from astroquery.ipac.irsa import Irsa
from termcolor import colored

Irsa.TIMEOUT = 3600

# for the objects with found 2MASS names, search for them in the IRSA catalog
results = {}
for i, name in enumerate(names_2mass):
    # 5 arcsec is the size of the IGRINS slit, 10 arcsec is required to search the names well
    result = Irsa.query_region(name, catalog="fp_psc", radius=10 * u.arcsec)
    # if there is a result returned
    if len(result) > 0:
        # if the result is multiple objects, keep the one closest in distance
        if len(result) > 1:
            results[has_2mass[i]] = result.to_pandas().head(1)
        # save the results df to a dictionary for later
        else:
            results[has_2mass[i]] = result.to_pandas()
    # if the name search doesn't return an object, print the object name
    else:
        print(colored(f"FAILED {name}", 'light_red'))
    # update the terminal with loop progress
    print(colored(f"{i+1}", 'magenta'), colored(f"/ {len(names_2mass)}", 'light_blue'))
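
The loop keeps the first returned row when several objects match. Whether that row is the closest one depends on how the service orders its rows, which is not established in this thread. A minimal sketch of picking the closest row explicitly is shown below. It assumes each row exposes `ra` and `dec` in degrees and that the target position is known; the sample rows are made-up values for illustration only.

```python
import math


def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation (haversine formula); inputs in degrees, output in arcsec."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    # min() guards against tiny floating-point overshoot above 1
    return math.degrees(2 * math.asin(math.sqrt(min(1.0, a)))) * 3600.0


def closest_row(rows, ra0, dec0):
    """Return the row with the smallest separation from (ra0, dec0)."""
    return min(rows, key=lambda r: angular_sep_arcsec(r["ra"], r["dec"], ra0, dec0))


# Made-up rows, not real catalog entries
rows = [
    {"designation": "far", "ra": 316.7300, "dec": 38.7420},
    {"designation": "near", "ra": 316.7281, "dec": 38.7407},
]
best = closest_row(rows, 316.728042, 38.740694)
```

The same idea applies to a pandas DataFrame by computing the separation as a column and sorting on it before taking the first row.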
ericasaw commented 2 months ago

I spent some time this afternoon looking into this, and it seems that Irsa.query_region in 0.4.7 builds a TAP query from the input coordinates (which I guess come from the 2MASS identifier name) and then uses Irsa.query_tap to look for the object within the specified radius. It's still unclear to me why the TAP query doesn't return the object as expected; maybe it is the shape I chose to query with (cone)?

Looking through the IRSA VO Table Access Protocol (TAP) Instructions, there is no way to query TAP by name as there is for SIMBAD, which is kind of frustrating. I think the old Irsa.query_region in 0.4.3 worked via requests, but it also seems to have used coordinates instead of names? Looking at the IRSA Catalog Search Service Application Program Interface, it looks like you can feed in names, but the search still seems to use coordinates even when a name is given.

My guess is that the search is now slower than in astroquery 0.4.3 because of IRSA's response time. Given how fast Simbad.query_tap was this afternoon (very fast), it is interesting how slowly Irsa.query_tap seems to work behind the scenes of Irsa.query_region. I'm not sure it is worth my time to build an ADQL query for every object, since that is basically what Irsa.query_region does anyway.
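
Since the search is coordinate-based either way, one option is to skip name resolution and read the position straight from the identifier: a 2MASS designation has the form JHHMMSSss±DDMMSSs, so it encodes the catalog position, limited to the precision of the digits in the name. The sketch below decodes it with the standard library only; `twomass_to_degrees` is a hypothetical helper name, not an astroquery function.

```python
import re

# JHHMMSSss+DDMMSSs: RA to 0.01 s of time, Dec to 0.1 arcsec
_DESIGNATION = re.compile(
    r"^2MASS J(\d{2})(\d{2})(\d{2})(\d{2})([+-])(\d{2})(\d{2})(\d{2})(\d)$"
)


def twomass_to_degrees(designation):
    """Decode a 2MASS designation into (ra_deg, dec_deg)."""
    match = _DESIGNATION.match(designation.strip())
    if match is None:
        raise ValueError(f"not a 2MASS designation: {designation!r}")
    hh, mm, ss, cs, sign, dd, dm, ds, dt = match.groups()
    ra_hours = int(hh) + int(mm) / 60 + (int(ss) + int(cs) / 100) / 3600
    dec_abs = int(dd) + int(dm) / 60 + (int(ds) + int(dt) / 10) / 3600
    dec_deg = -dec_abs if sign == "-" else dec_abs
    return ra_hours * 15.0, dec_deg  # 1 hour of RA = 15 degrees


ra, dec = twomass_to_degrees("2MASS J21065473+3844265")
```

The decoded position is the one 2MASS itself recorded, which may differ from the coordinates a name resolver returns for the same object.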
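
For reference, a cone search of this kind can be written in ADQL by hand. The sketch below only builds the query string, using the standard ADQL CONTAINS/POINT/CIRCLE functions. `cone_search_adql` is a hypothetical helper, the `ra` and `dec` column names are assumed to match the target catalog, and the string is not claimed to be identical to the one Irsa.query_region generates.

```python
def cone_search_adql(ra_deg, dec_deg, radius_arcsec, catalog="fp_psc", columns="*"):
    """Build an ADQL cone-search query string (assumes the catalog has ra/dec columns)."""
    radius_deg = radius_arcsec / 3600.0  # ADQL CIRCLE takes its radius in degrees
    return (
        f"SELECT {columns} FROM {catalog} "
        f"WHERE CONTAINS(POINT('ICRS', ra, dec), "
        f"CIRCLE('ICRS', {ra_deg:.6f}, {dec_deg:.6f}, {radius_deg:.6f})) = 1"
    )


query = cone_search_adql(316.728042, 38.740694, 5.0)
print(query)
```

A string like this could then be passed to Irsa.query_tap, the method mentioned above, which also makes it easier to time the server response separately from the rest of the loop.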

ManonMarchand commented 2 months ago

Perhaps I am making the same mistake here as I was with SIMBAD as @ManonMarchand pointed out

Sorry that I made it sound like a mistake; query_tap has only been available for Simbad since astroquery 0.4.7.

aoberto commented 2 months ago

If we want to dig further into the SIMBAD timing issue with query_objects, it would help to have more details on the columns selected in the output and a list of example names. I just tried 5000 object names (2MASS or not, in SIMBAD or not) and it took about 30 s. But since the new version of astroquery.simbad is about to be released, maybe it is not so necessary to dig here.