django-daiquiri / daiquiri

A framework for the publication of scientific databases
https://escience.aip.de/daiquiri
Apache License 2.0
24 stars 8 forks source link

IMPROVEMENT: ConeSearch #214

Open agy-why opened 1 year ago

agy-why commented 1 year ago

With daiquiri there are 4 ways of running a cone search:

  1. Via IVOA SOAP interface: ConeSearch
  2. Via Web Interface (in app query, JS)
  3. Via TAP with PostgreSQL
  4. Via TAP with ADQL

All of these ways lead to various performances and/or results.

IVOA SOAP

https://github.com/django-daiquiri/daiquiri/blob/master/daiquiri/conesearch/adapter.py#L17

The data access is done directly via the DatabaseAdapter, with a query defined in the conesearch adapter. Therefore the performances and results will depend on the pre-defined query.

Web Interface

https://github.com/django-daiquiri/daiquiri/blob/master/daiquiri/query/static/query/js/forms/cone.js

The data access is done via the REST API of the query-app. The query is defined within the JS files, and may differ from the IVOA Soap query. Again performances and results will depend on the query.

Note: per default only ra, dec, and search radius are available parameter, the official verb (the list of columns) and resource (the table to query) values are not available, but can be added within the adapter.

TAP PostgreSQL

The TAP service also allow the user to submit queries carrying out a cone search. In PostgreSQL they look like:

SELECT id, ra, dec, <OTHER-COL>,
       DEGREES( SPOINT( RADIANS(ra), RADIANS(dec)) <-> SPOINT(RADIANS(<RA>), RADIANS(<DEC>))) as adist
  FROM <RESOURCE>
 WHERE SPOINT( RADIANS(ra), RADIANS(dec)) @ SCIRCLE(SPOINT( RADIANS(<RA>), RADIANS(<DEC>)), RADIANS(<SRADIUS>))

This type of cone search are quite slow, and will always require LONG Queue. For this reason we usually provide an additional column for the conesearchable tables: pos. This column is of type SPOINT which is not a native PostgreSQL type but is declared by the Postgres module PGIST. When the pos column is indexed, one can build very efficient queries like:

SELECT id, ra, dec, <OTHER-COL>,
       DEGREES( pos <-> SPOINT(RADIANS(<RA>), RADIANS(<DEC>))) as adist
  FROM <RESOURCE>
 WHERE pos @ SCIRCLE(SPOINT( RADIANS(<RA>), RADIANS(<DEC>)), RADIANS(<SRADIUS>))

From experience, this queries will run about 20 time faster than the precedent.

TAP ADQL

 SELECT source_id, ra, dec, <OTHER-COL>,
        DISTANCE( POINT('ICRS', ra, dec), POINT('ICRS', <RA>, <DEC>) AS dist
   FROM <RESOURCE>
  WHERE 1 = CONTAINS( POINT('ICRS', ra, dec), CIRCLE('ICRS', <RA>, <DEC>, <SRADIUS>))

This will be translated into PostgreSQL as:

SELECT id, ra, dec, <OTHER-COL>,
       DEGREES( SPOINT( RADIANS(ra), RADIANS(dec)) <-> SPOINT(RADIANS(<RA>), RADIANS(<DEC>))) as adist
  FROM <RESOURCE>
 WHERE SPOINT( RADIANS(ra), RADIANS(dec)) @ SCIRCLE(SPOINT( RADIANS(<RA>), RADIANS(<DEC>)), RADIANS(<SRADIUS>))

Which is an issue since it does not make use of the indexed pos columns, and lead to much longer query time.

Issues

  1. There are two access points for the IVOA Soap and Web ConeSearch, leading to different results.
  2. The Performance of the TAP ConeSearch depends on the language used, and the use of the pos column.

Possible solutions

  1. Unify the access point to science data: see #213
  2. In case a pos column is available improve translation from ADQL to Postgres and replace SPOINT( RADIANS(ra), RADIANS(dec)) by pos.
agy-why commented 1 year ago

The translation of a query is done here: https://github.com/django-daiquiri/daiquiri/blob/master/daiquiri/query/process.py#L123

at this place: https://github.com/django-daiquiri/daiquiri/blob/master/daiquiri/query/models.py#L150

The basic idea to catch the possibility of adapting the query to the specific DB, like in the case of a pos column would be to add a method called: optimize_for_db that takes translated_query as argument and uses settings to optimize the query.

In the case of cone search we could imagine a declaration of optimizable tables (the ones with a pos column) and a mapping between the name of the table and its pos attributes (name of the ra and dec columns).

CONESEARCH_TABLES = [
    {"table_name": "gaiadr3"."gaia_source",
     "ra_col": "ra",
     "dec_col": "dec",
    },
    ...
]

This could be then used alike:

            # translate the query from adql
            translated_query = translate_query(self.query_language, self.query)

            # log the translated query to the debug log
            logger.debug('translated_query = "%s"', translated_query)

            # optimize the query
            optimized_query = optimize_query(self.query)

            processor = process_query(optimized_query)

With

            def optimize_query(query):
                '''Optimize a translated query to better use the features of the Database
                '''
                for table in query.tables:
                    if table in [table['table_name'] for table in CONESEARCH_TABLES]:
                        ra_col = table['ra_col']
                        dec_col = table['dec_col']
                        if "SPOINT(RADIANS(" + ra_col + "), RADIANS(" + dec_col + "))" in query.query_string:
                             replace(query, "SPOINT(RADIANS(" + ra_col + "), RADIANS(" + dec_col + "))", "pos")

Obviously one need to be careful in case of table renaming: FROM table as t1 and other details of the implementation, this suggestion is surely not the best one.