astropy / pyvo

An Astropy affiliated package providing access to remote data and services of the Virtual Observatory (VO) using Python.
https://pyvo.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

UnicodeDecodeError, ascii codec can't decode byte #346

Open retifrav opened 2 years ago

retifrav commented 2 years ago

OS: Mac OS 12.5
Python: 3.9.13
PyVO: 1.3

PyVO raises an exception with the following query:

import pyvo

service = pyvo.dal.TAPService("https://gea.esac.esa.int/tap-server/tap")
results = service.search(
    " ".join((
        "SELECT table_name, description",
        "FROM tap_schema.tables",
        "WHERE table_name = 'gaiadr2.dr1_neighbourhood'"
    ))
)

The exception:

Traceback (most recent call last):
  File "/tmp/pyvo-encoding.py", line 4, in <module>
    results = service.search(
  File "/usr/local/lib/python3.9/site-packages/pyvo/dal/tap.py", line 246, in run_sync
    return self.create_query(
  File "/usr/local/lib/python3.9/site-packages/pyvo/dal/tap.py", line 942, in execute
    return TAPResults(self.execute_votable(), url=self.queryurl, session=self._session)
  File "/usr/local/lib/python3.9/site-packages/pyvo/dal/query.py", line 245, in execute_votable
    raise DALFormatError(e, self.queryurl)
pyvo.dal.exceptions.DALFormatError: UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 1317: ordinal not in range(128)

If I query the same thing with a bare cURL:

$ curl -X "POST" "https://gea.esac.esa.int/tap-server/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=JSON&QUERY=SELECT%20table_name,%20description%20FROM%20tap_schema.tables%20WHERE%20table_name%20=%20%27gaiadr2.dr1_neighbourhood%27"

then the result is the following:

{
    "metadata":
    [
        {
            "name": "table_name",
            "datatype": "char",
            "xtype": null,
            "arraysize": "*",
            "description": null,
            "unit": null,
            "ucd": null,
            "utype": null
        },
        {
            "name": "description",
            "datatype": "char",
            "xtype": null,
            "arraysize": "*",
            "description": null,
            "unit": null,
            "ucd": null,
            "utype": null
        }
    ],
    "data":
    [
        [
            "gaiadr2.dr1_neighbourhood",
            "Users wishing to look up the DR2 record for an astrophysical source\nidentified in DR1 must NOT simply extract the record from DR2 having the\nsame source identifier.\n\nAs described in the detailed description of attribute designation in\nGaiaSource it is not guaranteed that the same astronomical source will\nalways have the same source identifier in different Data Releases. Hence\nthe only safe way to compare source records between different Data\nReleases in general is to check the records of proximal source(s) in the\nsame small part of the sky. This table provides the means to do this via\na precomputed crossmatch of such sources, taking into account the proper\nmotions available at DR2.\n\nWithin the neighbourhood of a given DR2 source there may be none, one or\n(rarely) several possible counterparts in DR1 indicated by rows in this\ntable. This occasional source confusion was introduced during the DR1\nprocessing which used an earlier version of the software for matching of\ntransit observations to unique astrophysical sources. The subsequent\nmerging, splitting and deletion of identifiers introduced at DR1 during\nthe DR2 processing means there is no guaranteed one–to–one\ncorrespondence in source identifiers between the releases.\n\nFor more details of the procedure used to create this crossmatch, see\nSection [chap:xmdr1] in the online documentation."
        ]
    ]
}

You can't see it here, as the symbol is kind of invisible, but the problematic place does have this <0xa0> symbol in Section<0xa0>[chap:xmdr1], which I guess is what is causing the problem.

If the TAP specification requires all services to return results in ASCII only, then this is certainly the Gaia service's fault and not PyVO's. Even so, it would be useful to be able to specify the encoding used for reading the results (I'm assuming that the same string would read fine as UTF-8), as right now it seems to be "hardcoded" to ASCII.
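As a small sanity check of the encoding assumption, the sketch below decodes a short byte string containing the single byte 0xa0 reported in the traceback. The sample bytes and the helper `try_decode` are made up for illustration; which encoding the Gaia service actually used for the response is not established in this thread.

```python
# Hypothetical sample: the single byte 0xa0 between "Section" and "[chap:xmdr1]".
raw = b"Section\xa0[chap:xmdr1]"


def try_decode(data, encoding):
    """Decode `data`, returning the text or a short description of the failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"{type(exc).__name__}: {exc.reason}"


ascii_result = try_decode(raw, "ascii")      # fails: byte is outside range(128)
utf8_result = try_decode(raw, "utf-8")       # also fails: a lone 0xa0 is not valid UTF-8
latin1_result = try_decode(raw, "latin-1")   # succeeds: 0xa0 maps to U+00A0

# For comparison, a non-breaking space encoded as UTF-8 takes two bytes.
utf8_nbsp = "Section\u00a0[chap:xmdr1]".encode("utf-8")

print(ascii_result)
print(utf8_result)
print(repr(latin1_result))
print(utf8_nbsp)
```

A lone 0xa0 byte decodes as Latin-1 but not as UTF-8, where a non-breaking space is the two-byte sequence 0xc2 0xa0. A configurable encoding would therefore only help if it matched what the service really sends.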

msdemlei commented 2 years ago

On Thu, Aug 04, 2022 at 07:01:47AM -0700, retif wrote:

> You can't see it here, as the symbol is kind of invisible, but the problematic place does have this <0xa0> symbol in Section<0xa0>[chap:xmdr1], which I guess is what is causing the problem.

Yeah, non-breaking space.

> If TAP specification dictates all the services to return results in

It's not TAP directly, it's VOTable that still says, in effect, "char is ASCII only". What you see here is (I say that without actually having followed it) an exception re-raised from within the deep bowels of Astropy's VOTable parser.

I grant you the message could be a bit more graceful ("The operators have stuck non-ASCII into a char field. Scold them") -- or we could do client-side what DaCHS does server-side in such cases: replace all non-ASCII with question marks. Perhaps we should raise a bug against Astropy regarding that? I don't think a patch to that effect would be hard to do, except that we have >= 3 serialisations, the behaviours of which would have to be kept in sync.
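The "replace all non-ASCII with question marks" idea could look roughly like the sketch below. This is not Astropy's or DaCHS's actual code; the helper name `asciify` is hypothetical and only illustrates the substitution on plain Python bytes and strings.

```python
def asciify(value):
    """Return `value` as str with every non-ASCII byte or character replaced by '?'.

    Hypothetical helper for illustration only.
    """
    if isinstance(value, bytes):
        # Byte-wise replacement: every byte >= 0x80 becomes '?'.
        # A multi-byte UTF-8 sequence therefore yields one '?' per byte.
        return bytes(b if b < 0x80 else 0x3F for b in value).decode("ascii")
    # Character-wise replacement: the "replace" error handler emits '?'
    # for each character that cannot be encoded as ASCII.
    return value.encode("ascii", errors="replace").decode("ascii")


print(asciify(b"Section\xa0[chap:xmdr1]"))
print(asciify("one\u2013to\u2013one"))
```

The replacement is lossy, so a client doing this would probably also want to warn the user that the service sent data outside the declared character set.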

On the VO side, we could also finally decree that VOTable char should allow UTF-8, which has been brought up in the IVOA now and then (cf. http://mail.ivoa.net/pipermail/apps/2014-August/000968.html ff). I'd argue in favour of it again if someone brought it to the apps list. After so many years, I reckon the opponents may have reconsidered.

Be that as it may: pyVO can do nothing about it. ESAC needs to fix their service, either making sure there's only ASCII in their descriptions or perhaps declaring their description column unicodeChar(). I think that would be ok by TAP, which says description must have the data type "string"; to explain that, it says "implementers may choose an appropriate data type that behaves the same way in queries and output (e.g. varchar(16) or varchar(64) for string...)". I'd read that as including unicodeChar(), and I'd be surprised if anything in pyVO had a problem with that.

Who will take that to ESAC?

bsipocz commented 1 year ago

@jespinosaar - pinging you, as this non-ASCII character in the response needs to be handled upstream: either ESAC fixes it or, if you need it on the server side, the change to the standard has to be pushed through the IVOA.

jespinosaar commented 1 year ago

Hi @bsipocz , we will check what is happening, thanks for reporting!

cosmoJFH commented 3 months ago

Thank you @retifrav for this information. If you open the JSON file (the one downloaded by curl) with vi and set the following options, you can see the offending character:

:set listchars=nbsp:×,tab:\ \ ,trail:\ ,
:set list

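For those who prefer not to use vi, a few lines of Python can list the non-ASCII characters in a string such as the description returned by curl. This is a minimal sketch; the function name `find_non_ascii` and the sample string are made up for illustration.

```python
import unicodedata


def find_non_ascii(text):
    """Return (index, code point, Unicode name) for each non-ASCII character."""
    return [
        (index, f"U+{ord(char):04X}", unicodedata.name(char, "UNKNOWN"))
        for index, char in enumerate(text)
        if ord(char) > 127
    ]


# Hypothetical sample modelled on the end of the description text.
sample = "see Section\u00a0[chap:xmdr1] in the online documentation."
for hit in find_non_ascii(sample):
    print(hit)
```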

We have checked that the following catalogues would generate the same error:

  1. gaiadr2.dr1_neighbourhood
  2. gaiadr3.apassdr9_join
  3. gaiadr3.commanded_scan_law
  4. gaiadr3.dr2_neighbourhood
  5. gaiadr3.frame_rotator_source
  6. gaiadr3.tycho2tdsc_merge
  7. gaiaedr3.apassdr9_join
  8. gaiaedr3.commanded_scan_law
  9. gaiaedr3.dr2_neighbourhood
  10. gaiaedr3.frame_rotator_source
  11. gaiaedr3.tycho2tdsc_merge
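A check like this can be scripted against the JSON output shown earlier in the thread. The sketch below assumes a payload with the same "metadata"/"data" layout and with rows ordered as (table_name, description); the function name and the sample table names are hypothetical, and it works on an already downloaded string rather than querying the service.

```python
import json


def tables_with_non_ascii_descriptions(payload):
    """Return the names of tables whose description contains non-ASCII characters.

    Assumes each row in payload["data"] is [table_name, description].
    """
    document = json.loads(payload)
    flagged = []
    for table_name, description in document["data"]:
        # Descriptions may be null in the JSON; skip those.
        if description is not None and not description.isascii():
            flagged.append(table_name)
    return flagged


# Made-up payload in the same shape as the curl output.
sample_payload = json.dumps({
    "metadata": [{"name": "table_name"}, {"name": "description"}],
    "data": [
        ["example.clean_table", "Plain ASCII description."],
        ["example.nbsp_table", "See Section\u00a0[chap:xmdr1]."],
        ["example.no_description", None],
    ],
})

flagged = tables_with_non_ascii_descriptions(sample_payload)
print(flagged)
```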