dkirkby / bossdata

Tools for accessing SDSS BOSS data
MIT License

Improve bossquery ascii save format #66

Open dkirkby opened 9 years ago

dkirkby commented 9 years ago

The following query produces output that is difficult to read back in:

% bossquery --where 'PLATE=4979 and MJD=56045 and FIBER between 760 and 765' --what 'RUN,RERUN,CAMCOL' --print
RUN  RERUN CAMCOL
---- ----- ------
2964   301      3
3958   301      3
   0            0
3965   301      2
3958   301      3
3965   301      2

Assuming a .txt or .dat extension, the saved file is written by astropy.table using:

table.write(args.save, format='ascii')

and contains:

RUN RERUN CAMCOL
2964 301 3
3958 301 3
0  0
3965 301 2
3958 301 3
3965 301 2

Unfortunately, Table.read(..., format='ascii') is unable to read this back because it reasonably deduces that the second column is an integer when it is actually a string.
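As a rough illustration of why the blank entry is fragile in a space-separated file, here is a minimal sketch using only plain Python string splitting. It does not call astropy and makes no claim about how astropy's reader tokenizes lines internally; the `lines` list is copied from the saved file above.

```python
# Lines copied from the saved file shown above (header, a normal row, the blank-RERUN row).
lines = [
    "RUN RERUN CAMCOL",
    "2964 301 3",
    "0  0",
]

# Splitting on runs of whitespace loses the empty middle field.
tokens_per_line = [len(line.split()) for line in lines]
print(tokens_per_line)  # [3, 3, 2]

# Splitting on a single space keeps it, but only because exactly one
# space separates each pair of fields in this particular file.
explicit = lines[2].split(" ")
print(explicit)  # ['0', '', '0']
```

An explicit separator or explicit quoting avoids depending on the exact number of spaces.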

dkirkby commented 9 years ago

If we decide to go with something like CSV, the same change should probably be made to the default output from bossdata ... --save-data ....

dcunning11235 commented 9 years ago

There doesn't seem to be an option I can find to simply force quoting of all text data. A hacky solution with some drawbacks:

On output, insert

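# Wrap every string-valued column in single quotes on output.
for name in table.colnames:
    # kind 'S' is a byte string, 'U' is unicode (the default str type on Python 3)
    if table[name].dtype.kind in ('S', 'U'):
        table[name].format = "'%s'"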

before the current

table.write(args.save, format='ascii')

All strings get single quoted. But then reading in has to be changed to:

table = astropy.table.Table.read(..., format='ascii', quotechar='\'')

This last bit is annoying, and it works around the API rather than through it...
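For comparison, the same "quote every string" idea can be sketched with Python's standard csv module instead of astropy. This is a swapped-in technique for illustration only, not what bossquery does; the row values are taken from the example above, with RERUN treated as a string column.

```python
import csv
import io

# RUN and CAMCOL are integers, RERUN is a string (empty in the third row).
rows = [
    ["RUN", "RERUN", "CAMCOL"],
    [2964, "301", 3],
    [0, "", 0],
]

# QUOTE_NONNUMERIC makes the writer quote every non-numeric field.
options = dict(delimiter=" ", quotechar="'", quoting=csv.QUOTE_NONNUMERIC)

buffer = io.StringIO()
csv.writer(buffer, **options).writerows(rows)
text = buffer.getvalue()
print(text)

# With the same options, the reader converts unquoted fields to float
# and leaves quoted fields as strings.
recovered = list(csv.reader(io.StringIO(text), **options))
print(recovered)
```

The empty RERUN survives as `''`, but the reader needs the same non-default options as the writer, which is the same drawback as the quotechar workaround above.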

dkirkby commented 9 years ago

Ideally, the default format should be a good compromise between human- and machine-readable. If we only had to deal with numeric values, the existing format would be OK. Perhaps a standard separator (e.g., CSV) is a better compromise than quoting all strings, if quoting requires special reader args or unquoting logic (especially if a string has embedded quotes). I wonder how CSV handles strings with embedded commas?
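On the embedded-comma question, here is a quick check with Python's standard csv module (not astropy's CSV writer, which may differ in detail). The values containing a comma and a double quote are hypothetical, added only to exercise the quoting rules.

```python
import csv
import io

rows = [
    ["RUN", "RERUN", "CAMCOL"],
    ["2964", "301", "3"],
    ["0", "", "0"],           # empty RERUN, as in the query output above
    ["3965", "30,1", "2"],    # hypothetical value with an embedded comma
    ["3958", 'a "b"', "3"],   # hypothetical value with an embedded quote
]

# Default dialect: fields are quoted only when they need to be.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
text = buffer.getvalue()
print(text)

# Reading back with default options recovers the original strings.
recovered = list(csv.reader(io.StringIO(text)))
print(recovered == rows)  # True
```

Fields containing the separator are wrapped in double quotes, embedded double quotes are doubled, and the empty field is written as nothing between two commas. The csv module returns every field as a string, so deciding whether RERUN is an integer or a string is still left to whatever reads the file.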