astropy / astropy

Astronomy and astrophysics core library
https://www.astropy.org
BSD 3-Clause "New" or "Revised" License

ASCII file formats. #3025

Closed demitri closed 9 years ago

demitri commented 10 years ago

I was just working with someone who wanted to read an ASCII file and convert to a FITS file. Based on the documentation here http://docs.astropy.org/en/v0.4.2/io/unified.html#ascii-formats, he created a file that started with this (more or less):

   RA        Dec   
-------- ----------
0.000226   8.371668
0.000864  61.252617
0.000929 -19.413921

The code was basically:

from astropy.table import Table

# format='ascii' lets io.ascii guess the layout of the file
t = Table.read('test.dat', format='ascii')
t.write('test.fits', format='fits')

This was the resulting FITS file

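[image: mail attachment] https://cloud.githubusercontent.com/assets/64314/4599516/55573c64-50c7-11e4-92d9-fc6d998e7baa.png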

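Questions:

  • Is the resulting FITS file valid? The data type of the first row is different than the rest (and certainly not what was intended).

  • The documentation says the ascii format is "ASCII table in any supported format (uses guessing)". Is "supported format" defined somewhere?

  • Does this mean the code tries to guess which of the formats in the table the file is, or something else? If the former, maybe changing that line in the docs to "ASCII table in any supported format below" might clear it up a bit.

Since loadtxt is extremely configurable, is there a few-line example for how to use that and then convert to an astropy.table.Table object?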

I'm trying to understand the problem before suggesting a fix!

embray commented 10 years ago

This is a valid FITS file, but I don't think it's what they wanted, because the columns are being interpreted as, and written back out as string columns. That is, it's interpreting the '----------' lines as part of the data.

I think what you want here is

>>> t = Table.read('table.dat', format='ascii.fixed_width_two_line')
>>> t
<Table rows=3 names=('RA','Dec')>
array([(0.000226, 8.371668), (0.000864, 61.252617), (0.000929, -19.413921)], 
      dtype=[('RA', '<f8'), ('Dec', '<f8')])

But I can also see that there's room to improve the guessing heuristics here.

embray commented 10 years ago

This is also a very useful page in the docs: http://docs.astropy.org/en/v0.4.2/io/ascii/fixed_width_gallery.html#fixed-width-gallery

taldcroft commented 10 years ago

On Fri, Oct 10, 2014 at 5:57 PM, Demitri Muna notifications@github.com wrote:

I was just working with someone who wanted to read an ASCII file and convert to a FITS file. Based on the documentation here http://docs.astropy.org/en/v0.4.2/io/unified.html#ascii-formats, he created a file that started with this (more or less):

   RA        Dec
-------- ----------
0.000226   8.371668
0.000864  61.252617
0.000929 -19.413921

When creating an ASCII table from scratch you are best off using a CSV-type format, probably space-delimited like this:

RA Dec
0.000226 8.371668
0.000864 61.252617
0.000929 -19.413921

The example in the link shows the IPython string representation of a Table object. It's not intended as an example of how to create an ASCII table file, though I can see how a beginning user might go in that direction.

The code was basically:

from astropy.table import Table
t = Table.read('test.dat', format='ascii')
t.write('test.fits', format='fits')

This was the resulting FITS file

[image: mail attachment] https://cloud.githubusercontent.com/assets/64314/4599516/55573c64-50c7-11e4-92d9-fc6d998e7baa.png

Questions:

  • Is the resulting FITS file valid? The data type of the first row is different than the rest (and certainly not what was intended).

No.

  • The documentation says the ascii format is "ASCII table in any supported format (uses guessing)". Is "supported format" defined somewhere?

Yes: http://docs.astropy.org/en/v0.4.2/io/ascii/index.html#id1

  • Does this mean the code tries to guess which of the formats in the table the file is, or something else? If the former, maybe changing that line in the docs to "ASCII table in any supported format below" might clear it up a bit.

Guessing is explained at: http://docs.astropy.org/en/v0.4.2/io/ascii/read.html#guess-table-format

Perhaps a user like yourself with a fresh eye would be the best to make a docs PR to clarify things. This would be very helpful.

Since loadtxt is extremely configurable, is there a few-line example for how to use that and then convert to an astropy.table.Table object?

There are two ways to directly read the table in io.ascii:

t = Table.read('test.dat', format='ascii', data_start=2)
t = Table.read('test.dat', format='ascii.fixed_width_two_line')

See also: http://docs.astropy.org/en/v0.4.2/io/ascii/read.html#parameters-for-read

I'm trying to understand the problem before suggesting a fix!

Cheers, Tom

embray commented 10 years ago

I missed the part where the file was created by hand that way in the first place, which is no good. Not that it's a terrible table format, but let's not make more hand-crafted fixed width tables with arbitrary formats if we can avoid it :)

At the same time, it seems to me like a format that should be guessable more easily, though I understand why it doesn't currently work. It's also a little tricky since hyphens are sometimes used to represent missing values (though if they only occur in one row and happen to be the same as the column width I think they could be assumed to be headers of some sort unless otherwise specified).

demitri commented 10 years ago

Right, this ticket was more about the documentation than the specifics of the ASCII parser. In this case, the file was formatted that way based on the documentation.

I think opening the door to guessing arbitrary ASCII formats is a way to madness. I'd be more in favor of putting support behind a limited number of formats and having people just conform to that. I'm a big fan of

# any number of comment
# lines here that are ignored
col_name1 col_name2
1234 5678

It's fast to parse, self-documenting, and, I'd think, accepted in many different places (e.g. R). I'd rather support a smaller number of formats and show people how to subclass their own ASCII parser.

It's on my list to come back to this and make a doc pull request to clarify things.

mhvk commented 10 years ago

@demitri - for a format similar to what you would like, see @taldcroft's nice work #2319 and a sample at http://nbviewer.ipython.org/gist/taldcroft/a13b670ab15db5684f49

hamogu commented 9 years ago

@demitri : In fact, io.ascii is doing exactly what you describe as a "way to madness" - by default it tries to guess arbitrary ascii formats and it's actually pretty good at that, but not perfect. Thus, the user can write any ascii file that is "sensible" (space separated, comma separated, separated by |, fixed width tables, CDS tables, LaTeX, ... ), throw that file at io.ascii and it will be read. That is why the documentation that you cited only says "supported format" and is not more specific - going through the documentation you will see a subset of all formats that io.ascii understands and any one of those will be read.

Some formats work better than others, e.g. the first example on the documentation page http://astropy.readthedocs.org/en/latest/io/ascii/index.html has only column names and values (no extra strings like ------)

obsid redshift  X      Y     object
3102  0.32      4167  4085   Q1250+568-A
877   0.22      4378  3892   "Source 82"

but even the example that your user had was read - io.ascii just did not know if ------- is a string value that it should read or if it should ignore this line. So, it did the safe thing and read as much as it could (never throw out a line unless the user says so). As @taldcroft said, in this case, the guessing process can be improved by telling io.ascii that the data starts one line further down (data_start=2).

Now, we know that not everybody likes this approach and that's fine. For people who don't want io.ascii to guess (or people who cannot afford that for performance reasons), there is the parameter guess=False. With guess=False, io.ascii behaves a lot more like the traditional np.loadtxt in that you need to manually specify the delimiter between values (default is space), the character that marks comment lines (default is #), etc. - see http://astropy.readthedocs.org/en/latest/io/ascii/read.html#parameters-for-read

As you say, if somebody writes a new ascii file from scratch there might be better and worse ways to do it, but the point of io.ascii is to deal with all of them - because if you can think of any idiosyncratic format, I promise some astronomer will have a file like that on his/her disk.

If you find a better way to express this motivation behind io.ascii in the documentation that would be great because it is quite different from the way most other ascii readers work! Personally, I believe that this is a very powerful feature given the gazillions of ascii table formats that are already out there.

hamogu commented 9 years ago

Just for the record: astropy.io.ascii is considered mature and stable in API, so no major changes in the concept "guess everything" should be done. http://astropy.readthedocs.org/en/latest/stability.html

embray commented 9 years ago

@hamogu Thanks for the excellent synopsis.

My thinking on how the guessing could be improved is to recognize that if a column contains all numeric values except at the very top, where there is something vaguely "horizontal rule"-looking (for example a string the width of the column made up entirely of '-' or '=' or even '_'), it could probably guess that that's part of the header. We could throw in all kinds of other heuristics like that too and I don't see any harm in doing so if someone wanted to spend the time (it wouldn't affect the API).
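A toy version of that heuristic, written as plain Python and not as astropy code, could look like this. The function names `looks_like_rule`, `is_number` and `rule_row_index` are made up for the sketch, and whitespace splitting is an assumption about the input.

```python
def looks_like_rule(token, rule_chars="-=_"):
    """True if token is non-empty and made of one repeated rule character."""
    return len(token) > 0 and len(set(token)) == 1 and token[0] in rule_chars


def is_number(token):
    """True if token parses as a float."""
    try:
        float(token)
    except ValueError:
        return False
    return True


def rule_row_index(lines):
    """Return 1 if line 1 is a rule row above all-numeric data, else None.

    lines[0] is taken to be the column names.
    """
    rows = [line.split() for line in lines]
    if len(rows) < 3:
        return None
    # Every token on the second line must look like a horizontal rule.
    if not rows[1] or not all(looks_like_rule(tok) for tok in rows[1]):
        return None
    # Everything below it must be numeric, otherwise the dashes may be data.
    if not all(is_number(tok) for row in rows[2:] for tok in row):
        return None
    return 1


sample = [
    "   RA        Dec",
    "-------- ----------",
    "0.000226   8.371668",
    "0.000864  61.252617",
]
print(rule_row_index(sample))  # 1
```

Only the second line is inspected, so a lone '-' used as a missing value further down the table is never mistaken for a header rule; it simply makes the check return None.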

Certainly if someone knows exactly what the table format looks like they should set guess=False and specify all the appropriate parameters (that has the added benefit of possibly catching mistakes in the formatting of your data, for example if someone tweaked something by hand and left it malformatted).

demitri commented 9 years ago

The aim of this ticket was not to address or change the behavior of astropy.io.ascii, it was to request documentation clarification. I would do it myself, but I'm just swamped at the moment (and partly didn't know how it all worked... see point one). Which is not to say I won't.

Reading this thread though I would definitely recommend changing the default value of guess to be false. I think that should be an opt-in thing; the original example above being a good case. If there is guesswork going on, one can imagine a number of cases where a valid result would be guessed, but is not what was intended. The result is that the code executes without error, but the results are not what are expected or assumed. By opting in, the user is explicitly acknowledging the uncertainty (it's a "least surprise" thing).

but even the example that your user had was read - io.ascii just did not know if ------- is a string value that it should read or if it should ignore this line. So, it did the safe thing and read as much as it could (never throw out a line unless the user says so).

Right - safe with respect to the parser, but not what the user intended. There are two domains here (and definitions of the word "safe"!). Making guess default to false places the higher priority on the domain of the user. If I had to put in guess=True in code, the first thing I would do is check to see if the code guessed what I intended. If it did, then I'm happy, it's saved me time, and I can move on.

Personally, I believe that this is a very powerful feature given the gazillions of ascii table formats that are already out there.

I agree 100%! Don't get me wrong - I think it's great that the code does its level best to guess at a possible format (I wouldn't have done it!). But how the guessing is done is not what this ticket is about.

My thinking here in how the guessing could be improved is to recognize that if a column contains all numeric values except at the very top, which is something vaguely "horizontal rule"-looking, for example a string the width of the columns of all '-' or '=' or even '_' it could probably be able to guess that that's part of the header.

This falls for me under the principle of "if it's obvious and unambiguous to any astronomer looking at it, the code should handle it". A row with "-" or "----" would be visually interpreted as a divider, so if someone wants to modify the guesser, I'd support that. In another ticket. :)

taldcroft commented 9 years ago

If this is a common enough case we could introduce a new format header_line or something (similar to the current fixed_width_two_line) that requires a header with column names and then a line consisting purely of -, = or whitespace. Putting that high in the guess list should not impact performance too much since it would typically fail out when the second header line isn't found. There is room for some minor additional heuristics in deciding if the header matches, but we could just start with this simple format specification.
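To show why such a format would "fail out" cheaply, here is a standalone sketch of the idea in plain Python. It is not an astropy reader: `read_header_line_table` is a hypothetical name, the dict-of-lists return value is a simplification, all values are assumed to be floats, and an all-whitespace second line is rejected as a design choice of the sketch.

```python
def read_header_line_table(lines):
    """Parse a name line, a rule line, then whitespace-separated rows.

    Raises ValueError as soon as the second line is not a pure rule line,
    which is what would let a guesser move on to the next format quickly.
    """
    if len(lines) < 2:
        raise ValueError("need a name line and a rule line")
    names = lines[0].split()
    rule = lines[1].strip()
    if not rule or set(rule) - set("-= \t"):
        raise ValueError("second line is not a rule of '-', '=' or whitespace")

    columns = {name: [] for name in names}
    for line in lines[2:]:
        values = line.split()
        if len(values) != len(names):
            raise ValueError(f"expected {len(names)} values, got {len(values)}")
        for name, value in zip(names, values):
            # float() raises ValueError for non-numeric text (sketch only).
            columns[name].append(float(value))
    return columns


lines = [
    "   RA        Dec",
    "-------- ----------",
    "0.000226   8.371668",
    "0.000864  61.252617",
    "0.000929 -19.413921",
]
table = read_header_line_table(lines)
print(table["Dec"])  # [8.371668, 61.252617, -19.413921]
```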

embray commented 9 years ago

@taldcroft I think that's probably a good idea. It is after all the default format that we print simple tables in, and several examples in the docs use it (see for example http://docs.astropy.org/en/stable/table/construct_table.html#numpy-structured-array)

I could see users even printing a table to the terminal and then copy-pasting it into a file. Not the right approach of course but doesn't seem unlikely either. In any case I think this format should be guessable.

taldcroft commented 9 years ago

I could see users even printing a table to the terminal and then copy-pasting it into a file.

That's roughly what started this whole thread. See #3099.

hamogu commented 9 years ago

I know @taldcroft committed to more other things in the handout yesterday, so please assign this issue to me. I'll come up with a reader for that before the 1.0 feature freeze.