astropy / astropy

Astronomy and astrophysics core library
https://www.astropy.org
BSD 3-Clause "New" or "Revised" License
4.39k stars 1.75k forks

ECSV should not read whole file to determine if file is a valid ECSV file #7976

Open astrofrog opened 5 years ago

astrofrog commented 5 years ago

At the moment it looks like the ECSV reader in io.ascii reads the whole file before checking that the first line contains # %ECSV <version>. For example, reading a 200 MB file (which is not an ECSV file) takes almost 2 seconds:

read('big.csv', format='ecsv')
---------------------------------------------------------------------------
InconsistentTableError                    Traceback (most recent call last)
<timed eval> in <module>()

~/Dropbox/Code/Astropy/astropy/astropy/io/ascii/ui.py in read(table, guess, **kwargs)
    402         else:
    403             reader = get_reader(**new_kwargs)
--> 404             dat = reader.read(table)
    405             _read_trace.append({'kwargs': copy.deepcopy(new_kwargs),
    406                                 'Reader': reader.__class__,

~/Dropbox/Code/Astropy/astropy/astropy/io/ascii/core.py in read(self, table)
   1161 
   1162         # Get the table column definitions
-> 1163         self.header.get_cols(self.lines)
   1164 
   1165         # Make sure columns are valid

~/Dropbox/Code/Astropy/astropy/astropy/io/ascii/ecsv.py in get_cols(self, lines)
    117 
    118         if not lines:
--> 119             raise core.InconsistentTableError(no_header_msg)
    120 
    121         match = re.match(ecsv_header_re, lines[0].strip(), re.VERBOSE)

InconsistentTableError: ECSV header line like "# %ECSV <version>" not found as first line.  This is required for a ECSV file.

We should be able to speed this up significantly by reading just the first line to check this.

Note that this is different from https://github.com/astropy/astropy/issues/7977 - here I'm suggesting that readers should be able to do simple checks such as literally just reading the first one or two lines, or doing some other arbitrary checks, before proceeding. The issue here is also independent of the guessing infrastructure. Even when specifying the format, the above takes a while.
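The proposed pre-check could look roughly like the following: a minimal sketch (not the astropy implementation) that reads only the first line and tests it against a simplified version of the ECSV header pattern. The helper name `looks_like_ecsv` is hypothetical.

```python
import io
import re

# Simplified pattern for the required first line, e.g. "# %ECSV 0.9"
# (the real pattern lives in astropy/io/ascii/ecsv.py).
ECSV_HEADER_RE = re.compile(r"#\s*%ECSV\s+\d+\.\d+")

def looks_like_ecsv(fileobj):
    """Cheap pre-check: inspect only the first line, not the whole file."""
    first_line = fileobj.readline()
    return ECSV_HEADER_RE.match(first_line.strip()) is not None

# A 200 MB CSV would fail this check after reading a single line.
assert looks_like_ecsv(io.StringIO("# %ECSV 0.9\n# ---\na b\n1 2\n"))
assert not looks_like_ecsv(io.StringIO("a,b\n1,2\n"))
```

The point is that the check consumes one line from the file object, so rejecting a non-ECSV file costs one `readline` rather than a full read.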

astrofrog commented 5 years ago

@taldcroft - this seems to be something that applies to other readers too, for example:

 {'kwargs': {'Reader': astropy.io.ascii.fixedwidth.FixedWidthTwoLine,
   'fast_reader': {'enable': True},
   'fill_values': [('', '0')],
   'strict_names': True},
  'status': 'InconsistentTableError: Position line should only contain delimiters and one other character, e.g. "--- ------- ---".',
  'dt': '1779.793 ms'},

Same goes for e.g. IPAC, where reading one line would be sufficient to trigger an error in this case.

Just to be clear, I realize that for some formats the whole file will need to be read, but there are some formats where the first line or two are enough to trigger an error. So is the issue that the lines are all read in immediately rather than having some kind of generator?

Can we have some kind of arbitrary pre-checks that readers can choose to do before parsing/splitting all the lines?
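One way to support such pre-checks without reading everything is to peek at the first few lines of a lazy iterator and then hand back an iterator over the full sequence. A sketch under that assumption (the helper `peek_lines` is hypothetical, not an astropy API):

```python
import io
from itertools import chain, islice

def peek_lines(line_iter, n):
    """Materialize only the first n lines for a format pre-check, and
    return an iterator that still yields the complete sequence."""
    head = list(islice(line_iter, n))
    return head, chain(head, line_iter)

big = io.StringIO("a,b\n" + "1,2\n" * 100_000)
head, lines = peek_lines(big, 2)
# Only two lines have been materialized so far; a reader could reject
# the format here without splitting the remaining ~1e5 lines.
assert head == ["a,b\n", "1,2\n"]
```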

taldcroft commented 5 years ago

Yes, in order to keep the code structured and pluggable, the high-level code reads in the entire file and then lets the individual reader classes use that as needed. For instance here: https://github.com/astropy/astropy/blob/94e17bc63f5de21c84605a3f495ec1733fb30467/astropy/io/ascii/ui.py#L344 This is part of the code that normalizes the possible inputs into what the reader classes need (the actual table lines), and it happens before there is any notion of which classes are being tried during guessing. Remember that one can pass a variety of inputs to the reader, not just a filename or file object.

So agreed, it would be good to make an improvement here, but I think it might require non-trivial structural changes to do nicely. At the very least, the statement that reads in the whole file early in the process needs to be deferred.
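Deferring that early read would mean normalizing the accepted inputs (filename, file object, raw string, list of lines) into a lazy line iterator instead of a list. A rough sketch of what that normalization could look like; `iter_lines` is a hypothetical helper, not the astropy function:

```python
import io
import os

def iter_lines(table):
    """Normalize the various accepted inputs into a *lazy* line iterator,
    deferring the full read until the lines are actually consumed."""
    if isinstance(table, (list, tuple)):       # already a list of lines
        yield from table
    elif hasattr(table, "read"):               # open file-like object
        yield from table
    elif isinstance(table, str) and "\n" not in table and os.path.exists(table):
        with open(table) as f:                 # filename on disk
            yield from f
    else:                                      # raw string of table data
        yield from io.StringIO(table)
```

With this shape, a reader's pre-check can pull one or two lines from the iterator and bail out early, regardless of which input form the caller used.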

astrofrog commented 5 years ago

@taldcroft - ah I see - I think there are potentially two separate things that can be optimized here, because even if I pre-read the whole file into a string, and pass the string to read:

>>> read(content, format='ecsv')

It still takes >1 second. In fact I just checked, and the I/O part is not what dominated my previous timing. So there is still room for optimization even if we don't change anything about the actual file reading, since there is a bottleneck even when a string is passed.

astrofrog commented 5 years ago
   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1424580    0.479    0.000    0.479    0.000 {method 'match' of 're.Pattern' objects}
        1    0.434    0.434    0.434    0.434 {method 'splitlines' of 'str' objects}
        1    0.390    0.390    1.352    1.352 core.py:711(<listcomp>)
  1424578    0.342    0.000    0.483    0.000 core.py:708(<genexpr>)
        1    0.171    0.171    0.221    0.221 data.py:110(get_readable_fileobj)
  1424578    0.142    0.000    0.142    0.000 {method 'strip' of 'str' objects}
        1    0.030    0.030    1.382    1.382 core.py:715(get_data_lines)
      931    0.030    0.000    0.030    0.000 {method 'get' of 'dict' objects}
        1    0.026    0.026    2.087    2.087 <string>:2(<module>)
        1    0.022    0.022    1.839    1.839 core.py:1121(read)
        1    0.020    0.020    0.050    0.050 parse.py:361(urlparse)
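The profile above is dominated by `str.splitlines` plus a per-line regex match over ~1.4 million lines. A minimal sketch that reproduces the shape of those hotspots under cProfile (the input is a small in-memory stand-in for the 200 MB file, and `parse` only mimics the comment-stripping step, not the real io.ascii code):

```python
import cProfile
import re

# Small in-memory stand-in for the large non-ECSV file.
content = "a,b\n" + "1,2\n" * 100_000

def parse(text):
    # Mimic the hotspots in the profile: splitlines plus a per-line
    # regex match (as io.ascii does when dropping comment lines).
    comment = re.compile(r"\s*#")
    return [line for line in text.splitlines() if not comment.match(line)]

profiler = cProfile.Profile()
profiler.enable()
rows = parse(content)
profiler.disable()
# profiler.print_stats("tottime") would show splitlines and
# re.Pattern.match at the top, as in the profile above.
assert len(rows) == 100_001
```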