frictionlessdata / tabulator-py

Python library for reading and writing tabular data via streams.
https://frictionlessdata.io
MIT License

Rebase TSV parser on CSV parser #338

Closed · cschloer closed this issue 3 years ago

cschloer commented 3 years ago

Overview

test.py

from dataflows import Flow, load

file_path = "/path/to/test.tsv"
flows = [load(file_path, name="res", format="tsv", skip_rows=["#"])]
print(Flow(*flows).results())

with the file test.tsv (columns separated by tabs):

#  This is a comment
#  
Lat Lon
33.6062 -117.9312
33.6062 -117.9312
33.6062 -117.9312

I get the error:

Traceback (most recent call last):
  File "/home/conrad/.virtualenvs/laminar/lib/python3.8/site-packages/tabulator/stream.py", line 757, in __extract_sample
    row_number, headers, row = next(self.__parser.extended_rows)
  File "/home/conrad/.virtualenvs/laminar/lib/python3.8/site-packages/tabulator/parsers/tsv.py", line 65, in __iter_extended_rows
    for row_number, item in enumerate(items, start=1):
  File "/home/conrad/.virtualenvs/laminar/lib/python3.8/site-packages/tsv.py", line 51, in un
    if check_line_consistency(columns, values, i, error_bad_lines):
  File "/home/conrad/.virtualenvs/laminar/lib/python3.8/site-packages/tsv.py", line 84, in check_line_consistency
    raise ValueError(message)
ValueError: Expected 1 fields in line 3, saw 2

It seems like the TSV parser strictly sets the number of fields allowed when it is initialized (https://github.com/frictionlessdata/tabulator-py/blob/master/tabulator/parsers/tsv.py#L63). Since the first item in this file is a comment with no tabs, it errors when a later line shows up with a seemingly larger number of fields.
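
To make the failure mode concrete, here is a minimal sketch (hypothetical code, not the actual tsv library) of a parser that freezes its column count from the first physical line, which is exactly what goes wrong when that first line is a tab-free comment:

# Hypothetical sketch of the failure mode, not the real tsv package code.
def naive_tsv_rows(lines):
    it = iter(lines)
    first = next(it).rstrip("\n").split("\t")
    ncols = len(first)  # column count frozen from the first physical line
    yield first
    for i, line in enumerate(it, start=2):
        values = line.rstrip("\n").split("\t")
        if len(values) != ncols:
            raise ValueError("Expected %d fields in line %d, saw %d" % (ncols, i, len(values)))
        yield values

# The comment "#  This is a comment" has no tabs, so ncols becomes 1 and the
# header "Lat\tLon" on line 3 fails with "Expected 1 fields in line 3, saw 2".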

I would fall back to just using the CSV module with \t as the delimiter (https://stackoverflow.com/questions/42358259/how-to-parse-tsv-file-with-python), but I keep getting the error "delimiter" must be a 1-character string - not sure if that's a result of custom code or not.
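
For reference, a sketch of that stdlib fallback (the file path is a placeholder); it works because comment filtering can happen after parsing instead of constraining the parser:

import csv

# Stdlib fallback: parse with a real single-character tab delimiter and
# drop comment lines afterwards, mirroring skip_rows=["#"].
with open("/path/to/test.tsv", newline="") as f:
    rows = [row for row in csv.reader(f, delimiter="\t")
            if row and not row[0].startswith("#")]
print(rows)
# [['Lat', 'Lon'], ['33.6062', '-117.9312'], ...]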


Please preserve this line to notify @roll (lead of this repository)

roll commented 3 years ago

Hi @cschloer,

I can't reproduce it:

from tabulator import Stream

with Stream('tmp/issue338.tsv', headers=1, format='csv', skip_rows=['#'], delimiter='\t') as stream:
    print(stream.headers)
    print(stream.read())
# ['Lat', 'Lon']
# [['33.6062', '-117.9312'], ['33.6062', '-117.9312'], ['33.6062', '-117.9312']]
cschloer commented 3 years ago

So the original issue (not the workaround) would be reproduced as such (using format tsv):

from tabulator import Stream

with Stream('tmp/issue338.tsv', headers=1, format='tsv', skip_rows=['#']) as stream:
    print(stream.headers)
    print(stream.read())

I'm unable to reproduce my own "\t" issue with dataflows and the standard load processor, but I think this bug still exists (with the tsv processor).

cschloer commented 3 years ago

The 1-character string issue might be a result of me upgrading to Python 3.8 or something...

Looking at the docs, it actually does specify that it should be a 1-character string:

https://docs.python.org/3/library/csv.html#csv.Dialect.delimiter
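
The constraint is easy to confirm against the stdlib directly; a real tab is accepted, while the escaped two-character string is rejected as soon as the reader is constructed:

import csv

csv.reader([], delimiter="\t")   # fine: a single tab character
csv.reader([], delimiter="\\t")  # TypeError: "delimiter" must be a 1-character string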

roll commented 3 years ago

@cschloer I see. The underlying TSV library isn't really maintained, so I think we need to switch TSV parsing over to Python's CSV module. For now, I would recommend using the csv format.
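
A sketch of the stdlib building block such a switch could use (this is not the actual tabulator change): the csv module already ships an excel-tab dialect and tolerates rows of varying width, so a leading comment line no longer poisons the column count:

import csv

# Parse TSV via the stdlib csv module instead of the unmaintained tsv package;
# rows of different widths are returned as-is rather than raising ValueError.
with open("tmp/issue338.tsv", newline="") as f:
    for row in csv.reader(f, dialect="excel-tab"):
        print(row)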

roll commented 3 years ago

MERGED into https://github.com/frictionlessdata/frictionless-py/issues/398

cschloer commented 3 years ago

Just to follow back on this: I realized that a front-end library I was using was changing "\t" to "\\t" before making the request to the server. Just a note that \t is now working, but it is still not possible to delimit on a multi-character string.
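
For anyone hitting the same escaping problem, a hypothetical server-side fix (an assumption about where to apply it, not part of tabulator) is to unescape the value before handing it to the parser:

import codecs

# The front end sent the two characters backslash + "t"; decode the escape
# back into a single real tab so the csv module accepts it as a delimiter.
raw = "\\t"
delimiter = codecs.decode(raw, "unicode_escape")
assert delimiter == "\t" and len(delimiter) == 1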