frictionlessdata / datapackage-py

A Python library for working with Data Packages.
https://frictionlessdata.io
MIT License

Inference of tab-separated resources breaks when using `array` #281

Closed: u8sand closed this issue 3 years ago

u8sand commented 3 years ago

Overview

Dialect inference should do better. In particular, the file extension and the first line are very good indications of the dialect; however, datapackage gets confused when arrays are used in a TSV file (presumably because of the quotes/commas).

Minimum "broken" example

pkg.json

{
  "profile": "tabular-data-package",
  "name": "datapackage",
  "title": "An example datapackage",
  "resources": [
    {
      "profile": "tabular-data-resource",
      "name": "table",
      "title": "table",
      "path": "table.tsv",
      "description": "A table",
      "schema": {
        "fields": [
          {
            "name": "id",
            "description": "An identifier representing this file, unique within this id_namespace [part 2 of 2-component composite primary key]",
            "type": "string",
            "constraints": {
              "required": true
            }
          },
          {
            "name": "array",
            "description": "A random array",
            "type": "array"
          }
        ]
      }
    }
  ]
}

table.tsv (columns are tab-separated; tabs are rendered here as spaces)

id  array
1   ["hello", "world", "!"]

test.py

from datapackage import Package
pkg = Package('pkg.json')
table = pkg.get_resource('table')
table.read()

Traceback (captured from an equivalent repro that uses tableschema directly)

CastError                                 Traceback (most recent call last)
<ipython-input-71-3b457c671ba4> in <module>
      1 from tableschema import Table
      2 table = Table('table.tsv', schema='table.json')
----> 3 table.read()

~/.local/lib/python3.9/site-packages/tableschema/table.py in read(self, keyed, extended, cast, limit, integrity, relations, foreign_keys_values, exc_handler)
    351             relations=relations, foreign_keys_values=foreign_keys_values,
    352             exc_handler=exc_handler)
--> 353         for count, row in enumerate(rows, start=1):
    354             result.append(row)
    355             if count == limit:

~/.local/lib/python3.9/site-packages/tableschema/table.py in iter(self, keyed, extended, cast, integrity, relations, foreign_keys_values, exc_handler)
    213             iterator = self.__apply_processors(
    214                 iterator, cast=cast, exc_handler=exc_handler)
--> 215             for row_number, headers, row in iterator:
    216 
    217                 # Get headers

~/.local/lib/python3.9/site-packages/tableschema/table.py in builtin_processor(extended_rows)
    506             for row_number, headers, row in extended_rows:
    507                 if self.__schema and cast:
--> 508                     row = self.__schema.cast_row(
    509                         row, row_number=row_number, exc_handler=exc_handler)
    510                 yield (row_number, headers, row)

~/.local/lib/python3.9/site-packages/tableschema/schema.py in cast_row(self, row, fail_fast, row_number, exc_handler)
    273                      value)
    274                     for (i, value) in enumerate(row))
--> 275             exc_handler(exc, row_number=row_number, row_data=keyed_row,
    276                         error_data=keyed_row)
    277 

~/.local/lib/python3.9/site-packages/tableschema/helpers.py in default_exc_handler(exc, *args, **kwargs)
     88     """Default exception handler function: raise exc, ignore other arguments.
     89     """
---> 90     raise exc
     91 
     92 

CastError: Row length 2 doesn't match fields count 0 for row "2"
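
For illustration (this snippet is not part of the original report), parsing the two table.tsv lines with Python's csv module and its default comma dialect shows where the mismatch presumably comes from: the header collapses into a single field, while the array row splits apart on its embedded commas.

import csv

# The raw lines of table.tsv: tab-separated, with an embedded JSON array
lines = ['id\tarray', '1\t["hello", "world", "!"]']
for row in csv.reader(lines):
    print(row)

# Output:
# ['id\tarray']                         (header read as one field)
# ['1\t["hello"', ' "world"', ' "!"]']  (array cell split on its commas)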

Note that this can be corrected by adding a dialect to the resource (#/resources/0):

{
  "dialect": {
    "delimiter": "\t"
  }
}

But since we're using .tsv and there are no commas in the header, it really seems like the dialect should be inferred. Instead, CSV seems to be assumed, and you get errors. The reported error makes this very hard to debug in code that worked perfectly fine before the `array` type was introduced.


pwalsh commented 3 years ago

Hi @u8sand

If no format is explicitly passed, then there is no inference; the file is parsed as CSV.

Tabulator, the underlying stream-processing library, would need to be passed `None` as the format in order to infer it.

So the bug is not that inference should be better, but that all files are parsed as CSV unless explicitly declared otherwise. You should be able to run a quick test to confirm that this is what is happening: set the format to None and see whether Tabulator correctly detects TSV.
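
A minimal sketch of that quick test (assuming tabulator's Stream API; this snippet is not from the original comment):

from tabulator import Stream

# No format argument is passed, so it stays None and tabulator infers
# the format from the '.tsv' extension instead of assuming CSV.
with Stream('table.tsv', headers=1) as stream:
    print(stream.headers)  # expect ['id', 'array'] if TSV was detected
    print(stream.read())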

u8sand commented 3 years ago

@pwalsh Ah, I see, so the format was missing all along. Thank you for this; sorry for the invalid issue.

It should be noted, with this new information, that specifying `format: null` also fixes it, suggesting that you're right about passing None along to tabulator. Perhaps that would be a more intuitive default.
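
For reference, a minimal sketch of that workaround in the resource descriptor (schema omitted for brevity; this assumes the resource-level format property is passed through to tabulator as-is):

{
  "profile": "tabular-data-resource",
  "name": "table",
  "path": "table.tsv",
  "format": null
}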

pwalsh commented 3 years ago

The reason that CSV is the default is that CSV is the required data format for a tabular data package, according to the spec. Passing a format explicitly gives you a loophole around this.
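
For example (a sketch, not from the original comment), explicitly declaring the format in the resource descriptor is one such loophole:

{
  "profile": "tabular-data-resource",
  "name": "table",
  "path": "table.tsv",
  "format": "tsv"
}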