cldf / csvw

CSV on the web
Apache License 2.0

iterdicts() does not use 'name's or 'titles' #24

Closed alexshpilkin closed 5 years ago

alexshpilkin commented 5 years ago

I’m observing iterdicts() using the CSV header instead of names or titles. Test case: test.zip.

>>> from csvw import *
>>> t = TableGroup.from_file('test.tsv-metadata.json')
>>> next(t.tables[0].iterdicts())
OrderedDict([('region', 'Hello'), ('tik', 'world'), ('uik', '1')])

should be

OrderedDict([('province', 'Hello'), ('territory', 'world'), ('precinct', '1')])

xrotwang commented 5 years ago

Ok, here's what's going on: If you want the header in the CSV file ignored, you'd have to specify

            "header": false,
            "skipRows": 1

in the table's dialect. Otherwise iterdicts tries to match the column names in the file header with name or titles in the metadata. The advantage of this is that it allows re-ordering of the columns in the CSV file without invalidating the metadata.

Alternatively, you could specify the column names from the CSV as titles in the metadata. Then they will be renamed to the name property in the resulting dict.
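For reference, the two workarounds could look roughly like this if the metadata is built in Python. This is only a sketch: the file name test.tsv, the tab delimiter, and the pairing of header cells with names are assumptions inferred from the example output above, not taken from test.zip.

```python
import json

# Column names wanted in the resulting dicts (from the expected output above).
names = ["province", "territory", "precinct"]
# Header cells as they appear in the data file (from the observed output above).
csv_header = ["region", "tik", "uik"]

# Workaround 1: tell the dialect that there is no header and skip the first row.
ignore_header = {
    "@context": "http://www.w3.org/ns/csvw",
    "tables": [{
        "url": "test.tsv",  # assumed file name
        "dialect": {"delimiter": "\t", "header": False, "skipRows": 1},
        "tableSchema": {"columns": [{"name": n} for n in names]},
    }],
}

# Workaround 2: keep the header and list each header cell as a title,
# so that it can be matched and renamed to the corresponding name.
match_by_title = {
    "@context": "http://www.w3.org/ns/csvw",
    "tables": [{
        "url": "test.tsv",  # assumed file name
        "dialect": {"delimiter": "\t"},
        "tableSchema": {
            "columns": [
                {"name": n, "titles": [t]} for n, t in zip(names, csv_header)
            ]
        },
    }],
}

print(json.dumps(ignore_header, indent=2))
```

Either dict can be written out with json.dump as test.tsv-metadata.json.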

xrotwang commented 5 years ago

Maybe we should emit a warning upon encountering columns in a file which cannot be matched to any column in the metadata. I wouldn't make this an error, though, since a somewhat common use case for me is creating metadata for data files which are not completely under my control - and this is more robust if re-ordering of columns and addition of non-described columns in the data are possible.

alexshpilkin commented 5 years ago

I’m afraid this matching isn’t actually permitted by the spec.

Model §6.1:

1 Retrieve the metadata file yielding the metadata UM (which is treated as overriding metadata, see § 5.1 Overriding Metadata).
[...]
3 For each table (TM) in UM in order, create one or more annotated tables:
[...]
3.3 Parse the tabular data file, using DD as a guide, to create a basic tabular data model (T) and extract embedded metadata (EM), for example from the header line.
[...]
3.5 Verify that TM is compatible with EM using the procedure defined in Table Description Compatibility in [tabular-metadata]; if TM is not compatible with EM validators MUST raise an error, other processors MUST generate a warning and continue processing.
3.6 Use the metadata TM to add annotations to the tabular data model T as described in Section 2 Annotating Tables in [tabular-metadata].

Note that the last point ignores EM completely.

The definition of compatibility in Vocabulary §5.5.1 specifically includes ordering:

Two schemas are compatible if they have the same number of non-virtual column descriptions, and the non-virtual column descriptions at the same index within each are compatible with each other. [... Goes on to require matching of titles when specified in both schemas.]

alexshpilkin commented 5 years ago

In any case, the (non-normative) spec for parsing CSV in Model §8 says one is supposed to treat the header as containing titles, not names:

7.3.2.2 Otherwise, if there is no column description object at index i in M.tableSchema.columns, create a new one with a title property whose value is an array containing a single value that is the value at index i in the list of cell values.

Schema compatibility (see above) then says one is supposed to match two column descriptions when one only contains a title and another only a name.

Column descriptions are compatible under the following conditions:

  • [...]
  • If not validating, and one schema has a name property but not a titles property, and the other has a titles property but not a name property.
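To make the index-based reading concrete, here is a rough Python sketch of the check as quoted in this thread. It models only the rules cited here (same number of non-virtual columns, pairwise comparison at the same index, titles overlap when both sides have titles, name-only versus titles-only when not validating). The conditions hidden behind the "[...]" are not modelled and such pairs are simply reported as incompatible, which is an assumption of this sketch. Titles are assumed to be plain lists of strings, and the function names are made up.

```python
def column_pair_compatible(embedded, described, validating=False):
    """Compare two column descriptions using only the rules quoted above."""
    e_titles = set(embedded.get("titles", []))
    d_titles = set(described.get("titles", []))
    # Titles specified in both descriptions must overlap.
    if e_titles and d_titles:
        return bool(e_titles & d_titles)
    if not validating:
        e_name_only = "name" in embedded and not e_titles
        d_name_only = "name" in described and not d_titles
        e_titles_only = bool(e_titles) and "name" not in embedded
        d_titles_only = bool(d_titles) and "name" not in described
        if (e_name_only and d_titles_only) or (e_titles_only and d_name_only):
            return True
    # Conditions elided in the quote are not modelled (assumption: incompatible).
    return False


def schemas_compatible(em_columns, tm_columns, validating=False):
    """Index-wise check: position matters, so re-ordered columns fail."""
    em = [c for c in em_columns if not c.get("virtual")]
    tm = [c for c in tm_columns if not c.get("virtual")]
    return len(em) == len(tm) and all(
        column_pair_compatible(e, t, validating) for e, t in zip(em, tm)
    )


# Embedded metadata from the header (titles only) vs. my schema (names only).
em = [{"titles": [t]} for t in ["region", "tik", "uik"]]
tm = [{"name": n} for n in ["province", "territory", "precinct"]]
print(schemas_compatible(em, tm))
```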

xrotwang commented 5 years ago

Hm. Maybe one could implement some sort of strict mode. I wouldn't want to lose the flexibility of the current system, though, because as I said, being able to curate data and metadata independently is a big win for me.

alexshpilkin commented 5 years ago

See my second comment: maybe you could avoid matching names (not titles) with the header? Then the header can remain a source of titles, but I’ll be able to specify the names in the schema.

alexshpilkin commented 5 years ago

... Ah, I see, you’re treating the schema columns (or the header columns?) as truly unordered. Yes, that would be a problem.

xrotwang commented 5 years ago

Well, I have both orders: The row dict keeps the order from the data file, and the tableSchema has an ordered list of column descriptions.

alexshpilkin commented 5 years ago

I guess you could match titles first, and then match the remaining columns (in both metadata and header) using their ordering only. The columns left until the second step in the metadata should not have titles (or a warning is raised).

alexshpilkin commented 5 years ago

This neatly agrees with the header=false case, by the way, because then in the first step there are no header titles to match.
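A rough sketch of that two-step idea in plain Python, not tied to the csvw code base. The function name match_columns, the width parameter for the header=false case, and the choice to warn only when a header is present are all assumptions of mine; columns are assumed to be dicts with a "name" and an optional list of "titles".

```python
import warnings


def match_columns(header, columns, width=None):
    """Return the key to use for each column of the data file.

    header  -- list of header cells, or None when the dialect has header=false
    columns -- column descriptions from the metadata, in schema order
    width   -- number of columns per row; required only when header is None
    """
    n = len(header) if header is not None else width
    titles = header if header is not None else [None] * n
    assigned = {}                      # file column index -> schema column index
    free = list(range(len(columns)))   # schema columns not matched yet

    # Step 1: match header cells against titles, regardless of position.
    for i, title in enumerate(titles):
        if title is None:
            continue
        for j in free:
            if title in columns[j].get("titles", []):
                assigned[i] = j
                free.remove(j)
                break

    # Step 2: pair whatever is left on both sides purely by order.
    leftover = [i for i in range(n) if i not in assigned]
    for i, j in zip(leftover, free):
        if header is not None and columns[j].get("titles"):
            warnings.warn(
                "column %r has titles but none matched the header"
                % columns[j]["name"]
            )
        assigned[i] = j

    # Surplus file columns with no description keep their header cell.
    return [
        columns[assigned[i]]["name"] if i in assigned else titles[i]
        for i in range(n)
    ]


schema = [{"name": "province"}, {"name": "territory"}, {"name": "precinct"}]
print(match_columns(["region", "tik", "uik"], schema))
```

With header=None and width=3 the first step has nothing to match, so the result is just the schema names in order.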

xrotwang commented 5 years ago

Considering that I already have many datasets managed with this package, backwards compatibility is really important for me - so I'm rather leaning towards a strict mode, i.e. match the columns by index only, possibly adding the column names from the CSV file as additional titles, and possibly adding column descriptions for surplus columns from the CSV.
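Roughly what I have in mind for such a strict mode, as a standalone sketch rather than actual csvw code. The function name strict_merge is made up, titles are assumed to be lists of strings, and surplus columns get a titles-only description along the lines of the parsing rule quoted earlier.

```python
import copy


def strict_merge(header, columns):
    """Match header cells to column descriptions by index only.

    Each header cell is added as an additional title of the description at
    the same index; surplus header cells get a new titles-only description.
    The input list is left untouched.
    """
    merged = copy.deepcopy(columns)
    for i, title in enumerate(header):
        if i < len(merged):
            titles = merged[i].setdefault("titles", [])
            if title not in titles:
                titles.append(title)
        else:
            merged.append({"titles": [title]})
    return merged


schema = [{"name": "province"}, {"name": "territory"}, {"name": "precinct"}]
merged = strict_merge(["region", "tik", "uik", "extra"], schema)
for column in merged:
    print(column)
```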

xrotwang commented 5 years ago

If there was more tooling for CSVW in general, I'd be inclined to follow the spec more closely (but I'd attribute the lack of tooling also to the complexity of implementing the spec :) ). But as it stands, I would regard this rather as a documentation issue - even though it means documenting a conflict with the spec.

alexshpilkin commented 5 years ago

Fair enough.

xrotwang commented 5 years ago

@alexshpilkin feel free to re-open, if the workaround isn't possible with your data!