alexshpilkin closed 5 years ago
Ok, here's what's going on: If you want the header in the CSV file ignored, you'd have to specify
"header": false,
"skipRows": 1
in the table's dialect. Otherwise iterdicts tries to match the column names in the file header with name or titles in the metadata. The advantage of this is that it allows re-ordering of the columns in the CSV file without invalidating the metadata.
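To make the dialect workaround concrete, here is a minimal sketch in plain Python. It does not use the package itself: the table description, the file name data.csv and the helper read_by_position are hypothetical, and only the "header" and "skipRows" settings come from the comment above.

```python
import csv
import io

# Hypothetical table description; only "header" and "skipRows"
# are taken from the comment above.
table = {
    "url": "data.csv",
    "dialect": {"header": False, "skipRows": 1},
    "tableSchema": {"columns": [{"name": "id"}, {"name": "label"}]},
}


def read_by_position(text, table):
    """Skip dialect.skipRows lines, then key each cell by schema position."""
    names = [col["name"] for col in table["tableSchema"]["columns"]]
    rows = list(csv.reader(io.StringIO(text)))
    skip = table["dialect"]["skipRows"]
    return [dict(zip(names, row)) for row in rows[skip:]]


# The header line "ID,Label" is skipped, so it plays no role in naming.
rows = read_by_position("ID,Label\n1,foo\n", table)
print(rows)
```

With the header skipped, the keys of each row come only from the order of the column descriptions, which is why re-ordering the CSV columns would silently mislabel data in this mode.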
Alternatively, you could specify the column names from the CSV as titles in the metadata. Then they will be renamed to the name property in the resulting dict.
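The matching described here can be sketched roughly as follows. This is an illustration in plain Python, not the package's actual code: header_to_names is a hypothetical helper, titles are assumed to be a string or a list of strings (language maps are ignored), and leaving unmatched header cells under their own text is an assumption.

```python
def header_to_names(header, columns):
    """Map each header cell to a column name via its name or titles."""
    lookup = {}
    for col in columns:
        lookup[col["name"]] = col["name"]
        titles = col.get("titles", [])
        if isinstance(titles, str):
            titles = [titles]
        for title in titles:
            lookup[title] = col["name"]
    # Assumption: header cells with no match keep their own text as key.
    return [lookup.get(cell, cell) for cell in header]


columns = [
    {"name": "id", "titles": "ID"},
    {"name": "label", "titles": ["Label", "Name"]},
]
# The file has its columns in a different order, plus one undescribed column.
header = ["Label", "ID", "extra"]
keys = header_to_names(header, columns)
row = dict(zip(keys, ["foo", "1", "x"]))
print(row)
```

Because the lookup is by text rather than by position, the re-ordered file still yields correctly labelled rows.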
Maybe we should emit a warning upon encountering columns in a file which cannot be matched to any column in the metadata. I wouldn't make this an error, though, since a somewhat common use case for me is creating metadata for data files which are not completely under my control, and this is more robust if re-ordering of columns and addition of non-described columns in the data are possible.
I’m afraid this matching isn’t actually permitted by the spec.
Model §6.1:
1 Retrieve the metadata file yielding the metadata UM (which is treated as overriding metadata, see § 5.1 Overriding Metadata).
[...]
3 For each table (TM) in UM in order, create one or more annotated tables:
[...]
3.3 Parse the tabular data file, using DD as a guide, to create a basic tabular data model (T) and extract embedded metadata (EM), for example from the header line.
[...]
3.5 Verify that TM is compatible with EM using the procedure defined in Table Description Compatibility in [tabular-metadata]; if TM is not compatible with EM validators MUST raise an error, other processors MUST generate a warning and continue processing.
3.6 Use the metadata TM to add annotations to the tabular data model T as described in Section 2 Annotating Tables in [tabular-metadata].
Note that the last point ignores EM completely.
The definition of compatibility in Vocabulary §5.5.1 specifically includes ordering:
Two schemas are compatible if they have the same number of non-virtual column descriptions, and the non-virtual column descriptions at the same index within each are compatible with each other. [... Goes on to require matching of titles when specified in both schemas.]
In any case, the (non-normative) spec for parsing CSV in Model §8 says one is supposed to treat the header as containing titles, not names:
7.3.2.2 Otherwise, if there is no column description object at index i in M.tableSchema.columns, create a new one with a title property whose value is an array containing a single value that is the value at index i in the list of cell values.
Schema compatibility (see above) then says one is supposed to match two column descriptions when one only contains a title and another only a name.
Column descriptions are compatible under the following conditions:
- [...]
- If not validating, and one schema has a name property but not a titles property, and the other has a titles property but not a name property.
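As a rough illustration of the index-based compatibility check quoted above, here is a simplified sketch. It models only the rules mentioned in this thread (same number of non-virtual columns, pairwise comparison by index, overlapping titles when both sides have them, and the name-versus-titles case when not validating). The conditions elided with [...] are not modelled, language tags are ignored, and all function names are hypothetical.

```python
def _titles(col):
    """Return a column's titles as a set (string or list assumed)."""
    t = col.get("titles", [])
    return {t} if isinstance(t, str) else set(t)


def columns_compatible(a, b, validating=False):
    ta, tb = _titles(a), _titles(b)
    if ta and tb:
        # Titles given on both sides must overlap.
        return bool(ta & tb)
    if not validating:
        a_name_only = "name" in a and not ta
        b_name_only = "name" in b and not tb
        a_titles_only = bool(ta) and "name" not in a
        b_titles_only = bool(tb) and "name" not in b
        if (a_name_only and b_titles_only) or (a_titles_only and b_name_only):
            return True
    # Conditions elided in the quote above are not modelled here.
    return False


def schemas_compatible(tm, em, validating=False):
    """Compare non-virtual column descriptions index by index."""
    a = [c for c in tm if not c.get("virtual")]
    b = [c for c in em if not c.get("virtual")]
    return len(a) == len(b) and all(
        columns_compatible(x, y, validating) for x, y in zip(a, b)
    )


tm = [{"name": "id"}, {"name": "label", "titles": ["Label"]}]
em = [{"titles": ["ID"]}, {"titles": ["Label"]}]  # as derived from a header
print(schemas_compatible(tm, em))                  # same order
print(schemas_compatible(tm, list(reversed(em))))  # re-ordered file
```

In this sketch the re-ordered header fails the check because the second pair has titles on both sides that do not overlap, which is the ordering sensitivity the comment points out.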
Hm. Maybe one could implement some sort of strict mode. I wouldn't want to lose the flexibility of the current system, though, because as I said, being able to curate data and metadata independently is a big win for me.
See my second comment: maybe you could avoid matching names (not titles) with the header? Then the header can remain a source of titles, but I’ll be able to specify the names in the schema.
... Ah, I see, you’re treating the schema columns (or the header columns?) as truly unordered. Yes, that would be a problem.
Well, I have both orders: The row dict keeps the order from the data file, and the tableSchema has an ordered list of column descriptions.
I guess you could match titles first, and then match the remaining columns (in both metadata and header) using their ordering only. The columns left until the second step in the metadata should not have titles (or a warning is raised).
This neatly agrees with the header=false case, by the way, because then in the first step there are no header titles to match.
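The two-step proposal could look roughly like this. It is only a sketch of the idea as stated in the comment, not something the package implements: match_columns is a hypothetical name, every column description is assumed to have a name, titles are assumed to be a string or a list, and a missing header is represented by None cells.

```python
import warnings


def match_columns(header, columns):
    """Return one column name (or None) per header cell."""
    def titles(col):
        t = col.get("titles", [])
        return [t] if isinstance(t, str) else list(t)

    result = [None] * len(header)
    unused = list(columns)

    # Step 1: match header cells against titles.
    for i, cell in enumerate(header):
        for col in unused:
            if cell in titles(col):
                result[i] = col["name"]
                unused.remove(col)
                break

    # Step 2: pair the leftovers on both sides by their order only.
    free = [i for i, name in enumerate(result) if name is None]
    for i, col in zip(free, unused):
        if titles(col):
            warnings.warn(f"column {col['name']!r} has titles but matched none")
        result[i] = col["name"]
    return result


columns = [{"name": "id"}, {"name": "label", "titles": "Label"}, {"name": "note"}]
matched = match_columns(["Label", "ID", "Note"], columns)
print(matched)

# header=false: no header titles exist, so everything falls to step 2.
no_header = match_columns([None, None], [{"name": "a"}, {"name": "b"}])
print(no_header)
```

Header cells left unmatched after both steps stay None here; what to do with them (warn, keep the header text) is a separate decision the sketch does not make.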
Considering that I already have many datasets managed with this package, backwards compatibility is really important for me - so I'm rather leaning towards a strict mode, i.e. match the columns by index only, possibly adding the column names from the CSV file as additional titles, and possibly adding column descriptions for surplus columns from the CSV.
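For comparison, a strict mode along these lines might be sketched as below. Again this is a hypothetical illustration rather than existing package behaviour: strict_match is an invented name, and titles are assumed to be a string or a list of strings.

```python
def strict_match(header, columns):
    """Pair header cells with column descriptions by index only."""
    merged = []
    for i, cell in enumerate(header):
        if i < len(columns):
            col = dict(columns[i])  # copy, so the schema is not mutated
            t = col.get("titles", [])
            t = [t] if isinstance(t, str) else list(t)
            if cell not in t:
                t.append(cell)  # header text becomes an additional title
            col["titles"] = t
        else:
            # Surplus CSV column: create a description from the header cell.
            col = {"titles": [cell]}
        merged.append(col)
    return merged


columns = [{"name": "id"}]
merged = strict_match(["ID", "Extra"], columns)
print(merged)
```

The schema order is authoritative in this mode, so a re-ordered CSV file would be accepted but labelled by position.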
If there was more tooling for CSVW in general, I'd be inclined to follow the spec more closely (but I'd attribute the lack of tooling also to the complexity of implementing the spec :) ). But as it stands, I would regard this rather as a documentation issue - even though it means documenting a conflict with the spec.
Fair enough.
@alexshpilkin feel free to re-open, if the workaround isn't possible with your data!
I’m observing iterdicts() using the CSV header instead of names or titles. Test case: test.zip.
should be