Should Dataset.write ignore columns it doesn't recognize?

cldf / csvw

CSV on the web

Apache License 2.0

36 stars 5 forks source link

Should Dataset.write ignore columns it doesn't recognize? #57

Closed Anaphory closed 6 months ago

Anaphory commented 2 years ago

When writing a table, the writer https://github.com/cldf/csvw/blob/45584ad63ff3002a9b3a8073607c1847c5cbac58/src/csvw/metadata.py#L649-L652 ignores all columns not in the table schema. I have used this a few times, in particular in quick-and-dirty scripts to manipulate CLDF data, but I have also been sometimes confused about outputs (eg. due to typos in the column name), and it is also a deviation from the Python core csv behaviour. Is it a conscious choice here, or is this a thing that we should consider changing down the line, preferably before too many other places start relying on it?

xrotwang commented 2 years ago

I think that would open up a big can of worms - e.g. in cases where different row dicts have different un-specified columns. I'm also not a big fan of csv's behaviour here - possibly leading to csv rows with different number of columns - which makes quite a few csv consuming tools choke.

Anaphory commented 2 years ago

Huh? I'm not suggesting making this more loose. I'm suggesting to make it stricter. Currently, when the metadata contain a Source column and the row to be written is {'source': ['forkel2018cross']}, silently, an empty cell is written and the superfluous dict key is ignored. For example, csv.DictWriter would throw

ValueError: dict contains fields not in fieldnames: 'source'

I think CLDF should do the same, but maybe you have a good reason to not do it. (I have abused this silent dropping of superfluous fields on rare occasions, so it's not entirely without merit, that's why I ask.)

xrotwang commented 2 years ago

@Anaphory ah, ok. I think now I understand. I'm not sure, I'd want to break backwards-compatibility for this, though. But something like a strict=True flag for Table.write would be ok. To make this usable from libraries like pycldf, these would need to be changed, too, of course.