cldf / pycldf

python package to read and write CLDF datasets
https://cldf.clld.org
Apache License 2.0

How to handle unglossed words? #158

Open fmatter opened 1 year ago

fmatter commented 1 year ago

Quite often, people will not gloss words like person or place names or unparsable words, so some words may only be present in Primary_Text, but not in Analyzed_Word or Gloss.

The most transparent way to store an example like that in CLDF is to have an empty list item in these two columns:

Primary_Text: "x y Person z"
Analyzed_Word: "x\ty\t\tz" (["x","y",None,"z"] once read by pycldf)
Gloss: "xg\tyg\t\tzg" (["xg","yg",None,"zg"])

This passes validation, but cldf createdb, for example, fails (TypeError: sequence item 1: expected str instance, NoneType found), so I've been working around it in initializedb.py scripts with things like ex["Analyzed_Word"] = ["" if x is None else x for x in ex["Analyzed_Word"]].
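To make the failure mode concrete, here is a minimal sketch in plain Python. The helper split_column is hypothetical: it only imitates the behaviour described above (empty list items come back as None) and is not pycldf's actual reading code.

```python
# Hypothetical stand-in for reading a tab-separated column;
# it only mimics the "empty item becomes None" behaviour described above.
def split_column(value, sep="\t"):
    """Split a delimited cell, mapping empty items to None."""
    return [item if item else None for item in value.split(sep)]


analyzed = split_column("x\ty\t\tz")  # ["x", "y", None, "z"]

# Joining the list back into a string fails while it still contains None.
try:
    "\t".join(analyzed)
    join_failed = False
except TypeError:
    join_failed = True

# Workaround from the comment above: replace None with an empty string.
patched = ["" if x is None else x for x in analyzed]
rejoined = "\t".join(patched)
```

Any code that joins the list with str.join would presumably hit the same TypeError, which is why the workaround has to run before the data is passed on.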

Should empty items in a gloss column raise an error upon validation? If yes, is the way to handle unglossed words to simply leave them out (i.e. "x\ty\tz" ["x","y","z"])? Or, if empty items are allowed, would it be OK for pycldf to yield "" instead of None (i.e. "x\ty\t\tz" ["x","y","","z"])?

xrotwang commented 1 year ago

Hm, the most transparent practice I've seen in this regard is using an ellipsis (ideally the Unicode character U+2026, not three dots ...) in both Analyzed_Word and Gloss. Admittedly, this is often applied inconsistently, e.g. by leaving out the ellipsis in the Gloss. But from my point of view, recommending this practice would also raise awareness of the fact that the ellipsis is part of the example and must be considered for consistency.
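As an illustration of the ellipsis convention, the example from the first comment could be stored as sketched below. This is plain Python on raw column strings, with the dict standing in for one row of an example table; the variable names are made up for the sketch.

```python
ELLIPSIS = "\u2026"  # the single Unicode character, not three dots

# One example row, with the unglossed word "Person" marked by an
# ellipsis in both aligned columns.
example = {
    "Primary_Text": "x y Person z",
    "Analyzed_Word": "x\ty\t\u2026\tz",
    "Gloss": "xg\tyg\t\u2026\tzg",
}

words = example["Analyzed_Word"].split("\t")
glosses = example["Gloss"].split("\t")

# Both columns stay aligned and contain only str items.
aligned = len(words) == len(glosses)
all_str = all(isinstance(item, str) for item in words + glosses)

# The placeholder should sit at the same positions in both columns.
placeholders_match = (
    [i for i, w in enumerate(words) if w == ELLIPSIS]
    == [i for i, g in enumerate(glosses) if g == ELLIPSIS]
)
```

The last check covers the inconsistency mentioned above, where the ellipsis appears in one column but is left out of the other.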

fmatter commented 1 year ago

That's a very reasonable solution, works for me.

Should None in tab-delimited columns raise a validation error?

xrotwang commented 1 year ago

Should None in tab-delimited columns raise a validation error?

Yes, I would say so. After all, one of the main reasons for using ellipsis for unglossed words is that we get lists of str for both aligned properties.
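A rough sketch of what such a check could look like, written as a standalone function over already-read rows. The function name none_items and its signature are assumptions for illustration, not part of pycldf's validation.

```python
# Hypothetical helper: report None items in list-valued, aligned columns.
def none_items(row, columns=("Analyzed_Word", "Gloss")):
    """Return (column, index) pairs for every None item in the given columns."""
    problems = []
    for col in columns:
        for i, item in enumerate(row.get(col) or []):
            if item is None:
                problems.append((col, i))
    return problems


bad_row = {
    "Analyzed_Word": ["x", "y", None, "z"],
    "Gloss": ["xg", "yg", None, "zg"],
}
good_row = {
    "Analyzed_Word": ["x", "y", "\u2026", "z"],
    "Gloss": ["xg", "yg", "\u2026", "zg"],
}

bad_problems = none_items(bad_row)
good_problems = none_items(good_row)
```

A validator could turn each reported pair into an error or a warning; which of the two is appropriate is exactly the open question here.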

xrotwang commented 1 year ago

Maybe we could keep some sort of backwards compatibility (with somewhat undefined behaviour) by converting None to ellipsis upon reading.
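The conversion itself would be small. A minimal sketch, assuming the column has already been read into a list; none_to_ellipsis is a hypothetical name, and where such a step would hook into pycldf is left open.

```python
# Hypothetical conversion step applied to a list-valued column after reading.
def none_to_ellipsis(items, placeholder="\u2026"):
    """Replace None items with the ellipsis placeholder, leaving str items as-is."""
    return [placeholder if item is None else item for item in items]


converted = none_to_ellipsis(["x", "y", None, "z"])

# After conversion the list contains only str and can be joined safely.
joined = "\t".join(converted)
```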