Closed SimonGreenhill closed 3 years ago
This is what we have in terms of docs for this issue so far: https://pycldf.readthedocs.io/en/latest/dataset.html#accessing-schema-objects-components-tables-columns-etc Do you think a particular method should be made canonical?
Hmm. I guess I want to make sure my code is robust for things like "check if there is a cognates table", "give me the cognates table". What is the best way to do that?
Ok, will try to add a something tutorial-like to the docs. The general idea, though, is that Dataset
acts like a dictionary when it comes to accessing schema objects like tables and columns, i.e.:
"CognateTable" in dataset
dataset["CognateTable"]
None
otherwise with dataset.get("CognateTable")
.For column objects the keys would be pairs like ("CognateTable", "id")
.
Ahh, that helps -- thanks! I've been doing this:
def get_cognates(cldf):
# do we have a cognates table?
CognateTable = cldf.get('CognateTable')
if CognateTable is None:
raise CLDFException("No Cognate Table found in CLDF dataset!")
FormTable = cldf.get('FormTable')
if FormTable is None: # pragma: no cover
raise CLDFException("No Forms Table found in CLDF dataset!")
# get form_id -> language_id mappings as these are not in cognates.csv
forms = {r['ID']: r['Language_ID'] for r in FormTable}
cognates = defaultdict(set)
for i, row in enumerate(CognateTable, 1):
if not row.get('Doubt'):
if row['Form_ID'] not in forms: # pragma: no cover
raise CLDFException("Invalid form_id %s" % row['Form_ID'])
cognates[row['Cognateset_ID']].add(forms[row['Form_ID']])
return cognates
...which would then be better as:
def get_cognates(cldf):
# do we have a cognates table?
if "CognateTable" not in cldf:
raise CLDFException("No Cognate Table found in CLDF dataset!")
if "FormTable" not in cldf:
raise CLDFException("No Forms Table found in CLDF dataset!")
# get form_id -> language_id mappings as these are not in cognates.csv
forms = {r['ID']: r['Language_ID'] for r in dataset["FormTable"]}
cognates = defaultdict(set)
for i, row in enumerate(dataset["CognateTable"], 1):
if not row.get('Doubt'):
if row['Form_ID'] not in forms: # pragma: no cover
raise CLDFException("Invalid form_id %s" % row['Form_ID'])
cognates[row['Cognateset_ID']].add(forms[row['Form_ID']])
return cognates
Maybe we should just make the messages of the KeyError
s here https://github.com/cldf/pycldf/blob/275bb9f3aedfed671b67d85ba555f017fb38f501/src/pycldf/dataset.py#L369-L381 a bit more telling. Then you could just do dataset["FormTable"]
above and get mostly the same behaviour.
That would work, perhaps a new exception class TableError
to enable a try..except to catch it?
Why does that error not inherit from KeyError
, so that existing try/except blocks that catch the KeyError
which has been thrown in this situation in the past continue to work?
Do you intend to use pycldf.dataset.SchemaError
also in other places with other meanings?
If we have this kind of special Exceptions anyway (I thought I remembered @xrotwang being vehemently opposed to Exception subclasses you have to import to catch them in the past? Am I misremembering, or has your opinion changed?), can we please use it fully?
In one bit of code which I'm really not proud of, I just found the following snippet:
try:
...
# Try accessing a lot of different columns a subsequent computation needs
except KeyError:
formatted_lines = traceback.format_exc().splitlines()
match = re.match(r'.*"(.*?)", "(.*?)".*', formatted_lines[2])
cli.Exit.INVALID_COLUMN_NAME(
f"The {match.group(1)} is missing the column {match.group(2)} which is required by cldf."
)
A clean split into pycldf.dataset.TableMissingError
and pycldf.dataset.ColumnMissingError
, each deriving from pycldf.dataset.SchemaError
and KeyError
, and each carrying in attributes the specific table/column that is missing, would make this kind of test and error reporting to the user much more ergonomical. So I guess I will keep the abyssmal hack you see above until I know whether to change it into except ColumnMissingError as e: complain_to_user_about(e.column)
or a separate try/except block around every column I need to check for.
@Anaphory SchemaError
does subclass KeyError
.
@Anaphory
>>> from pycldf import Wordlist
>>> wl = Wordlist.in_dir('.')
>>> try:
... wl['xtable']
... except KeyError as e:
... print(str(e))
... pass
...
'Dataset has no table "xtable"'
I wouldn't consider changing the error message a breaking change.
Strange, it does indeed inherit from KeyError, which makes me very confused why my code broke. I'll investigate. Changing the error message is not a breaking change, sure, and that error message is actually nice.
I keep doing these things where I slightly misread an error message, go on a wild spree trying to fix it in all the wrong places, and then think something is wrong in pycldf
because all my wrong fixes don't work, don't I? Problem between screen and chair, as my dad tends to say.
As a rule of thumb: 99% of all bugs you find are in your own code. It takes a couple of years to accept that - and then, the year after you find a bug in python :)
e.g. how do you find the table that contains, say, cognates. Is it safe to assume that
cldf.objects('CognateTable')
is correct, or is something likecldf.get('CognateTable')
better?