cldf / pycldf

python package to read and write CLDF datasets
https://cldf.clld.org
Apache License 2.0
15 stars 7 forks source link

Document the best way to get a specific table type #145

Closed SimonGreenhill closed 3 years ago

SimonGreenhill commented 3 years ago

e.g. how do you find the table that contains, say, cognates. Is it safe to assume that cldf.objects('CognateTable') is correct, or is something like cldf.get('CognateTable') better?

xrotwang commented 3 years ago

This is what we have in terms of docs for this issue so far: https://pycldf.readthedocs.io/en/latest/dataset.html#accessing-schema-objects-components-tables-columns-etc Do you think a particular method should be made canonical?

SimonGreenhill commented 3 years ago

Hmm. I guess I want to make sure my code is robust for things like "check if there is a cognates table", "give me the cognates table". What is the best way to do that?

xrotwang commented 3 years ago

Ok, will try to add a something tutorial-like to the docs. The general idea, though, is that Dataset acts like a dictionary when it comes to accessing schema objects like tables and columns, i.e.:

For column objects the keys would be pairs like ("CognateTable", "id").

SimonGreenhill commented 3 years ago

Ahh, that helps -- thanks! I've been doing this:


def get_cognates(cldf):
    # do we have a cognates table?
    CognateTable = cldf.get('CognateTable')
    if CognateTable is None:
        raise CLDFException("No Cognate Table found in CLDF dataset!")

    FormTable = cldf.get('FormTable')
    if FormTable is None:  # pragma: no cover
        raise CLDFException("No Forms Table found in CLDF dataset!")

    # get form_id -> language_id mappings as these are not in cognates.csv
    forms = {r['ID']: r['Language_ID'] for r in FormTable}

    cognates = defaultdict(set)
    for i, row in enumerate(CognateTable, 1):
        if not row.get('Doubt'):
            if row['Form_ID'] not in forms:  # pragma: no cover
                raise CLDFException("Invalid form_id %s" % row['Form_ID'])
            cognates[row['Cognateset_ID']].add(forms[row['Form_ID']])
    return cognates
SimonGreenhill commented 3 years ago

...which would then be better as:


def get_cognates(cldf):
    # do we have a cognates table?
    if "CognateTable" not in cldf:
        raise CLDFException("No Cognate Table found in CLDF dataset!")

    if "FormTable" not in cldf:
        raise CLDFException("No Forms Table found in CLDF dataset!")

    # get form_id -> language_id mappings as these are not in cognates.csv
    forms = {r['ID']: r['Language_ID'] for r in dataset["FormTable"]}

    cognates = defaultdict(set)
    for i, row in enumerate(dataset["CognateTable"], 1):
        if not row.get('Doubt'):
            if row['Form_ID'] not in forms:  # pragma: no cover
                raise CLDFException("Invalid form_id %s" % row['Form_ID'])
            cognates[row['Cognateset_ID']].add(forms[row['Form_ID']])
    return cognates
xrotwang commented 3 years ago

Maybe we should just make the messages of the KeyErrors here https://github.com/cldf/pycldf/blob/275bb9f3aedfed671b67d85ba555f017fb38f501/src/pycldf/dataset.py#L369-L381 a bit more telling. Then you could just do dataset["FormTable"] above and get mostly the same behaviour.

SimonGreenhill commented 3 years ago

That would work, perhaps a new exception class TableError to enable a try..except to catch it?

Anaphory commented 2 years ago

Why does that error not inherit from KeyError, so that existing try/except blocks that catch the KeyError which has been thrown in this situation in the past continue to work?

Do you intend to use pycldf.dataset.SchemaError also in other places with other meanings?

Anaphory commented 2 years ago

If we have this kind of special Exceptions anyway (I thought I remembered @xrotwang being vehemently opposed to Exception subclasses you have to import to catch them in the past? Am I misremembering, or has your opinion changed?), can we please use it fully?

In one bit of code which I'm really not proud of, I just found the following snippet:

    try:
        ...
        # Try accessing a lot of different columns a subsequent computation needs
    except KeyError:
        formatted_lines = traceback.format_exc().splitlines()
        match = re.match(r'.*"(.*?)", "(.*?)".*', formatted_lines[2])
        cli.Exit.INVALID_COLUMN_NAME(
            f"The {match.group(1)} is missing the column {match.group(2)} which is required by cldf."
        )

A clean split into pycldf.dataset.TableMissingError and pycldf.dataset.ColumnMissingError, each deriving from pycldf.dataset.SchemaError and KeyError, and each carrying in attributes the specific table/column that is missing, would make this kind of test and error reporting to the user much more ergonomical. So I guess I will keep the abyssmal hack you see above until I know whether to change it into except ColumnMissingError as e: complain_to_user_about(e.column) or a separate try/except block around every column I need to check for.

xrotwang commented 2 years ago

@Anaphory SchemaError does subclass KeyError.

xrotwang commented 2 years ago

@Anaphory

>>> from pycldf import Wordlist
>>> wl = Wordlist.in_dir('.')
>>> try:
...     wl['xtable']
... except KeyError as e:
...     print(str(e))
...     pass
... 
'Dataset has no table "xtable"'

I wouldn't consider changing the error message a breaking change.

Anaphory commented 2 years ago

Strange, it does indeed inherit from KeyError, which makes me very confused why my code broke. I'll investigate. Changing the error message is not a breaking change, sure, and that error message is actually nice.

Anaphory commented 2 years ago

I keep doing these things where I slightly misread an error message, go on a wild spree trying to fix it in all the wrong places, and then think something is wrong in pycldf because all my wrong fixes don't work, don't I? Problem between screen and chair, as my dad tends to say.

xrotwang commented 2 years ago

As a rule of thumb: 99% of all bugs you find are in your own code. It takes a couple of years to accept that - and then, the year after you find a bug in python :)