cldf / pycldf

python package to read and write CLDF datasets
https://cldf.clld.org
Apache License 2.0

Augment row `dict`s with term URIs keys #86

Closed xrotwang closed 4 years ago

xrotwang commented 5 years ago

The dict objects returned when iterating over CLDF tables should be augmented to allow lookup via term URIs for easier processing of standard column values.

xrotwang commented 5 years ago

To some extent that's alleviated now by the SQLite backend, which uses local names of ontology terms as column names.

xrotwang commented 4 years ago

Maybe this functionality should be implemented as a separate, alternative way to iterate over rows of a table in a CLDF dataset? So iter_cldf_rows(table) would be implemented as follows:

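def iter_cldf_rows(self, table):
    # Sketch of a Dataset method; is_cldf_property is a helper that does not exist yet.
    table = self[table]
    # Map local column names to the local names of their CLDF ontology terms.
    col_map = {}
    for col in table.tableSchema.columns:
        # propertyUrl is a URITemplate object, so work with its .uri string.
        if col.propertyUrl and is_cldf_property(col.propertyUrl.uri):
            col_map[col.name] = col.propertyUrl.uri.split('#')[1]
    for row in table:
        # Columns without a CLDF term keep their original name.
        yield {col_map.get(k, k): v for k, v in row.items()}

SimonGreenhill commented 4 years ago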

Another alternative: what about using the CLDF metadata and terms.rdf to set properties on Table instances? E.g. cldf['LanguageTable'].glottocode would be whatever column points to http://cldf.clld.org/v1.0/terms.rdf#glottocode. Then you'd easily know which fields are glottocode fields, so something like this would work:

for row in cldf['LanguageTable']:
    glottocode = row[cldf['LanguageTable'].glottocode]

xrotwang commented 4 years ago

Ok, that's a neat idea, too. Although I'd want an intermediate container object in between, e.g.

row[cldf['LanguageTable'].c.glottocode]

or we go even further and do:

row[cldf.t.LanguageTable.c.glottocode]

xrotwang commented 4 years ago

But this would need to be wired with all the schema manipulation methods. And we already have relatively easy column access:

row[cldf['LanguageTable', 'glottocode'].name]

xrotwang commented 4 years ago

After a bit more thinking, I'd now say, row[cldf['LanguageTable', 'glottocode'].name] is good enough. And if performance issues caused by the repeated column lookups are a problem, then resort to assembling a column map before iterating over a table.
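The "column map" fallback mentioned above could look roughly like the following. This is only a sketch: the `columns` list and `rows` are hard-coded stand-ins for what a dataset's metadata and table iteration would provide, and `build_column_map` is a hypothetical helper, not part of pycldf or csvw.

```python
# Namespace of the CLDF ontology terms, as quoted earlier in this thread.
CLDF_TERMS = "http://cldf.clld.org/v1.0/terms.rdf#"

# Stand-in for a table schema: (column name, propertyUrl string or None).
columns = [
    ("ID", CLDF_TERMS + "id"),
    ("Name", CLDF_TERMS + "name"),
    ("Glottolog_Code", CLDF_TERMS + "glottocode"),
    ("Comment_Internal", None),
]

# Stand-in for the row dicts yielded when iterating over a table.
rows = [
    {"ID": "l1", "Name": "German", "Glottolog_Code": "stan1295", "Comment_Internal": ""},
    {"ID": "l2", "Name": "English", "Glottolog_Code": "stan1293", "Comment_Internal": "x"},
]


def build_column_map(columns):
    """Map CLDF term local names to the column names used in this table."""
    return {
        url[len(CLDF_TERMS):]: name
        for name, url in columns
        if url and url.startswith(CLDF_TERMS)
    }


# Resolve term -> column name once, instead of once per row.
col_map = build_column_map(columns)
glottocodes = [row[col_map["glottocode"]] for row in rows]
```

With real data, the map would be built from the table's schema once and then reused inside the loop, so the per-row cost is a plain dict lookup.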

SimonGreenhill commented 4 years ago

I think I like your idea better, as it requires me to know less about the table. One thing though, is that I can't see when I'd ever want the other default iterator!

re: properties, it'd be nice to just have cldf.LanguageTable rather than cldf['LanguageTable'], and then you'd have row[cldf[cldf.LanguageTable, 'glottocode'].name], which is a bit easier to read.

xrotwang commented 4 years ago

But caching the table is easy, and Dataset.__getitem__ accepts Table objects as first item:

t = cldf['LanguageTable']
for row in t:
    row[cldf[t, 'glottocode'].name]

SimonGreenhill commented 4 years ago

Oh, what about lightly subclassing dict, with "#glottocode" returning the field defined in the metadata? Then you don't need another iter_ function, the class-level iterator for row in cldf['table'] still works, and it should cost less than lots of property lookups:

for row in cldf['LanguageTable']:
    row['#glottocode']

edit -- or just add a method:

for row in cldf['LanguageTable']:
    row.get_term('glottocode')

xrotwang commented 4 years ago

(I may be a bit biased as a PyCharm user, though. This made me dislike dynamic attributes via setattr a lot.)

SimonGreenhill commented 4 years ago

oh yes, I hadn't thought of that: using getattr to look for tables, yuck!

xrotwang commented 4 years ago

But the rows we iterate over (in fact the Table itself) are created by csvw already, not by pycldf. So customizing the row class would require an alternative iter method on pycldf.Dataset.
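To make the trade-off concrete, here is a minimal sketch of what such a custom row class plus an alternative iterator might look like. Everything in it is hypothetical: `TermRow` and `iter_term_rows` do not exist in pycldf or csvw, and the plain dicts stand in for the rows and column metadata that csvw would supply.

```python
CLDF_TERMS = "http://cldf.clld.org/v1.0/terms.rdf#"


class TermRow(dict):
    """A row dict that also resolves '#term' keys (hypothetical class)."""

    def __init__(self, data, term_to_column):
        super().__init__(data)
        self._term_to_column = term_to_column

    def __missing__(self, key):
        # dict.__getitem__ calls this only when `key` is not a real column name.
        if isinstance(key, str) and key.startswith("#"):
            column = self._term_to_column.get(key[1:])
            if column in self:
                return dict.__getitem__(self, column)
        raise KeyError(key)

    def get_term(self, term, default=None):
        """Method-style lookup by term local name, returning `default` if unmapped."""
        column = self._term_to_column.get(term)
        return self.get(column, default)


def iter_term_rows(rows, property_urls):
    """Wrap plain row dicts; `property_urls` maps column name -> propertyUrl string."""
    term_to_column = {
        url[len(CLDF_TERMS):]: name
        for name, url in property_urls.items()
        if url and url.startswith(CLDF_TERMS)
    }
    for row in rows:
        yield TermRow(row, term_to_column)


# Stand-in data; column names are made up for illustration.
property_urls = {
    "ID": CLDF_TERMS + "id",
    "Glottocode_Col": CLDF_TERMS + "glottocode",
    "Note": None,
}
rows = [{"ID": "l1", "Glottocode_Col": "stan1295", "Note": ""}]
row = next(iter_term_rows(rows, property_urls))
```

Both access styles from the earlier comments work on `row` here (`row['#glottocode']` and `row.get_term('glottocode')`), but only because the wrapping iterator is used instead of the default one.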

xrotwang commented 4 years ago

So what we could do within csvw is add a method get_property - because csvw knows about propertyUrls as well. But the shortcut #glottocode or similar would require knowledge of the way CLDF propertyUrls are structured - something csvw does not have, and row.get_property('http://cldf.clld.org/v1.0/terms.rdf#glottocode') isn't much of a shortcut anymore.

xrotwang commented 4 years ago

@SimonGreenhill I'm leaning towards simply recommending row[cldf['LanguageTable', 'glottocode'].name] as the pattern, and leaving it at that. Strong objections?

SimonGreenhill commented 4 years ago

Ok, I still think it's clunky, but better to leave as-is than implement another way that's not perfect. Let's see how annoying it gets.

xrotwang commented 4 years ago

At least it's somewhat consistent: Terms of the ontology are always strings, never attributes - which makes for a looser coupling, e.g. should we ever introduce ontology terms which would not be valid Python names.
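A small illustration of the "valid Python names" point, using only the standard library. Apart from glottocode, the term names below are made-up examples, not actual CLDF terms.

```python
import keyword


def usable_as_attribute(term):
    """True if `term` could be written as `obj.<term>` in Python source."""
    return term.isidentifier() and not keyword.iskeyword(term)


# "glottocode" is a CLDF term; "iso639-3" and "class" are hypothetical term names.
checks = {
    term: usable_as_attribute(term)
    for term in ["glottocode", "iso639-3", "class"]
}

# String keys have no such restriction.
row = {"glottocode": "stan1295", "iso639-3": "deu", "class": "example"}
value = row["iso639-3"]
```

Attribute-style access would break for the second and third names, while string lookup treats all three the same.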