Closed xrotwang closed 4 years ago
To some extent that's alleviated now by the SQLite backend, which uses local names of ontology terms as column names.
Maybe this functionality should be implemented as a separate, alternative way to iterate over the rows of a table in a CLDF dataset? So iter_cldf_rows(table) would be implemented as follows:
def iter_cldf_rows(self, table):
    table = self[table]
    col_map = {}
    for col in table.tableSchema.columns:
        if col.propertyUrl and is_cldf_property(col.propertyUrl):
            col_map[col.name] = col.propertyUrl.split('#')[1]
    for row in table:
        yield {col_map.get(k, k): v for k, v in row.items()}
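To make the renaming concrete without a real dataset, here is a minimal self-contained sketch of the same idea. The `Column` tuple, the `CLDF_TERMS_PREFIX` constant and this version of `is_cldf_property` are stand-ins made up for illustration; the real csvw column objects and the real property check may well look different.

```python
from collections import namedtuple

# Hypothetical stand-in for a csvw column description: a name plus an
# optional propertyUrl, given here as a plain string.
Column = namedtuple('Column', ['name', 'propertyUrl'])

CLDF_TERMS_PREFIX = 'http://cldf.clld.org/v1.0/terms.rdf#'


def is_cldf_property(url):
    # Assumption: a CLDF property is any URL in the terms.rdf namespace.
    return url.startswith(CLDF_TERMS_PREFIX)


def iter_cldf_rows(columns, rows):
    """Yield rows with CLDF columns renamed to the local name of their term."""
    col_map = {}
    for col in columns:
        if col.propertyUrl and is_cldf_property(col.propertyUrl):
            col_map[col.name] = col.propertyUrl.split('#')[1]
    for row in rows:
        # Columns without a CLDF property keep their original name.
        yield {col_map.get(k, k): v for k, v in row.items()}


columns = [
    Column('ID', CLDF_TERMS_PREFIX + 'id'),
    Column('Glottolog_Code', CLDF_TERMS_PREFIX + 'glottocode'),
    Column('Comment', None),
]
rows = [{'ID': 'l1', 'Glottolog_Code': 'stan1295', 'Comment': 'German'}]
renamed = list(iter_cldf_rows(columns, rows))
```

With this, a consumer can write `row['glottocode']` regardless of what the dataset calls the column, while non-CLDF columns such as `Comment` pass through unchanged.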
Another alternative: what about using the CLDF metadata and terms.rdf to set properties on Table instances, e.g. cldf['LanguageTable'].glottocode is whatever column points to http://cldf.clld.org/v1.0/terms.rdf#glottocode? Then you'd easily know which fields are glottocode fields, so something like this would work:
for row in cldf['LanguageTable']:
    glottocode = row[cldf['LanguageTable'].glottocode]
Ok, that's a neat idea, too. Although I'd want an intermediate container object in between, e.g.
row[cldf['LanguageTable'].c.glottocode]
or we go even further and do:
row[cldf.t.LanguageTable.c.glottocode]
But this would need to be wired with all the schema manipulation methods. And we already have relatively easy column access:
row[cldf['LanguageTable', 'glottocode'].name]
After a bit more thinking, I'd now say row[cldf['LanguageTable', 'glottocode'].name] is good enough. And if the repeated column lookups turn out to be a performance problem, one can resort to assembling a column map before iterating over a table.
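A rough sketch of that column-map fallback, assuming nothing about pycldf internals: the `SCHEMA` dict and `column_name` helper below are hypothetical stand-ins for `cldf['LanguageTable', term].name`, so that the example runs on its own. The point is only that each term is resolved once, before the loop.

```python
# Hypothetical stand-in for cldf['LanguageTable', term].name: maps a term's
# local name to the column name used in this particular dataset.
SCHEMA = {'id': 'ID', 'name': 'Name', 'glottocode': 'Glottolog_Code'}


def column_name(term):
    return SCHEMA[term]


def iter_values(rows, terms):
    """Resolve each term to its column name once, then yield values keyed by term."""
    # All schema lookups happen here, outside the row loop.
    col_map = {term: column_name(term) for term in terms}
    for row in rows:
        yield {term: row[col] for term, col in col_map.items()}


rows = [
    {'ID': 'l1', 'Name': 'German', 'Glottolog_Code': 'stan1295'},
    {'ID': 'l2', 'Name': 'English', 'Glottolog_Code': 'stan1293'},
]
glottocodes = [v['glottocode'] for v in iter_values(rows, ['glottocode'])]
```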
I think I like your idea better, as it requires me to know less about the table. One thing though, is that I can't see when I'd ever want the other default iterator!
re: properties, it'd be nice to just have cldf.LanguageTable rather than cldf['LanguageTable'], and then you'd have row[cldf[cldf.LanguageTable, 'glottocode'].name], which is a bit easier to read.
But caching the table is easy, and Dataset.__getitem__ accepts Table objects as first item:
t = cldf['LanguageTable']
for row in t:
    row[cldf[t, 'glottocode'].name]
Oh, what about lightly subclassing dict, with "#glottocode" returning the field defined in the metadata? Then you don't need another iter_ function, the class-level iterator for row in cldf['table'] still works, and it should cost less than lots of property lookups:
for row in cldf['LanguageTable']:
    row['#glottocode']
edit -- or just add a method:
for row in cldf['LanguageTable']:
    row.get_term('glottocode')
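For illustration, both variants of this suggestion could be sketched roughly as below. `TermRow` and its `term_map` argument are hypothetical names, not anything pycldf or csvw provides; the mapping from term to column name is assumed to have been built from the metadata beforehand.

```python
class TermRow(dict):
    """A dict that also resolves '#term' keys via a term-to-column mapping."""

    def __init__(self, data, term_map):
        super().__init__(data)
        # Assumed to be derived from the metadata, e.g.
        # {'glottocode': 'Glottolog_Code'}.
        self._term_map = term_map

    def __getitem__(self, key):
        # '#glottocode' is translated to the dataset's column name;
        # any other key is looked up as in a plain dict.
        if isinstance(key, str) and key.startswith('#'):
            key = self._term_map[key[1:]]
        return super().__getitem__(key)

    def get_term(self, term):
        # Method-based variant of the same lookup.
        return self[self._term_map[term]]


row = TermRow(
    {'ID': 'l1', 'Glottolog_Code': 'stan1295'},
    {'glottocode': 'Glottolog_Code'},
)
by_key = row['#glottocode']
by_method = row.get_term('glottocode')
```

One caveat of this sketch: only `__getitem__` is overridden, so `row.get('#glottocode')` and `'#glottocode' in row` would not see the shortcut.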
(I may be a bit biased as a PyCharm user, though. This made me dislike dynamic attributes via setattr a lot.)
oh yes, I hadn't thought of that: using getattr to look for tables, yuck!
But the rows we iterate over (in fact the Table itself) are created by csvw already, not by pycldf. So customizing the row class would require an alternative iter method on pycldf.Dataset.
So what we could do within csvw is add a method get_property, because csvw knows about propertyUrls as well. But the shortcut #glottocode or similar would require knowledge of the way CLDF propertyUrls are structured, something csvw does not have, and row.get_property('http://cldf.clld.org/v1.0/terms.rdf#glottocode') isn't much of a shortcut anymore.
@SimonGreenhill I'm leaning towards simply recommending row[cldf['LanguageTable', 'glottocode'].name] as the pattern, and leaving it at this. Strong objections?
Ok, I still think it's clunky, but better to leave it as-is than implement another way that's not perfect. Let's see how annoying it gets.
At least it's somewhat consistent: Terms of the ontology are always strings, never attributes - which makes for a looser coupling, e.g. should we ever introduce ontology terms which would not be valid Python names.
The dict objects returned when iterating over CLDF tables should be augmented to allow lookup via term URIs for easier processing of standard column values.