FredericBlum closed this issue 1 year ago.
Custom tables are not components, i.e. not specified by CLDF. To enumerate all tables, you could use

```python
>>> from pycldf import Dataset
>>> doreco = Dataset.from_metadata('cldf/StructureDataset-metadata.json')
>>> for table in doreco.tables:
...     print(table.url)
...
values.csv
languages.csv
contributions.csv
phones.csv
words.csv
metadata.csv
glosses.csv
```
I am still struggling with accessing the `phones.csv` table. What I want to do: access all phones, and add information from the LanguageTable and `metadata.csv` to this object. I can access `doreco.tables` as you showed, but how do I get to the content of those tables? I do not quite understand the description of pycldf's ORM for this use case. From the README it seems like I should use `doreco.objects('phones.csv')`, but then I get an error, so I can never access `phone.language` or `phone.metadata` etc. As there is no ValueTable, what is my entry point to all the data points and their cross-links? Could you provide me with a basic example that I could expand?
`pycldf.orm` does not work for custom tables. So you'd access the data simply by iterating over the table:

```python
phones = {r['ID']: r for r in cldf['phones.csv']}
```

You'll get back a `dict` per row, with typed data according to the metadata. I.e. that's basically `csvw` functionality.
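The cross-linking asked about above can then be done by hand on those dicts. A minimal sketch, using toy rows in place of the real `doreco['phones.csv']` and `doreco['LanguageTable']` iterators (the language `Name` is made up for illustration; the other column names follow the dataset):

```python
# Toy stand-ins for the dicts you get when iterating the tables; in
# practice these rows would come from doreco['LanguageTable'] and
# doreco['phones.csv'].
languages = {
    'taba1259': {'ID': 'taba1259', 'Name': 'Tabasaran'},
}
phones = [
    {'ph_ID': 'taba1259_p1', 'ph': '<p:>', 'Language_ID': 'taba1259'},
]

# Attach the matching LanguageTable row to each phone via Language_ID.
enriched = [dict(p, Language=languages[p['Language_ID']]) for p in phones]
```

Each enriched row then carries its language row under the `Language` key, which covers the "phone.language" use case without the ORM.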
Considering the size of the tables it may even be worth looking into working with SQLite here, i.e. with the result of

```shell
cldf createdb cldf/metadata.json doreco.sqlite
```
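Querying the resulting database would then be plain SQL. A sketch using an in-memory table as a stand-in for `doreco.sqlite`; the assumption here is that the custom table keeps its file name `phones.csv` as table name in the generated db:

```python
import sqlite3

# In-memory stand-in for the db created by `cldf createdb`; the table
# and column names mirror the phones.csv rows shown in this thread.
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE "phones.csv" (ph_ID TEXT, ph TEXT, Language_ID TEXT)')
con.execute('INSERT INTO "phones.csv" VALUES (?, ?, ?)',
            ('taba1259_p1', '<p:>', 'taba1259'))

# Filtering nearly two million phones by language is a cheap query here.
rows = con.execute(
    'SELECT ph_ID, ph FROM "phones.csv" WHERE Language_ID = ?',
    ('taba1259',)).fetchall()
```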
> phones = {r['ID']: r for r in cldf['phones.csv']}

But what exactly is `cldf` in this case? If I load the dataset like this:

```python
doreco = Dataset.from_metadata('../../doreco_cldf/cldf/StructureDataset-metadata.json')
```

I can neither call `doreco['phones.csv']`, nor `doreco.tables['phones.csv']`, nor `doreco.tables.url('phones.csv')`. I feel like I am missing basic CLDF knowledge here :(
```python
>>> from pycldf import Dataset
>>> doreco = Dataset.from_metadata('cldf/StructureDataset-metadata.json')
>>> phones = list(doreco['phones.csv'])
>>> len(phones)
1866499
>>> phones[0]
OrderedDict([('Language_ID', 'taba1259'), ('Filename', 'doreco_taba1259_mc_tabasaran_belt'), ('speaker', 'taba1259_TS01'), ('ph_ID', 'taba1259_p1'), ('ph', '<p:>'), ('start', '0.000'), ('end', '3.504'), ('duration', '3504'), ('wd_ID', 'taba1259_w1')])
```
`doreco['phones.csv']` is an instance of `csvw.Table`, so you can iterate over it.
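Note that the rows shown above carry their identifier in the `ph_ID` column, not in `ID`, so an index over the phones would be keyed accordingly. A sketch with a toy row standing in for the real table:

```python
# Toy row mirroring the OrderedDict output above; the identifier
# column of phones.csv is ph_ID, not ID.
phone_rows = [
    {'ph_ID': 'taba1259_p1', 'ph': '<p:>', 'wd_ID': 'taba1259_w1'},
]
phones = {r['ph_ID']: r for r in phone_rows}
```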
Btw.: I just tried creating a SQLite db, but this fails because the dataset is declared as `StructureDataset` but is missing a `ValueTable`. So the CLDF module should be changed.
Aaaah, so the problem was that `phones.csv` has no `ID` column that could be retrieved, but the call was correct. Thank you!

What should be the content of the ValueTable, though? Should it be what is now `phones.csv`?
No, the dataset shouldn't be a `StructureDataset` at all, but `Generic`.
I'll prepare a PR, also adding IDs for phones and words.
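For reference, the CLDF module is declared via the `dc:conformsTo` property in the metadata file; switching to `Generic` would roughly mean changing that property (a sketch of just this one property, not the full metadata file):

```json
{
  "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#Generic"
}
```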
Ok. IDs already exist, it's just that they are named `ph_ID` and `wd_ID` right now. They are also linked to the CLDF `propertyUrl`.
Yes, I see. There are a few problems, though, which I still need to fix, e.g. non-unique IDs in metadata:

```shell
$ csvstat cldf/metadata.csv
  1. "ID"
	Type of data:          Text
	Contains null values:  False
	Unique values:         963
	Longest value:         13 characters
	Most common values:    bora1263_T039 (2x)
	                       even1259_T004 (2x)
	                       anal1239_T001 (1x)
	                       anal1239_T002 (1x)
	                       anal1239_T003 (1x)
```
What do you think we should do here:

```shell
$ grep T039 raw/bora1263_metadata.csv
T039,llijchu_ine_II1_00-16,JUM,61,certain,f,2005-08-09,certain,traditional narrative,na,all,Spanish,medium,constant,978,no
T039,llijchu_ine_II1_16-END,JUM,61,certain,f,2005-08-09,certain,traditional narrative,na,all,Spanish,medium,constant,4327,yes
```
For some reason, the file (T039) was split in two parts (00-16, 16-END). So I guess the unique name is the second column and not the file number.
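Deduplication could then key on that second column instead of the file number. A quick sketch with the two rows from the grep output (the column name `name` is an assumption for illustration; the raw file's header is not shown here):

```python
# The two bora1263 metadata rows that share the file number T039.
rows = [
    {'ID': 'T039', 'name': 'llijchu_ine_II1_00-16'},
    {'ID': 'T039', 'name': 'llijchu_ine_II1_16-END'},
]

# The file number is not unique, but the second column is, so it can
# serve as the identifier.
ids = [r['name'] for r in rows]
```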
Ah, ok. Apart from the naming, that makes sense, because `Filename` is also used as target for the foreign key from `words.csv`.
I am currently trying to import the CLDF dataset in Python with the following code:

However, the output only contains three components: ValueTable, LanguageTable, and ContributionTable. How do I have to adapt the code from cldfbench so that I can access the other, custom-added components as well? Could you point me to the relevant code, @xrotwang?