cldf-datasets / doreco

CLDF dataset derived from DoReCo's core corpus
https://doreco.info/

Load custom tables in Python: A metadata issue? #9

Closed FredericBlum closed 1 year ago

FredericBlum commented 2 years ago

I am currently trying to import the CLDF dataset in Python with the following code:

from pycldf import Dataset
doreco = Dataset.from_metadata('doreco_cldf/cldf/StructureDataset-metadata.json')

for x in doreco.components:
    print(x)

However, the output only contains three components: ValueTable, LanguageTable, and ContributionTable. How do I have to adapt the code from cldfbench so that I can access the other, custom added components, as well? Could you point me to the relevant code @xrotwang ?

xrotwang commented 2 years ago

Custom tables are not components, i.e. not specified by CLDF. To enumerate all tables, you could use

>>> from pycldf import Dataset
>>> doreco = Dataset.from_metadata('cldf/StructureDataset-metadata.json')
>>> for table in doreco.tables:
...     print(table.url)
... 
values.csv
languages.csv
contributions.csv
phones.csv
words.csv
metadata.csv
glosses.csv

FredericBlum commented 2 years ago

I am still struggling with accessing the phones.csv table. What I want to do: Access all phones, and add information from the LanguageTable and metadata.csv to this object.

I can access the doreco.tables as you showed, but how do I get to the content of those tables? I do not quite understand the description of pycldf's ORM for this use-case. From the README it seems like I should use doreco.objects('phones.csv'), but then I get an error, so I can never access phone.language or phone.metadata etc. As there is no ValueTable, what is my entry point to all the data points and their cross-links? Could you provide me with a basic example that I could expand?

xrotwang commented 2 years ago

pycldf.orm does not work for custom tables. So you'd access the data simply by iterating over the table:

phones = {r['ID']: r for r in cldf['phones.csv']}

You'll get back a dict per row, with typed data according to the metadata. I.e. that's basically csvw functionality.
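For the use case described above (attaching language and file metadata to each phone), a minimal sketch could look like the following. It uses plain lists of dicts as stand-ins for the rows you get when iterating over the tables. The sample values, the `Name` and `genre` columns, and the helper name `enrich` are made up for illustration, and the sketch assumes that phones.csv and metadata.csv share a `Filename` column.

```python
# Toy rows standing in for what iterating over the tables yields (one dict per row).
# The values and the "Name"/"genre" columns are illustrative, not taken from the dataset.
languages = [{"ID": "taba1259", "Name": "Tabasaran"}]
metadata = [{"Filename": "doreco_taba1259_mc_tabasaran_belt", "genre": "narrative"}]
phones = [
    {
        "Language_ID": "taba1259",
        "Filename": "doreco_taba1259_mc_tabasaran_belt",
        "ph_ID": "taba1259_p1",
        "ph": "<p:>",
    },
]


def enrich(phones, languages, metadata):
    """Yield a copy of each phone row with its language and file metadata attached."""
    # Build the lookup tables once, so each phone costs two dict lookups.
    lang_by_id = {r["ID"]: r for r in languages}
    meta_by_file = {r["Filename"]: r for r in metadata}  # assumed shared column
    for phone in phones:
        row = dict(phone)
        row["language"] = lang_by_id.get(phone["Language_ID"])
        row["metadata"] = meta_by_file.get(phone["Filename"])
        yield row


enriched = list(enrich(phones, languages, metadata))
print(enriched[0]["language"]["Name"])
```

Because `enrich` is a generator, it can also be consumed row by row instead of being collected into a list, which matters for a table with well over a million rows.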

xrotwang commented 2 years ago

Considering the size of the tables it may even be worth looking into working with SQLite here, i.e. with the result of

cldf createdb cldf/metadata.json doreco.sqlite
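
With the data in SQLite, joins and aggregations can be pushed into SQL instead of being done in Python loops. The sketch below only illustrates the idea on a tiny in-memory database built with the standard sqlite3 module. The table and column names (`phones`, `languages`, `name`) and the inserted rows are placeholders chosen for the example and are not necessarily what cldf createdb generates.

```python
import sqlite3

# In-memory toy database; schema and rows are placeholders for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE languages (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE phones (
        ph_id TEXT PRIMARY KEY,
        language_id TEXT REFERENCES languages(id),
        ph TEXT
    );
    """
)
conn.execute("INSERT INTO languages VALUES (?, ?)", ("taba1259", "Tabasaran"))
conn.executemany(
    "INSERT INTO phones VALUES (?, ?, ?)",
    [("taba1259_p1", "taba1259", "<p:>"), ("taba1259_p2", "taba1259", "a")],
)

# Count phones per language by joining the two tables.
rows = conn.execute(
    """
    SELECT l.name, COUNT(*)
    FROM phones AS p
    JOIN languages AS l ON l.id = p.language_id
    GROUP BY l.name
    """
).fetchall()
print(rows)
conn.close()
```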

FredericBlum commented 2 years ago

phones = {r['ID']: r for r in cldf['phones.csv']}

But what exactly is cldf in this case? If I load the dataset like this: doreco = Dataset.from_metadata('../../doreco_cldf/cldf/StructureDataset-metadata.json')

I can neither call doreco['phones.csv'], nor doreco.tables['phones.csv'], nor doreco.tables.url('phones.csv'). I feel like I am missing basic CLDF knowledge here :(

xrotwang commented 2 years ago

>>> from pycldf import Dataset
>>> doreco = Dataset.from_metadata('cldf/StructureDataset-metadata.json')
>>> phones = list(doreco['phones.csv'])
>>> len(phones)
1866499
>>> phones[0]
OrderedDict([('Language_ID', 'taba1259'), ('Filename', 'doreco_taba1259_mc_tabasaran_belt'), ('speaker', 'taba1259_TS01'), ('ph_ID', 'taba1259_p1'), ('ph', '<p:>'), ('start', '0.000'), ('end', '3.504'), ('duration', '3504'), ('wd_ID', 'taba1259_w1')])

xrotwang commented 2 years ago

doreco['phones.csv'] is an instance of csvw.Table, so you can iterate over it.

xrotwang commented 2 years ago

Btw.: I just tried creating a SQLite db, but this fails because the dataset is declared as StructureDataset but is missing a ValueTable. So the CLDF module should be changed.

FredericBlum commented 2 years ago

Aaaah, so the problem was that "phones.csv" has no ID column that could be retrieved, but the call itself was correct. Thank you!

What should be the content of ValueTable though? Should this be what is now phones.csv?

xrotwang commented 2 years ago

No, the dataset shouldn't be a StructureDataset at all, but Generic.

xrotwang commented 2 years ago

I'll prepare a PR, also adding IDs for phones and words.

FredericBlum commented 2 years ago

Ok. IDs already exist, it's just that they are named ph_ID and wd_ID right now. They are also linked to the CLDF propertyUrl.

xrotwang commented 2 years ago

Yes, I see. There are a few problems, though, which I still need to fix, e.g. non-unique IDs in metadata:

$ csvstat cldf/metadata.csv 
  1. "ID"

    Type of data:          Text
    Contains null values:  False
    Unique values:         963
    Longest value:         13 characters
    Most common values:    bora1263_T039 (2x)
                           even1259_T004 (2x)
                           anal1239_T001 (1x)
                           anal1239_T002 (1x)
                           anal1239_T003 (1x)
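
The same check can be done in Python without csvkit. A minimal sketch, using a few of the IDs listed above as inline sample data rather than the real cldf/metadata.csv:

```python
import csv
import io
from collections import Counter

# Toy stand-in for cldf/metadata.csv, reduced to its ID column.
sample = io.StringIO(
    "ID\n"
    "bora1263_T039\n"
    "bora1263_T039\n"
    "even1259_T004\n"
    "even1259_T004\n"
    "anal1239_T001\n"
)

# Count how often each ID occurs and keep only those seen more than once.
counts = Counter(row["ID"] for row in csv.DictReader(sample))
duplicates = {id_: n for id_, n in counts.items() if n > 1}
print(duplicates)
```

To run this against the real file, `sample` would be replaced by the opened CSV file.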

xrotwang commented 2 years ago

What do you think we should do here:

$ grep T039 raw/bora1263_metadata.csv 
T039,llijchu_ine_II1_00-16,JUM,61,certain,f,2005-08-09,certain,traditional narrative,na,all,Spanish,medium,constant,978,no
T039,llijchu_ine_II1_16-END,JUM,61,certain,f,2005-08-09,certain,traditional narrative,na,all,Spanish,medium,constant,4327,yes

FredericBlum commented 2 years ago

What do you think we should do here:

$ grep T039 raw/bora1263_metadata.csv 
T039,llijchu_ine_II1_00-16,JUM,61,certain,f,2005-08-09,certain,traditional narrative,na,all,Spanish,medium,constant,978,no
T039,llijchu_ine_II1_16-END,JUM,61,certain,f,2005-08-09,certain,traditional narrative,na,all,Spanish,medium,constant,4327,yes

For some reason, the file (T039) was split in two parts (00-16, 16-END). So I guess the unique name is the second column and not the file number.
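
To illustrate with the two rows quoted above: the first column repeats, while the second column distinguishes the two parts. This is only a sketch on those two lines. It reads them positionally, since the header row of the raw file is not shown here.

```python
import csv
import io

# The two rows quoted above from raw/bora1263_metadata.csv (header row not included).
raw = io.StringIO(
    "T039,llijchu_ine_II1_00-16,JUM,61,certain,f,2005-08-09,certain,"
    "traditional narrative,na,all,Spanish,medium,constant,978,no\n"
    "T039,llijchu_ine_II1_16-END,JUM,61,certain,f,2005-08-09,certain,"
    "traditional narrative,na,all,Spanish,medium,constant,4327,yes\n"
)

rows = list(csv.reader(raw))
file_numbers = [r[0] for r in rows]  # first column: file number
names = [r[1] for r in rows]         # second column: name of the file part

# A column can serve as an identifier only if no value repeats.
print(len(set(file_numbers)) == len(rows))
print(len(set(names)) == len(rows))
```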

xrotwang commented 2 years ago

Ah, ok. Apart from the naming that makes sense, because Filename is also used as target for the foreign key from words.csv.