cldf-datasets / doreco

CLDF dataset derived from DoReCo's core corpus
https://doreco.info/

valid CLDF with working conversion to sqlite #13

Closed xrotwang closed 1 year ago

xrotwang commented 2 years ago

Working with the ~2 million rows in phones.csv via pycldf can be cumbersome because it is slow. Now you can pay the price once:

$ time cldf createdb cldf/Generic-metadata.json doreco.sqlite
INFO    <cldf:v1.0:Generic at cldf> loaded in doreco.sqlite

real    3m25,673s
user    3m22,994s
sys     0m2,521s

and get fast access later:

$ time sqlite3 doreco.sqlite 'select count(*) from `phones.csv`'
1866499

real    0m0,021s
user    0m0,009s
sys     0m0,013s
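
The same count can also be run from Python with the standard-library `sqlite3` module. The following is a minimal sketch, not part of the dataset tooling: the helper `count_rows` is hypothetical, and a small in-memory table stands in for `doreco.sqlite` so the snippet runs on its own. The column names `ph` and `wd_ID` come from this thread; `duration` is an assumed name.

```python
import sqlite3


def count_rows(conn: sqlite3.Connection, table: str) -> int:
    """Return the number of rows in `table` (hypothetical helper)."""
    # The table name contains a dot, so it has to be quoted as an identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    return conn.execute(f"SELECT count(*) FROM {quoted}").fetchone()[0]


# Toy stand-in for the real database; replace ":memory:" with the path
# to doreco.sqlite to query the converted dataset instead.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "phones.csv" (ph TEXT, wd_ID TEXT, duration REAL)')
conn.executemany(
    'INSERT INTO "phones.csv" VALUES (?, ?, ?)',
    [("a", "w1", 0.25), ("b", "w1", 0.25), ("<p:>", "w2", 0.5)],
)

n_rows = count_rows(conn, "phones.csv")
print(n_rows)
```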

FredericBlum commented 2 years ago

The SQL database seems great for all cases that use the data directly. But for all kinds of preprocessing, I would still have to loop through the individual rows, right?

xrotwang commented 2 years ago

What kinds of preprocessing do you mean?

FredericBlum commented 2 years ago

For my study, for example, I calculate the log-count of phonemes per second in all interpausal units, or annotate the position of phonemes within their word. This is why I assume that looping through the data-frame is necessary anyway.

FredericBlum commented 2 years ago

All calculations involve a lot of looking at preceding and succeeding data points. Computationally heavy, but conceptually simple I guess.

xrotwang commented 2 years ago

If narrowing down to "interpausal units" is simple to do with a SQL query, you might be able to cut down the amount of stuff to loop through considerably. From my experience, trial-and-error coding is a lot less effective if iterations take tens of seconds.
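
As a rough illustration of letting SQL do the narrowing, the sketch below fetches only the non-pause rows and only the columns needed, so the Python loop touches less data. The generator `non_pause_rows` is hypothetical, the column name `duration` is an assumption, and ordering by `rowid` assumes the rows were inserted in CSV order. A toy in-memory table is used so the snippet runs without the real database.

```python
import sqlite3


def non_pause_rows(conn, pause_label="<p:>"):
    """Yield (ph, wd_ID, duration) for every row that is not a pause."""
    query = (
        'SELECT ph, wd_ID, duration FROM "phones.csv" '
        "WHERE ph != ? ORDER BY rowid"  # assumes rowid follows CSV order
    )
    # The cursor is iterated lazily, so rows are not all loaded at once.
    yield from conn.execute(query, (pause_label,))


# Toy stand-in for doreco.sqlite.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "phones.csv" (ph TEXT, wd_ID TEXT, duration REAL)')
conn.executemany(
    'INSERT INTO "phones.csv" VALUES (?, ?, ?)',
    [("<p:>", "w0", 0.5), ("a", "w1", 0.25), ("b", "w1", 0.25), ("<p:>", "w2", 0.5)],
)

selected = list(non_pause_rows(conn))
print(selected)
```

Dropping the pause rows also drops the unit boundaries, so a filter like this suits per-phone statistics rather than per-unit ones.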

FredericBlum commented 2 years ago

I wouldn't know, because I have no experience with SQL. Two examples:

Interpausal units: Add up the durations of all elements between two rows where ph == <p:>, then divide the number of phones by this value.

Position: Does the preceding row have the same wd_ID? If not, the current phone is word-initial; if so, it is not.
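
The two examples can be sketched in plain Python over rows in file order. This is only an illustration under stated assumptions: rows are dicts with the keys `ph` and `wd_ID` named in this thread plus an assumed `duration` key, the function names are hypothetical, and the toy rows (including the wd_ID values on pause rows) are made up. The rate is expressed per whatever unit `duration` is stored in.

```python
from itertools import groupby

PAUSE = "<p:>"


def word_initial_flags(rows):
    """True where the preceding row has a different wd_ID (or does not exist)."""
    flags = []
    previous = None
    for row in rows:
        flags.append(row["wd_ID"] != previous)
        previous = row["wd_ID"]
    return flags


def ipu_rates(rows):
    """Number of phones divided by summed duration, per run between pauses."""
    rates = []
    for is_pause, group in groupby(rows, key=lambda r: r["ph"] == PAUSE):
        if is_pause:
            continue
        unit = list(group)
        total = sum(r["duration"] for r in unit)
        if total > 0:  # skip degenerate units to avoid division by zero
            rates.append(len(unit) / total)
    return rates


# Made-up toy rows in file order.
rows = [
    {"ph": "<p:>", "wd_ID": "w0", "duration": 0.5},
    {"ph": "a", "wd_ID": "w1", "duration": 0.25},
    {"ph": "b", "wd_ID": "w1", "duration": 0.25},
    {"ph": "c", "wd_ID": "w2", "duration": 0.5},
    {"ph": "<p:>", "wd_ID": "w3", "duration": 0.25},
    {"ph": "d", "wd_ID": "w4", "duration": 0.25},
]

flags = word_initial_flags(rows)
rates = ipu_rates(rows)
print(flags)
print(rates)
```

Both functions make a single pass over the rows, and `math.log` can be applied to each rate afterwards if a log value is needed.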

xrotwang commented 2 years ago

Ah, "adding up all the durations" reminds me that we really should model these using proper duration datatypes! And I admit that my first go at this would be in Python, too :)

xrotwang commented 2 years ago

Ah, duration seems to be just a decimal value in milliseconds - correct?

FredericBlum commented 2 years ago

Yes, exactly. Usually the first thing I do with this table is to convert everything to milliseconds.