The SQL database seems great for all cases that use the data directly. But for all kinds of preprocessing, I would still have to loop through the individual rows, right?
What kinds of preprocessing do you mean?
For my study, for example, I calculate the log phoneme rate (phonemes per second) in all interpausal units, or annotate the position of phonemes within their word. This is why I assume that looping through the dataframe is necessary anyway.
All these calculations involve a lot of looking at preceding and succeeding data points. Computationally heavy, but conceptually simple, I guess.
If narrowing down to "interpausal units" is simple to do with a SQL query, you might be able to cut down the amount of stuff to loop through considerably. In my experience, trial-and-error coding is a lot less effective when iterations take tens of seconds.
I wouldn't know, because I have no experience with SQL. Two examples:
Interpausal Units: add up all durations of elements between two rows where `ph == <p:>`, then divide the number of phones by this value.
Position: Does the preceding row have the same `wd_ID`? If not, the current phone is word-initial; if yes, it is not.
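Both computations can be expressed without an explicit Python-level loop. Here is a minimal sketch with pandas, assuming a `phones.csv` with one row per phone and columns `ph`, `duration` (in milliseconds), and `wd_ID` as described above (the column names come from this thread; the exact file layout is an assumption):

```python
import numpy as np
import pandas as pd

# Assumed layout, based on the examples above: one row per phone,
# with columns ph, duration (milliseconds), and wd_ID.
phones = pd.read_csv("phones.csv")

# Position: a phone is word-initial iff the preceding row has a different wd_ID.
phones["word_initial"] = phones["wd_ID"] != phones["wd_ID"].shift()

# Interpausal units: every pause marker (ph == "<p:>") starts a new unit,
# so a cumulative count of pause markers assigns a unit id to each row.
phones["ipu"] = (phones["ph"] == "<p:>").cumsum()
units = phones[phones["ph"] != "<p:>"]

# Phones per second within each unit, then the log of that rate.
rate = units.groupby("ipu").agg(
    n_phones=("ph", "size"),
    total_ms=("duration", "sum"),
)
rate["log_phones_per_sec"] = np.log(rate["n_phones"] / (rate["total_ms"] / 1000))
print(rate.head())
```

With `shift()` and `cumsum()`, the "look at the preceding row" logic happens inside pandas, so the two million rows are never iterated in Python itself.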
Ah, "adding up all the durations" reminds me that we really should model these using proper duration datatypes! And I admit that my first go at this would be in Python, too :)
Ah, `duration` seems to be just a decimal value in milliseconds, correct?
Yes, exactly. Usually the first thing I do with this table is to convert everything to milliseconds.
Working with the ~2 million rows in `phones.csv` via `pycldf` might be kind of cumbersome due to the slowness. Now you could pay the price once and get fast access later.
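Paying the price once could look like the sketch below: load the CSV into SQLite a single time, then run narrow SQL queries in every later session instead of re-reading the whole file (the file and table names here are made up for illustration; `to_sql` and `read_sql_query` are standard pandas):

```python
import sqlite3

import pandas as pd

# One-time cost: push the ~2M rows from the CSV into an SQLite file.
con = sqlite3.connect("phones.sqlite")
pd.read_csv("phones.csv").to_sql("phones", con, if_exists="replace", index=False)
con.execute("CREATE INDEX IF NOT EXISTS idx_wd ON phones (wd_ID)")
con.commit()

# Every later session: fast, narrow queries instead of re-parsing the CSV.
pauses = pd.read_sql_query(
    "SELECT COUNT(*) AS n_pauses FROM phones WHERE ph = '<p:>'", con
)
print(pauses)
con.close()
```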