cldf-datasets / doreco

CLDF dataset derived from DoReCo's core corpus
https://doreco.info/
3 stars 0 forks source link

Tutorial and running my model #19

Closed FredericBlum closed 1 year ago

FredericBlum commented 1 year ago

Thanks for setting all this up @xrotwang . I've just run the tutorial again, and it all works great. I am really amazed by the utility of SQL, and am really looking forward to learn more. However, I am still quite slow, especially when it comes to the combination of multiple complex queries. There are now two open questions from my side:

xrotwang commented 1 year ago

@Tarotis I'll integrate corpora citations and a "full" example in the tutorial in a new PR.

xrotwang commented 1 year ago

@Tarotis here's a query that does most of what you need (although not the z_ stuff - but you might fiddle this in yourself):

SELECT
    phone.*,
    word.cldf_languageReference AS Language,
    word.speaker_id AS Speaker,
    utt.speech_rate AS SpeechRate,
    utt.num_phones AS count_ipu,
    CASE
        WHEN phone.cldf_id in (select cldf_id from utterance_initials) then 1 else 0
        END utt_initial,
    CASE
        WHEN phone.cldf_id in (select cldf_id from word_initials) then 1 else 0
        END word_initial,
    CASE
        WHEN sound.cldf_cltsReference like '%stop%' then 'stop' else 'dunno'
        END sound_class,
    utt.log_speech_rate,
    ws.WordFreq
FROM
    "phones.csv" AS phone,
    "words.csv" AS word,
    ParameterTable AS sound,
    wordstats AS ws
LEFT JOIN
    (
        SELECT
            w.speaker_id, avg(p.duration) + 3 * stdev(p.duration) AS threshold
        FROM
            `phones.csv` AS p,
            `words.csv` AS w
        WHERE
            p.cldf_parameterreference IS NOT NULL AND
            p.wd_id = w.cldf_id
        GROUP BY w.speaker_id
    ) AS t
ON
    word.speaker_id = t.speaker_id
LEFT JOIN
    utterances AS utt
ON
    phone.u_ID = utt.u_id
WHERE
    ws.cldf_languageReference = word.cldf_languageReference AND
    phone.wd_id = word.cldf_id AND
    phone.cldf_parameterReference = sound.cldf_id AND
    sound.cldf_cltsReference LIKE '%_consonant' AND
    sound.cldf_cltsReference NOT LIKE '%click%' AND
    sound.cldf_cltsReference NOT LIKE '%implosive%' AND
    sound.cldf_cltsReference NOT LIKE '%ejective%' AND
    sound.cldf_cltsReference NOT LIKE '%long%' AND
    NOT (phone.cldf_id IN (select cldf_id FROM utterance_initials) AND sound.cldf_cltsReference LIKE '%stop%
') AND
    phone.duration < t.threshold
;

It requires another view:

CREATE VIEW IF NOT EXISTS wordstats AS                                 
    SELECT                                                             
        w.cldf_languageReference,                                      
        count(distinct w.cldf_name) / count(w.cldf_id) as WordFreq     
    FROM                                                               
        "words.csv" as w                                               
    GROUP BY w.cldf_languageReference;
xrotwang commented 1 year ago

Here's a lightly commented version:

SELECT
    phone.*,
    word.cldf_languageReference AS Language,
    word.speaker_id AS Speaker,
    utt.speech_rate AS SpeechRate,
    utt.num_phones AS count_ipu,
    CASE
        WHEN phone.cldf_id in (select cldf_id from utterance_initials) then 1 else 0
        END utt_initial,
    CASE
        WHEN phone.cldf_id in (select cldf_id from word_initials) then 1 else 0
        END word_initial,
    CASE
        WHEN sound.cldf_cltsReference like '%stop%' then 'stop' else 'dunno'
        END sound_class,
    utt.log_speech_rate,
    ws.WordFreq
FROM
    "phones.csv" AS phone,
    "words.csv" AS word,
    ParameterTable AS sound,
    wordstats AS ws  -- language-level stats on words.
LEFT JOIN
    (
        SELECT
            -- Here we compute the threshold for exclusion of unusually long phones.
            w.speaker_id, avg(p.duration) + 3 * stdev(p.duration) AS threshold
        FROM
            `phones.csv` AS p,
            `words.csv` AS w
        WHERE
            p.cldf_parameterreference IS NOT NULL AND
            p.wd_id = w.cldf_id
        GROUP BY w.speaker_id  -- Thresholds are computed per speaker.
    ) AS t
ON
    word.speaker_id = t.speaker_id
LEFT JOIN
    utterances AS utt  -- utterance-level stats such as speech rate.
ON
    phone.u_ID = utt.u_id
WHERE
    ws.cldf_languageReference = word.cldf_languageReference AND
    phone.wd_id = word.cldf_id AND
    phone.cldf_parameterReference = sound.cldf_id AND
    -- We only consider non-long, pulmonic consonants ...
    sound.cldf_cltsReference LIKE '%_consonant' AND
    sound.cldf_cltsReference NOT LIKE '%click%' AND
    sound.cldf_cltsReference NOT LIKE '%implosive%' AND
    sound.cldf_cltsReference NOT LIKE '%ejective%' AND
    sound.cldf_cltsReference NOT LIKE '%long%' AND
    -- ... and exclude utterance-initial stops.
    NOT (phone.cldf_id IN (select cldf_id FROM utterance_initials) AND sound.cldf_cltsReference LIKE '%stop%
') AND
    -- We also exclude phonemes with unusually long durations, which hint at annotation errors.
    phone.duration < t.threshold
;
xrotwang commented 1 year ago

Runs in ~40secs on my machine:

$ time cldfbench doreco.query --format tsv fullquery.sql > res.tsv 

real    0m40,850s
user    0m38,607s
sys 0m3,830s
$ du -sh res.tsv 
157M    res.tsv