Closed FredericBlum closed 1 year ago
@Tarotis I'll integrate corpora citations and a "full" example in the tutorial in a new PR.
@Tarotis here's a query that does most of what you need (although not the z_
stuff - but you might fiddle this in yourself):
SELECT
phone.*,
word.cldf_languageReference AS Language,
word.speaker_id AS Speaker,
utt.speech_rate AS SpeechRate,
utt.num_phones AS count_ipu,
CASE
WHEN phone.cldf_id in (select cldf_id from utterance_initials) then 1 else 0
END utt_initial,
CASE
WHEN phone.cldf_id in (select cldf_id from word_initials) then 1 else 0
END word_initial,
CASE
WHEN sound.cldf_cltsReference like '%stop%' then 'stop' else 'dunno'
END sound_class,
utt.log_speech_rate,
ws.WordFreq
FROM
"phones.csv" AS phone,
"words.csv" AS word,
ParameterTable AS sound,
wordstats AS ws
LEFT JOIN
(
SELECT
w.speaker_id, avg(p.duration) + 3 * stdev(p.duration) AS threshold
FROM
`phones.csv` AS p,
`words.csv` AS w
WHERE
p.cldf_parameterreference IS NOT NULL AND
p.wd_id = w.cldf_id
GROUP BY w.speaker_id
) AS t
ON
word.speaker_id = t.speaker_id
LEFT JOIN
utterances AS utt
ON
phone.u_ID = utt.u_id
WHERE
ws.cldf_languageReference = word.cldf_languageReference AND
phone.wd_id = word.cldf_id AND
phone.cldf_parameterReference = sound.cldf_id AND
sound.cldf_cltsReference LIKE '%_consonant' AND
sound.cldf_cltsReference NOT LIKE '%click%' AND
sound.cldf_cltsReference NOT LIKE '%implosive%' AND
sound.cldf_cltsReference NOT LIKE '%ejective%' AND
sound.cldf_cltsReference NOT LIKE '%long%' AND
NOT (phone.cldf_id IN (select cldf_id FROM utterance_initials) AND sound.cldf_cltsReference LIKE '%stop%
') AND
phone.duration < t.threshold
;
It requires another view:
CREATE VIEW IF NOT EXISTS wordstats AS
SELECT
w.cldf_languageReference,
count(distinct w.cldf_name) / count(w.cldf_id) as WordFreq
FROM
"words.csv" as w
GROUP BY w.cldf_languageReference;
Here's a lightly commented version:
SELECT
phone.*,
word.cldf_languageReference AS Language,
word.speaker_id AS Speaker,
utt.speech_rate AS SpeechRate,
utt.num_phones AS count_ipu,
CASE
WHEN phone.cldf_id in (select cldf_id from utterance_initials) then 1 else 0
END utt_initial,
CASE
WHEN phone.cldf_id in (select cldf_id from word_initials) then 1 else 0
END word_initial,
CASE
WHEN sound.cldf_cltsReference like '%stop%' then 'stop' else 'dunno'
END sound_class,
utt.log_speech_rate,
ws.WordFreq
FROM
"phones.csv" AS phone,
"words.csv" AS word,
ParameterTable AS sound,
wordstats AS ws -- language-level stats on words.
LEFT JOIN
(
SELECT
-- Here we compute the threshold for exclusion of unusually long phones.
w.speaker_id, avg(p.duration) + 3 * stdev(p.duration) AS threshold
FROM
`phones.csv` AS p,
`words.csv` AS w
WHERE
p.cldf_parameterreference IS NOT NULL AND
p.wd_id = w.cldf_id
GROUP BY w.speaker_id -- Thresholds are computed per speaker.
) AS t
ON
word.speaker_id = t.speaker_id
LEFT JOIN
utterances AS utt -- utterance-level stats such as speech rate.
ON
phone.u_ID = utt.u_id
WHERE
ws.cldf_languageReference = word.cldf_languageReference AND
phone.wd_id = word.cldf_id AND
phone.cldf_parameterReference = sound.cldf_id AND
-- We only consider non-long, pulmonic consonants ...
sound.cldf_cltsReference LIKE '%_consonant' AND
sound.cldf_cltsReference NOT LIKE '%click%' AND
sound.cldf_cltsReference NOT LIKE '%implosive%' AND
sound.cldf_cltsReference NOT LIKE '%ejective%' AND
sound.cldf_cltsReference NOT LIKE '%long%' AND
-- ... and exclude utterance-initial stops.
NOT (phone.cldf_id IN (select cldf_id FROM utterance_initials) AND sound.cldf_cltsReference LIKE '%stop%
') AND
-- We also exclude phonemes with unusually long durations, which hint at annotation errors.
phone.duration < t.threshold
;
Runs in ~40secs on my machine:
$ time cldfbench doreco.query --format tsv fullquery.sql > res.tsv
real 0m40,850s
user 0m38,607s
sys 0m3,830s
$ du -sh res.tsv
157M res.tsv
Thanks for setting all this up @xrotwang . I've just run the tutorial again, and it all works great. I am really amazed by the utility of SQL, and am really looking forward to learn more. However, I am still quite slow, especially when it comes to the combination of multiple complex queries. There are now two open questions from my side: