cldf-datasets / doreco

CLDF dataset derived from DoReCo's core corpus
https://doreco.info/
3 stars 0 forks source link

Finalize query #24

Closed FredericBlum closed 1 year ago

FredericBlum commented 1 year ago

I am trying to finalize the query now, and make sure to address the following tasks:

Currently, I am at task 1). I have only made minor changes and tried to stay as close to the previous structure of the query as previously, but I still get an error thrown.

sqlite3.OperationalError: no such column: w.cldf_languageReference

So, I think the problem is that at some place in the query, it tries to join a table based on this info where it is not actually stored in the view. But where? @xrotwang Can you help me with the error tracking? I think now is a good time that I write the missing part myself, but I still need help in those aspects...

xrotwang commented 1 year ago

@Tarotis I cannot reproduce the error. With this PR checked out

$ git describe --all
heads/finalize_query

I ran the following (opting for including the ND data) without any errors:

cldfbench download cldfbench_doreco.py
cldfbench makecldf cldfbench_doreco.py --glottolog-version v4.8 --with-cldfreadme --with-zenodo
cldf createdb cldf doreco.sqlite
sqlite3 -echo doreco.sqlite < etc/views.sql
cldfbench doreco.query init_query.sql
FredericBlum commented 1 year ago

Ok, now I know where the problem lies: I have troubles updating the views. If I re-run everything, it works fine, but it seems that I cannot really overwrite the views by running the sqlite3 command again. Instead, I have to re-create the database. Is there a way around this?

xrotwang commented 1 year ago

Yes, we could do

DROP VIEW IF EXISTS <name>;

before an unconditional

CREATE VIEW <name> ...;
xrotwang commented 1 year ago

Sometimes, creating views can be computationally expensive, so you'd only want to do that if necessary. But this is not the case here. So switching all CREATE VIEW instances to DROP VIEW IF EXISTS ... CREATE VIEW as suggested above would be good.

FredericBlum commented 1 year ago

I now managed to include the word-count for all languages in the query.

Where I am failing is to compute the frequency of individual words now.

    ws.WordFreq AS freq,
    ls.WordCount AS count,
    (ws.WordFreq / ls.WordCount) AS word_frequency

This gives me the correct numbers for freq and count, but does throw 0's for the word_frequency. Why is that?

As I need to compute the standardized word_frequency as well, it might be best to separate this calculation from the SELECT command, correct? But would I put that towards a LEFT JOIN, or its own view, which calls other views?

FredericBlum commented 1 year ago

The DROP QUERY worked perfectly

FredericBlum commented 1 year ago

@xrotwang How long do you stay in Heidelberg? Maybe we find some time where we could work on this together? I think I could benefit a lot from that.

FredericBlum commented 1 year ago

In the current state, the computation seems fine, but there is a crucial problem: Not all rows receive the word-form frequency. I suspect that the reason may be the wordstats view, which is grouped on cldf_name, but selects for cldf_id. I am not really sure though. So the main idea was that I need to compute on cldf_name, but join on cldf_id. But somewhere, it goes wrong.

xrotwang commented 1 year ago

I guess we need two views, then. One wordformstats which groups by cldf_name and wordstats which groups be cldf_id. I'm not sure which terminology gets this across most transparently, but I think the difference is quite clear. Maybe wordstats should be word_in_utterance_stats or similar.

FredericBlum commented 1 year ago

I'm not quite sure how we would make use of the two queries. What would be the purpose of each?

xrotwang commented 1 year ago

"wordformstats" would need to be joined on w.cldf_name, "word_in_utterance_stats" on ph.word_id (or whatever it is). "wordformstats" would contribute the word frequency. (I don't remember what we need on word-in-utterance level).

FredericBlum commented 1 year ago

I have updated the query according to our discussion, but it seems to loop somewhere and doesn't finish. Could you help checking where this occurs? Probably in sdev or the query itself, as the formstats query runs smoothly. Alternatively, we could also check this using the tests you talked about.

xrotwang commented 1 year ago

@Tarotis just pushed a fix. Query runs fine on my side:

$ time cldfbench doreco.query init_query.sql > res.csv

real    1m44,199s
user    1m40,364s
sys     0m5,259s
$ wc -l res.csv 
867218 res.csv
FredericBlum commented 1 year ago

Thanks! I've added log-conversion for our counts (word form frequency, phones per word), and rounded the final values. It's all good now. Thanks a lot for all the work put into this, I am really glad with the results, and I think the SQL approach could become the default case for using the DoReCo data.