cldf-datasets / doreco

CLDF dataset derived from DoReCo's core corpus
https://doreco.info/
3 stars 0 forks source link

update query #21

Closed FredericBlum closed 1 year ago

FredericBlum commented 1 year ago

@xrotwang I had now time to look at the updated query you provided. Thank you again for all this work. Trying to add the standardization and the word count, I am running into two problems:

Conceptually: Do we need two views for this? To get at the frequency, I need to divide the number of occurrences of a word (w.cldf_name) by the total words in the language (w.cldf_id). However, the first one I only get by grouping my data on language and w.cldf_name, and the second by only grouping on the language. Is there a way to combine those groupings in a single view?

Computationally: The way I set up things right now (two views, dividing their results) kills my laptop, and I do not understand why. Could you take a look at it and tell me what I am doing wrong, so that I can fix it?

All this is not urgent, as you can see by the fact that I as well did not have time to look at this previously. I very much appreciate all the learning I am getting about SQL by this though.

FredericBlum commented 1 year ago

ok, I have no idea how I just merged this, I just checked out "main" from the CLI and going on github suddenly the PR is merged? Feels very weird. Should we revert this?

xrotwang commented 1 year ago

I can have a look tomorrow

Frederic Blum @.***> schrieb am So., 23. Juli 2023, 14:05:

ok, I have no idea how I just merged this, I just checked out "main" from the CLI and going on github suddenly the PR is merged? Feels very weird. Should we revert this?

— Reply to this email directly, view it on GitHub https://github.com/cldf-datasets/doreco/pull/21#issuecomment-1646823128, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGUOKEQ4IAZRMK7BPNCJMLXRUHRZANCNFSM6AAAAAA2ULICWQ . You are receiving this because you were mentioned.Message ID: @.***>

xrotwang commented 1 year ago

@Tarotis re stdev as window function: I think the difference between aggregate functions and window functions is well explained here https://www.sqlitetutorial.net/sqlite-window-functions/ . Basically, an aggregate function computes one value for a set of input rows, whereas a window functions computes a value for each row. stdev is an aggregate function, thus can't be used without a corresponding GROUP BY clause that reduces groups rows (the input for the aggregate function), but selects only one row per group.

Coming from SQL and looking at R, I was equally confused by this distinction being hidden via the "we are always working with one dataframe" paradigm :)

xrotwang commented 1 year ago

Re

Is there a way to combine those groupings in a single view?

What we do instead is pulling the two views into the main query appropriately (see https://github.com/cldf-datasets/doreco/blob/62c48ed13419ff30af1481dbe04e0362da860315/init_query.sql). Conceptually, I see these two views as lookup tables. For each row (i.e. phone) in the main query, we get the associated word and language, and then look up the wordstats and langstats.