autotyp / autotyp-data

AUTOTYP data export
Creative Commons Attribution 4.0 International

MaximallyInflectedVerbSynthesis - variables lost in CLDF #51

Closed annagraff closed 1 year ago

annagraff commented 1 year ago

The following autotyp variables are not available in the CLDF:

[1] "autotyp_MaximallyInflectedVerbSynthesis_VerbHasBipartiteStem"
[2] "autotyp_MaximallyInflectedVerbSynthesis_VerbHasNounIncorporation"
[3] "autotyp_MaximallyInflectedVerbSynthesis_VerbHasVerbIncorporation"
[4] "autotyp_MaximallyInflectedVerbSynthesis_VerbInflectionExponenceType"
[5] "autotyp_MaximallyInflectedVerbSynthesis_VerbInflectionMaxFormativeCount"
[6] "autotyp_MaximallyInflectedVerbSynthesis_VerbIsPhonologicallyCoherent"
[7] "autotyp_MaximallyInflectedVerbSynthesis_VerbIsSyntacticallyCoherent"
[8] "autotyp_MaximallyInflectedVerbSynthesis_VerbAgreement"
[9] "autotyp_MaximallyInflectedVerbSynthesis_VerbHasAnyIncorporation"
[10] "autotyp_MaximallyInflectedVerbSynthesis_VerbHasNounOrVerbIncorporation"
[11] "autotyp_MaximallyInflectedVerbSynthesis_VerbInflectionMaxCategoryCount"
[12] "autotyp_MaximallyInflectedVerbSynthesis_VerbInflectionMaxCategorySansAgreementCount"
[13] "autotyp_MaximallyInflectedVerbSynthesis_VerbInflectionMaxFormativeSansAgreementCount"
[14] "autotyp_MaximallyInflectedVerbSynthesis_VerbIsProsodicallyCoherent"
[15] "autotyp_MaximallyInflectedVerbSynthesis_IsVerbAgreementSurveyComplete"
[16] "autotyp_MaximallyInflectedVerbSynthesis_IsVerbInflectionSurveyComplete"
[17] "autotyp_MaximallyInflectedVerbSynthesis_VerbIncorporation"
[18] "autotyp_MaximallyInflectedVerbSynthesis_VerbInflectionCategories"

annagraff commented 1 year ago

The only variable available in the dataset "MaximallyInflectedVerbSynthesis" seems to be called MaximallyInflectedVerbSynthesis as well, and it is faulty.

tzakharko commented 1 year ago

@annagraff Some data in AUTOTYP uses nested tables, and MaximallyInflectedVerbSynthesis is one such dataset. Since CLDF structured datasets do not directly support nesting, we export them as JSON. See this comment by Robert for more information: https://github.com/autotyp/autotyp-data/issues/2#issuecomment-1036035401

If you use Python, I'd recommend using the JSON export instead of CLDF for the complex datasets, as it gives you direct access to the structure and is more straightforward to work with.

I hope this addresses your issue. Please reopen it if I missed something.

nataliacp commented 1 year ago

I am reopening this with a proposal to increase data reusability. Right now, the variables listed in the first comment in this thread are stored as JSON under the MaximallyInflectedVerbSynthesis umbrella variable. Most of these variables, though, are simple binary per-language variables, and they could be incorporated straightforwardly into the CLDF format. The only problem is that the values for these variables can only be trusted for the languages that are TRUE for both housekeeping variables (IsVerbAgreementSurveyComplete and IsVerbInflectionSurveyComplete). What do you think about this proposal, @tzakharko and @xrotwang?

tzakharko commented 1 year ago

Hi @nataliacp, I have reopened the issue per your request.

If I understand the problem correctly, it should indeed be possible to treat each variable separately and thus only encode nested variables for the CLDF export. This would however require a substantial rewrite of the CLDF export pipeline. Definitely something I am interested in tackling, but unfortunately I cannot assign high priority to this.

xrotwang commented 1 year ago

@nataliacp @tzakharko Technically, this sounds doable. One could turn the list of sets with complex data (https://github.com/cldf-datasets/autotypcldf/blob/aa0801e6c442c7b4a0607955d81b872e8890b739/cldfbench_autotypcldf.py#L205-L225) into a dict, mapping set IDs to callables, which return additional, synthetic parameter values (late-late aggregation so to speak :) ).
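A minimal sketch of that idea in Python, not the actual cldfbench code: the helper names `synthesis_values` and `export_values`, the second set ID, and the sample record are all hypothetical, and the real export pipeline may be structured quite differently. It only illustrates replacing a list of set IDs with a dict whose values are callables that yield extra, flat parameter values.

```python
import json


def synthesis_values(record):
    """Yield (parameter_name, value) pairs for the scalar fields of one record.

    Hypothetical helper: list- or dict-valued fields are skipped, so they
    remain available only through the JSON-encoded umbrella value.
    """
    for key, value in record.items():
        if isinstance(value, (list, dict)):
            continue
        yield "MaximallyInflectedVerbSynthesis_" + key, value


# Dict instead of a plain list: set ID -> callable producing synthetic values.
# None means "export the JSON value only".
COMPLEX_SETS = {
    "MaximallyInflectedVerbSynthesis": synthesis_values,
    "SomeOtherNestedSet": None,  # hypothetical set without a derivation
}


def export_values(set_id, record):
    """Return the JSON umbrella value plus any derived flat values."""
    rows = [(set_id, json.dumps(record, sort_keys=True))]
    derive = COMPLEX_SETS.get(set_id)
    if derive is not None:
        rows.extend(derive(record))
    return rows


# Made-up record for illustration only.
rows = export_values(
    "MaximallyInflectedVerbSynthesis",
    {"VerbHasBipartiteStem": True, "VerbAgreement": [{"Marker": "A"}]},
)
for name, value in rows:
    print(name, value)
```

Keeping the umbrella value alongside the derived ones means nothing is lost for consumers that already parse the JSON.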

nataliacp commented 1 year ago

Thank you for the quick responses! Do you think that this could be done easily and relatively soon? @annagraff and I are using these variables and we are basing everything on the CLDF export. So, either we extract those variables to be used internally in Anna's pipeline, or we can wait for this to be implemented in autotyp so things are more straightforward and transparent.

tzakharko commented 1 year ago

@xrotwang The particular dataset in question is per-language if I remember correctly, so one probably won't need a new level of parameters — just rewrite the entire thing to process variables instead of tables. Of course, it won't be as simple for datasets that describe constructions instead of languages. If a general solution is required, one would probably need to do specialised mapping for each complex dataset to preserve the data semantics.

@nataliacp I won't be able to tackle this in the near future unfortunately. Maybe you or @annagraff could submit a PR? Alternatively you can try using the JSON export directly.

annagraff commented 1 year ago

Unfortunately, my pipeline is entirely in R, so my local fix for this would not be easily convertible to a pull request for the full CLDF generation pipeline. However, if @xrotwang thinks that a full solution is desirable, I could wait for it to be implemented, if this happens within the next month or so.

tzakharko commented 1 year ago

If your pipeline is in R, why don't you use the R export directly? That's the cleanest data representation you have at your disposal.

annagraff commented 1 year ago

I am working with 5 databases, one of which is autotyp. The others are all available in CLDF but not in R format, and it would make my code much messier to write a specific pipeline for autotyp, especially if CLDF is also available here. The only autotyp module I am having problems with is this Synthesis module, because all other modules I care about are identical between the two data structures, CLDF and RData.

tzakharko commented 1 year ago

Ok, this makes sense, I understand your predicament.

What would be a good way to move forward here? I won't be able to address this in a reasonable time frame. Unless someone submits a PR that takes care of this change, it might be the least painful thing for you @annagraff to implement a workaround and extract these variables from the embedded JSON.
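For illustration, such a workaround could look roughly like the Python sketch below (the same logic would carry over to R). It assumes the standard CLDF value table columns (ID, Language_ID, Parameter_ID, Value); the parameter ID "68" is borrowed from the SQL query later in this thread, and the rows, language IDs, and values are made up. The function name `extract_scalars` is hypothetical.

```python
import csv
import io
import json

# Toy stand-in for the CLDF values table; all rows are invented.
VALUES_CSV = '''ID,Language_ID,Parameter_ID,Value
1,lang1,68,"{""VerbHasBipartiteStem"": true, ""VerbInflectionCategories"": [""tense""]}"
2,lang2,68,"{""VerbHasBipartiteStem"": null, ""VerbInflectionMaxCategoryCount"": 4}"
3,lang1,12,yes
'''


def extract_scalars(csv_text, parameter_id):
    """Flatten the JSON values of one parameter into (language, variable, value) rows.

    Missing values (JSON null) and list- or dict-valued fields are skipped.
    """
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["Parameter_ID"] != parameter_id:
            continue
        record = json.loads(row["Value"])
        for name, value in record.items():
            if value is None or isinstance(value, (list, dict)):
                continue
            rows.append((row["Language_ID"], name, value))
    return rows


flat = extract_scalars(VALUES_CSV, "68")
for language, name, value in flat:
    print(language, name, value)
```

With a real dataset, the inline string would be replaced by the contents of the value table file.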

xrotwang commented 1 year ago

I could have a closer look sometime this week - and if things are as expected, I might be able to provide a PR later next week.

(An analysis using multiple databases and profiting from all being available in CLDF is too good an advertisement :) )

nataliacp commented 1 year ago

Thank you, Robert! I hope it goes smoothly!

xrotwang commented 1 year ago

I don't know how you access the CLDF data from R. But if you happen to use the "CLDF-via-SQLite" approach (see https://github.com/cldf/cookbook/tree/master/recipes/cldf_r#working-with-cldf-via-sqlite), you could exploit the pretty good JSON support built into SQLite:

sqlite> select 
    l.cldf_name, json_extract(v.cldf_value, '$.VerbHasBipartiteStem') 
from
    valuetable as v, languagetable as l 
where 
    v.cldf_languageReference = l.cldf_id and v.cldf_parameterReference = 68
limit 20;
Language                  Value
Belhare                   1
Hungarian                 0
English                   0
Warlpiri                  0
Fijian (Boumaa)           0
Thai                      0
Mandarin                  0
Ingush                    1
Songhai (Koyra Chiini)    0
Yoruba                    1
Persian                   1
Ainu
Rama                      1
Russian                   0
Georgian                  0
Karen (Sgaw)
Cree (Plains)             1
Lakhota                   1
Adyghe (West Circassian)
Zapotec (Isthmus)         0

xrotwang commented 1 year ago

SQLite's json_extract function shouldn't be too hard to use via dbplyr, see https://dbplyr.tidyverse.org/articles/translation-function.html#unknown-functions

xrotwang commented 1 year ago

For completeness:

select
    l.cldf_name, json_extract(v.cldf_value, '$.VerbHasBipartiteStem') 
from 
    valuetable as v, languagetable as l 
where 
    v.cldf_languageReference = l.cldf_id and 
    v.cldf_parameterReference = 68 and 
    json_extract(v.cldf_value, '$.IsVerbInflectionSurveyComplete') = 1 and 
    json_extract(v.cldf_value, '$.IsVerbAgreementSurveyComplete') = 1;

seems to be what we want here.
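As a rough way to try this query outside the sqlite shell, here is a self-contained Python sketch using the standard-library sqlite3 module. It assumes the underlying SQLite build includes the JSON functions, which is the default in recent versions. The two toy tables only mimic the column names used in the query, and the survey flags in the sample rows are invented, not taken from AUTOTYP.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table languagetable (cldf_id text primary key, cldf_name text);
create table valuetable (
    cldf_id text primary key,
    cldf_languageReference text,
    cldf_parameterReference text,
    cldf_value text
);
""")

# Invented sample data: (language ID, name, JSON record).
samples = [
    ("l1", "Belhare", {"VerbHasBipartiteStem": True,
                       "IsVerbInflectionSurveyComplete": True,
                       "IsVerbAgreementSurveyComplete": True}),
    ("l2", "Hungarian", {"VerbHasBipartiteStem": False,
                         "IsVerbInflectionSurveyComplete": True,
                         "IsVerbAgreementSurveyComplete": True}),
    ("l3", "Ainu", {"VerbHasBipartiteStem": None,
                    "IsVerbInflectionSurveyComplete": False,
                    "IsVerbAgreementSurveyComplete": True}),
]
for lang_id, name, record in samples:
    conn.execute("insert into languagetable values (?, ?)", (lang_id, name))
    conn.execute(
        "insert into valuetable values (?, ?, ?, ?)",
        ("v-" + lang_id, lang_id, "68", json.dumps(record)),
    )

# Same query as above, with the parameter ID bound and a stable ordering added.
QUERY = """
select l.cldf_name, json_extract(v.cldf_value, '$.VerbHasBipartiteStem')
from valuetable as v, languagetable as l
where
    v.cldf_languageReference = l.cldf_id and
    v.cldf_parameterReference = ? and
    json_extract(v.cldf_value, '$.IsVerbInflectionSurveyComplete') = 1 and
    json_extract(v.cldf_value, '$.IsVerbAgreementSurveyComplete') = 1
order by l.cldf_name
"""
result = conn.execute(QUERY, ("68",)).fetchall()
for name, value in result:
    print(name, value)
```

In this toy data, Ainu is filtered out because one of its survey flags is false; JSON true and false come back from json_extract as 1 and 0.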

xrotwang commented 1 year ago

On @annagraff 's list above there are three (complex) list-valued variables:

I guess these shouldn't be available as regular parameters, correct @nataliacp @annagraff ?

annagraff commented 1 year ago

Yes, this is correct! @xrotwang

xrotwang commented 1 year ago

@annagraff does this look like what you'd expect:

select 
    p.cldf_name, count(v.cldf_id) 
from 
    parametertable as p, valuetable as v 
where 
    v.cldf_parameterreference = p.cldf_id and p.cldf_name like '%Maximally%'
group by p.cldf_name;
Name                                                                         Langs
MaximallyInflectedVerbSynthesis                                              451
MaximallyInflectedVerbSynthesis_VerbHasAnyIncorporation                      235
MaximallyInflectedVerbSynthesis_VerbHasBipartiteStem                         137
MaximallyInflectedVerbSynthesis_VerbHasNounIncorporation                     235
MaximallyInflectedVerbSynthesis_VerbHasNounOrVerbIncorporation               235
MaximallyInflectedVerbSynthesis_VerbHasVerbIncorporation                     235
MaximallyInflectedVerbSynthesis_VerbInflectionExponenceType                  220
MaximallyInflectedVerbSynthesis_VerbInflectionMaxCategoryCount               235
MaximallyInflectedVerbSynthesis_VerbInflectionMaxCategorySansAgreementCount  235
MaximallyInflectedVerbSynthesis_VerbInflectionMaxFormativeCount              220
MaximallyInflectedVerbSynthesis_VerbIsPhonologicallyCoherent                 88
MaximallyInflectedVerbSynthesis_VerbIsProsodicallyCoherent                   131
MaximallyInflectedVerbSynthesis_VerbIsSyntacticallyCoherent                  131
MaximallyInflectedVerbSynthesis_VerbProsodicCoherencyNotes                   48

annagraff commented 1 year ago

It does! Except the first "MaximallyInflectedVerbSynthesis" is no longer necessary, since it is the original entry with JSON syntax, right?

xrotwang commented 1 year ago

> It does! Except the first "MaximallyInflectedVerbSynthesis" is no longer necessary, since it is the original entry with JSON syntax, right?

It also includes the list-valued data. So for completeness, it should still be there, I think.

xrotwang commented 1 year ago

@tzakharko Here are the changes needed to make this happen: https://github.com/autotyp/autotyp-cldf-scripts/pull/1/files

tzakharko commented 1 year ago

@xrotwang Thanks for the patch!

@annagraff Can you please check that your pipeline works as expected with version 1.1.1 (https://github.com/autotyp/autotyp-data/tree/version-1.1.1)? If everything is fine, I will push it as a minor release.

annagraff commented 1 year ago

Many many thanks to both of you, @tzakharko @xrotwang!

It works and looks exactly the way it should. You can push this as a minor release, @tzakharko

Have a nice weekend!

tzakharko commented 1 year ago

New version has been released and pushed to Zenodo. Thanks everyone!