clld / glottolog3

glottolog2 re-implemented as CLLD app
MIT License

Filter for sub-families in Glottoscope? #135

Closed. joeylovestrand closed this 2 years ago

joeylovestrand commented 3 years ago

Would it be possible to allow Glottoscope to filter for sub-families (e.g. Chadic and not just all Afro-Asiatic)?

I would be particularly interested in having the Tally numbers for the levels of description.

(Will be including a Glottoscope map in my next presentation - so thanks for this tool!)

HedvigS commented 3 years ago

Good idea! It might be tricky to decide what level of subfamily to use, but settling on something like the WALS genera might work.

xrotwang commented 3 years ago

WALS genera are not defined for all Glottolog families, and not all WALS genera have corresponding Glottolog subgroups. I think the only pragmatic option (in terms of UI) is to allow drilling down one more level from top-level families.

HedvigS commented 3 years ago

Okay. I think that's fixable (the matching of WALS genera and Glottolog subgroups), but never mind.

Using just the first-order subgroups went a bit weird elsewhere, on the Grambank website, didn't it? For example, when a language-level languoid is a direct descendant of the root. But sure, that does sound easiest UI-wise.

Bibiko commented 3 years ago

Since levels and sub-groups are not clearly defined, maybe one simple solution would be to allow any glottocode to be set as a parameter (the start point of a tree, so to speak). For Chadic, one would enter chad1250 to get this branch only. This would work for any scenario, even for the dialects of a single language.
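
To illustrate the "any glottocode as start point" idea, here is a minimal sketch, not Glottoscope's actual code. It assumes the classification is available as a child-to-parent mapping of glottocodes. The function name `subtree`, the mapping `PARENT_OF`, and every glottocode in it except chad1250 and afro1255 are made up for illustration.

```python
def subtree(parent_of, root):
    """Return `root` plus every languoid below it.

    `parent_of` maps each glottocode to its parent's glottocode
    (an assumed input format, not something the site is known to expose).
    """
    # Invert the child -> parent mapping into parent -> children.
    children = {}
    for child, parent in parent_of.items():
        children.setdefault(parent, []).append(child)

    # Walk down from the chosen start point, whatever its level.
    found, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(children.get(node, []))
    return found


# Hypothetical toy tree: only chad1250 and afro1255 come from the thread.
PARENT_OF = {
    "chad1250": "afro1255",
    "aaaa1111": "chad1250",  # made-up language under Chadic
    "aaaa1112": "aaaa1111",  # made-up dialect of that language
    "bbbb2222": "afro1255",  # made-up language outside Chadic
}

branch = subtree(PARENT_OF, "chad1250")
print(sorted(branch))
```

Because the walk starts at whatever node is given, the same call works for a family, a subgroup, or a single language and its dialects, without having to define what counts as a "sub-family".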

xrotwang commented 2 years ago

@joeylovestrand would a cookbook recipe work for you? Something along the lines of the code below - although I'll add a bit more description of what's going on for a recipe.

$ csvgrep -c Parameter_ID -r"^classification" cldf/values.csv | csvgrep -c Value -m "chad1250" | csvcut -c Language_ID > chadic.csv
$ csvgrep -c Parameter_ID -r"^med$" cldf/values.csv > meds.csv
$ csvjoin -c Language_ID chadic.csv meds.csv | csvstat
  1. "Language_ID"

    Type of data:          Text
    Contains null values:  False
    Unique values:         206
    Longest value:         8 characters
    Most common values:    suku1272 (1x)
                           mina1276 (1x)
                           mbed1242 (1x)
                           gava1241 (1x)
                           buwa1243 (1x)

  2. "ID"

    Type of data:          Text
    Contains null values:  False
    Unique values:         206
    Longest value:         12 characters
    Most common values:    suku1272-med (1x)
                           mina1276-med (1x)
                           mbed1242-med (1x)
                           gava1241-med (1x)
                           buwa1243-med (1x)

  3. "Parameter_ID"

    Type of data:          Text
    Contains null values:  False
    Unique values:         1
    Longest value:         3 characters
    Most common values:    med (206x)

  4. "Value"

    Type of data:          Number
    Contains null values:  False
    Unique values:         5
    Smallest value:        0
    Largest value:         4
    Sum:                   517
    Mean:                  2,51
    Median:                3
    StDev:                 1,457
    Most common values:    4 (77x)
                           2 (47x)
                           0 (33x)
                           3 (33x)
                           1 (16x)

  5. "Code_ID"

    Type of data:          Text
    Contains null values:  False
    Unique values:         5
    Longest value:         21 characters
    Most common values:    med-wordlist_or_less (77x)
                           med-grammar_sketch (47x)
                           med-long_grammar (33x)
                           med-phonology_or_text (33x)
                           med-grammar (16x)

  6. "Comment"

    Type of data:          Boolean
    Contains null values:  True (excluded from calculations)
    Unique values:         1
    Most common values:    None (206x)

  7. "Source"

    Type of data:          Text
    Contains null values:  False
    Unique values:         169
    Longest value:         41 characters
    Most common values:    hh:hvw:JungraithmayrIbriszimow:CLR (11x)
                           hh:hw:Kraft:Chadic:II (5x)
                           hh:hs:Schuh:Bole-Tangale (5x)
                           hh:hw:Kraft:Chadic:III (4x)
                           hh:w:Brye:Jimjimen-Gude-Tsuvan-Sharwa (3x)

  8. "codeReference"

    Type of data:          Boolean
    Contains null values:  True (excluded from calculations)
    Unique values:         1
    Most common values:    None (206x)

Row count: 206
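
For readers without csvkit, the three pipeline steps above can be sketched with the Python standard library alone. This is an approximation under stated assumptions, not the cookbook recipe: it assumes `cldf/values.csv` has the columns shown in the csvstat output, and it matches the subgroup by plain substring, as `csvgrep -m "chad1250"` does. The function name `tally_med` and the sample rows are invented for illustration (only chad1250 and afro1255 come from the thread).

```python
import csv
import io
from collections import Counter


def tally_med(rows, subgroup):
    """Count `med` codes for languages whose classification mentions `subgroup`."""
    rows = list(rows)
    # Step 1: languages whose classification value contains the subgroup glottocode.
    members = {
        r["Language_ID"]
        for r in rows
        if r["Parameter_ID"].startswith("classification") and subgroup in r["Value"]
    }
    # Steps 2 and 3: keep the `med` rows of those languages and tally them by code.
    return Counter(
        r["Code_ID"]
        for r in rows
        if r["Parameter_ID"] == "med" and r["Language_ID"] in members
    )


# Made-up rows for illustration only; real data would come from cldf/values.csv.
SAMPLE = """\
ID,Language_ID,Parameter_ID,Value,Code_ID
aaaa1111-classification,aaaa1111,classification,afro1255/chad1250,
aaaa1111-med,aaaa1111,med,4,med-grammar
bbbb2222-classification,bbbb2222,classification,afro1255/chad1250,
bbbb2222-med,bbbb2222,med,4,med-grammar
cccc3333-classification,cccc3333,classification,afro1255/zzzz9999,
cccc3333-med,cccc3333,med,2,med-grammar_sketch
"""

tally = tally_med(csv.DictReader(io.StringIO(SAMPLE)), "chad1250")
for code, count in tally.most_common():
    print(code, count)
```

To run it on the real dataset, open `cldf/values.csv` with `newline=""` and `encoding="utf-8"` and pass a `csv.DictReader` over that file instead of the sample.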

xrotwang commented 2 years ago

@joeylovestrand here's the draft: https://github.com/glottolog/cookbook/blob/master/recipes/glottolog_cldf/documentation_status_for_subgroup.md

joeylovestrand commented 2 years ago

@xrotwang Thanks for this! I haven't used the cookbook, but it looks similar enough to R/Python that I assume I could figure it out. Was certainly easier to have Harald do it for me 😁 but it will be great to be able to update the numbers on my own!