Improve metadata reporting for `value-list` data

tzakharko commented 2 years ago

Variables of type value-list take their values from controlled dictionaries that are usually defined in the database itself (the definition files). The way their values are currently represented in the metadata is not consistent. In some cases the metadata contains values that do not occur in the data, and vice versa. Some examples: GrammaticalMarkers::MarkerPositionBinned4, Register::OriginContinent, LocusOfMarkingPerMicrorelation::FloatingorClitic, etc.

Planned changes:

[ ] Improve the data validation procedures used by the export pipeline to catch such problems more reliably
[ ] Expose the explicit relationship between a variable and the controlled dictionary in the metadata
[ ] Metadata should only list the values that actually occur in the data
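
A minimal sketch of the kind of check the first and third items describe, assuming that both the dictionary definition and the exported data can be reduced to a plain collection of values per variable. The function name `check_value_list` and the sample values are illustrative only; this is not the actual export pipeline.

```python
def check_value_list(defined, observed):
    """Compare a controlled dictionary against the values found in the data.

    Returns (undefined, unused):
      undefined - values that occur in the data but not in the dictionary
      unused    - values listed in the dictionary but absent from the data
    """
    defined, observed = set(defined), set(observed)
    return sorted(observed - defined), sorted(defined - observed)


# Hypothetical example: the dictionary lacks "split", and "pre" never occurs.
defined = {"pre", "post", "in/simul"}
observed = ["post", "split", "in/simul", "post"]

undefined, unused = check_value_list(defined, observed)
print("in data but not in dictionary:", undefined)
print("in dictionary but not in data:", unused)
```

The first result would be a validation error; the second is what the metadata would have to drop in order to list only values that actually occur.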

These improvements are planned for the release v1.1.0 in mid-late spring 2022.

This issue supersedes #23, #25, #27, #28, #29

xrotwang commented 2 years ago

FWIW, here's the complete list of

- cases where there are no values defined, or
- undefined values in the data

no values: GrammaticalRelations Alignment SyntacticDomainCondition
no values: GrammaticalRelations GrammaticalRelations SyntacticDomainCondition
adding value Morphology DefaultLocusOfMarkingPerMacrorelation LocusOfMarkingBinned5 FloatingorClitic
no values: Morphology GrammaticalMarkers MarkerFusionBinned6
no values: Morphology GrammaticalMarkers MarkerBehaviorBinned4
adding value Morphology GrammaticalMarkers MarkerFusionBinned6 concatenative
adding value Morphology GrammaticalMarkers MarkerBehaviorBinned4 final
adding value Morphology GrammaticalMarkers MarkerPositionBinned4 split
adding value Morphology GrammaticalMarkers MarkerPositionBinned5 split
adding value Morphology GrammaticalMarkers MarkerFusionBinned6 stem
adding value Morphology GrammaticalMarkers MarkerBehaviorBinned4 spreading
adding value Morphology GrammaticalMarkers MarkerLocusBinned5 FloatingorClitic
adding value Morphology GrammaticalMarkers MarkerBehaviorBinned4 initial
adding value Morphology GrammaticalMarkers MarkerFusionBinned6 isolating
adding value Morphology GrammaticalMarkers MarkerPositionBinned4 in/simul
adding value Morphology GrammaticalMarkers MarkerLocusBinned6 D on H
adding value Morphology GrammaticalMarkers MarkerFusionBinned6 tonal
adding value Morphology GrammaticalMarkers MarkerBehaviorBinned4 on head
adding value Morphology GrammaticalMarkers MarkerFusionBinned6 reduplicative
adding value Morphology GrammaticalMarkers MarkerFusionBinned6 replacive
adding value NP NPStructure NPStructureDependentSemConstraintsBinned neutral
no values: Sentence ClauseLinkage IntuitiveClassification
no values: Sentence ClauseLinkage AnticipatoryArgumentMarking
no values: Sentence ClauseLinkage CataphoraConstraints
no values: Sentence ClauseLinkage CategoricalSymmetry
no values: Sentence ClauseLinkage ClauseLayer
no values: Sentence ClauseLinkage ClausePosition
no values: Sentence ClauseLinkage Embedding
no values: Sentence ClauseLinkage ExtractionConstraints
no values: Sentence ClauseLinkage FinitenessSimplified
no values: Sentence ClauseLinkage FocusMarkinginDependent
no values: Sentence ClauseLinkage FocusMarking
no values: Sentence ClauseLinkage IllocutionaryMarking
no values: Sentence ClauseLinkage IllocutionaryScope
no values: Sentence ClauseLinkage InterpropositionalSemanticRelation
no values: Sentence ClauseLinkage ReferenceTrackingSystem
no values: Sentence ClauseLinkage TenseMarking
no values: Sentence ClauseLinkage TenseScope
adding value Sentence ClauseLinkage IntuitiveClassification subordination
adding value Sentence ClauseLinkage AnticipatoryArgumentMarking none
adding value Sentence ClauseLinkage CategoricalSymmetry symmetrical
adding value Sentence ClauseLinkage ClauseLayer ad-S
adding value Sentence ClauseLinkage ClausePosition flexible-relational
adding value Sentence ClauseLinkage Embedding adjoined
adding value Sentence ClauseLinkage FinitenessSimplified finite
adding value Sentence ClauseLinkage InterpropositionalSemanticRelation conditional
adding value Sentence ClauseLinkage TenseMarking ok
adding value Sentence ClauseLinkage IntuitiveClassification coordination
adding value Sentence ClauseLinkage CataphoraConstraints cataphora_impossible
adding value Sentence ClauseLinkage ClausePosition fixed:pre-main
adding value Sentence ClauseLinkage InterpropositionalSemanticRelation inconsequential
adding value Sentence ClauseLinkage IntuitiveClassification ?
adding value Sentence ClauseLinkage CategoricalSymmetry asymmetrical
adding value Sentence ClauseLinkage FinitenessSimplified nonfinite
adding value Sentence ClauseLinkage InterpropositionalSemanticRelation narrative
adding value Sentence ClauseLinkage TenseMarking banned
adding value Sentence ClauseLinkage InterpropositionalSemanticRelation disjunction
adding value Sentence ClauseLinkage CataphoraConstraints cataphora_possible
adding value Sentence ClauseLinkage CategoricalSymmetry constraint-free
adding value Sentence ClauseLinkage FinitenessSimplified any
adding value Sentence ClauseLinkage FocusMarkinginDependent ok
adding value Sentence ClauseLinkage InterpropositionalSemanticRelation topic
adding value Sentence ClauseLinkage ClausePosition flexible-adjacent
adding value Sentence ClauseLinkage ExtractionConstraints extraction_impossible
adding value Sentence ClauseLinkage FocusMarking banned
adding value Sentence ClauseLinkage InterpropositionalSemanticRelation conjunction
adding value Sentence ClauseLinkage IntuitiveClassification cosubordination
adding value Sentence ClauseLinkage ExtractionConstraints extraction_possible
adding value Sentence ClauseLinkage FocusMarking ok
adding value Sentence ClauseLinkage InterpropositionalSemanticRelation purposive
adding value Sentence ClauseLinkage AnticipatoryArgumentMarking obligatory
adding value Sentence ClauseLinkage TenseMarking harmonic
adding value Sentence ClauseLinkage FocusMarkinginDependent banned
adding value Sentence ClauseLinkage InterpropositionalSemanticRelation causal
adding value Sentence ClauseLinkage ClauseLayer detached
adding value Sentence ClauseLinkage ClausePosition fixed:post-main
adding value Sentence ClauseLinkage InterpropositionalSemanticRelation alternation
adding value Sentence ClauseLinkage ClauseLayer attr
adding value Sentence ClauseLinkage InterpropositionalSemanticRelation ‘lest’
adding value Sentence ClauseLinkage ClauseLayer core
adding value Sentence ClauseLinkage Embedding subcategorized
adding value Sentence ClauseLinkage ClauseLayer ad-V
adding value Sentence ClauseLinkage InterpropositionalSemanticRelation dependent
adding value Sentence ClauseLinkage InterpropositionalSemanticRelation manner
adding value Sentence ClauseWordOrder WordOrderAPLex AO
adding value Sentence ClauseWordOrder WordOrderAPLex OA
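
To get an overview per variable, the lines above can be grouped with a few lines of Python. This is only a sketch: it assumes the two line shapes shown above (`no values: MODULE DATASET VARIABLE` and `adding value MODULE DATASET VARIABLE VALUE`, where the value may contain spaces), and `parse_report` is a made-up helper name.

```python
from collections import defaultdict


def parse_report(lines):
    """Group report lines by (module, dataset, variable)."""
    no_values = set()
    added = defaultdict(list)
    for line in lines:
        line = line.strip()
        if line.startswith("no values:"):
            no_values.add(tuple(line[len("no values:"):].split()))
        elif line.startswith("adding value"):
            # Split off the first three names only; the rest is the value.
            module, dataset, variable, value = line[len("adding value"):].split(None, 3)
            added[(module, dataset, variable)].append(value)
    return no_values, dict(added)


sample = [
    "no values: Morphology GrammaticalMarkers MarkerFusionBinned6",
    "adding value Morphology GrammaticalMarkers MarkerFusionBinned6 concatenative",
    "adding value Morphology GrammaticalMarkers MarkerLocusBinned6 D on H",
    "adding value Morphology GrammaticalMarkers MarkerFusionBinned6 stem",
]

no_values, added = parse_report(sample)
for key, values in sorted(added.items()):
    note = "no values defined" if key in no_values else "some values undefined"
    print("::".join(key), "-", note, "-", values)
```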

xrotwang commented 2 years ago

I'm also not quite sure what's going on with determining variables from the metadata. Here's my attempts at recreating the numbers from the summary stats table at https://github.com/autotyp/autotyp-data#data-coverage

Primary variables:

sqlite> select p.module, count(distinct p.cldf_id) from parametertable as p where kind like '%manual%' group by p.module;
Categories|14
GrammaticalRelations|33
Morphology|66
NP|17
Sentence|48
Word|37

Doesn't quite work out for GrammaticalRelations and Morphology.

This is corroborated by the numbers for derived variables:

sqlite> select p.module, count(distinct p.cldf_id) from parametertable as p where kind not like '%manual%' group by p.module;
Categories|7
GrammaticalRelations|9
Morphology|11
Word|3

While for "covered languages" Categories is the biggest mystery:

sqlite> select p.module, count(distinct v.cldf_languageReference) from parametertable as p, valuetable as v where v.cldf_parameterReference = p.cldf_id group by p.module;
Categories|505
GrammaticalRelations|800
Morphology|956
NP|485
Sentence|468
Word|76

tzakharko commented 2 years ago

> I'm also not quite sure what's going on with determining variables from the metadata. Here's my attempts at recreating the numbers from the summary stats table at https://github.com/autotyp/autotyp-data#data-coverage
>
> Primary variables:
>
> sqlite> select p.module, count(distinct p.cldf_id) from parametertable as p where kind like '%manual%' group by p.module;
> Categories|14
> GrammaticalRelations|33
> Morphology|66
> NP|17
> Sentence|48
> Word|37
>
> Doesn't quite work out for GrammaticalRelations and Morphology.
>
> This is corroborated by the numbers for derived variables:
>
> sqlite> select p.module, count(distinct p.cldf_id) from parametertable as p where kind not like '%manual%' group by p.module;
> Categories|7
> GrammaticalRelations|9
> Morphology|11
> Word|3

There are a few things to consider here IMO:

- Do you count nested data columns as a single variable or do you unnest them before counting? In our summaries, we count the variables inside the nested tables as well (e.g. GrammaticalRelationsRaw::SelectedArguments counts as 11 variables).
- The subdivision of datasets into modules looks like a tree in the export, but it is actually a system of tags. This means that some datasets that are stored under SummariesPerLanguage will count towards GrammaticalRelations or Morphology because that is where they belong thematically. Examples include AlignmentForDefaultPredicatesPerLanguage and MorphologyPerLanguage. You can't really replicate these counts because we don't export the full tag set.
- There is of course a certain degree of inflation, as some variables are copies of others, just with the data reshaped in some other way. We are working on improving how variable dependencies are tracked through aggregations, but it is not a trivial task. The accuracy of the data counts should improve with every release, but there is still a lot of work to do there.
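
To illustrate the second point, here is a small sketch of how counting by storage folder and counting by tag can diverge. The records are hypothetical: the folder and tag assignments are guesses made for illustration, and the real counts are per variable rather than per dataset.

```python
from collections import Counter

# Hypothetical records: (dataset, folder in the export, module tags).
datasets = [
    ("GrammaticalMarkers", "Morphology", ["Morphology"]),
    ("MorphologyPerLanguage", "SummariesPerLanguage", ["Morphology"]),
    ("AlignmentForDefaultPredicatesPerLanguage", "SummariesPerLanguage",
     ["GrammaticalRelations"]),
]

# Tree view: every dataset counts once, under the folder it is stored in.
by_folder = Counter(folder for _, folder, _ in datasets)

# Tag view: every dataset counts once for each module it is tagged with.
by_tag = Counter(tag for _, _, tags in datasets for tag in tags)

print("by folder:", dict(by_folder))
print("by tag:   ", dict(by_tag))
```

A query that groups on the exported module column reproduces the folder view, which is why it cannot match numbers that were computed from the tags.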

You can also have a look at variables_overview.csv; there is a bit more information in there. The summaries we provide in the readme are based on the data that goes into that table.

> While for "covered languages" Categories is the biggest mystery:
>
> sqlite> select p.module, count(distinct v.cldf_languageReference) from parametertable as p, valuetable as v where v.cldf_parameterReference = p.cldf_id group by p.module;
> Categories|505
> GrammaticalRelations|800
> Morphology|956
> NP|485
> Sentence|468
> Word|76

Ah, welp, this is where implicit type conversion strikes again... There was a bug in how Gender data was annotated in the database, and the type system did the rest (it passed the validation layer because the conversion worked fine for the purpose of validation; it only broke when we started counting things). This will be fixed in the next release.

xrotwang commented 2 years ago

Ok, when I store the number of nested variables as dim of the outer parameter, it still doesn't work. In particular the numbers for NP variables seem to be unexplainable. Counting the described variables with grep looks as follows:

$ grep "kind: " raw/autotyp-data/metadata/NP/NPStructure.yaml 
kind: manual data entry
    kind: manual data entry
    kind: manual data entry
    kind: manual data entry
    kind: manual data entry
    kind: manual data entry
    kind: manual data entry
    kind: manual data entry
    kind: manual data entry
    kind: manual data entry
    kind: manual data entry
    kind: manual data entry
        kind: manual data entry
        kind: manual data entry
    kind: manual data entry
    kind: manual data entry
    kind: manual data entry
    kind: manual data entry
    kind: manual data entry
    kind: manual data entry
    kind: manual data entry
    kind: manual data entry
        kind: manual data entry
        kind: manual data entry
    kind: manual data entry

That's 25 rows. Subtracting the module entry, and LID, Glottocode and Language, we get 21, two of which have two nested variables. So that should count as 19 primary variables, and no derived ones, right? Not 12 and 122.
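
The arithmetic, spelled out as a sketch. It assumes that each run of more deeply indented `kind:` lines belongs to exactly one nested field whose own `kind:` line sits at the outer indentation; the list below only reproduces the indentation pattern of the grep output.

```python
from itertools import groupby

# Indentation pattern of the 25 grep hits: 1 module entry, then
# 11 fields, 2 nested, 8 fields, 2 nested, 1 field.
grep_lines = (
    ["kind: manual data entry"]
    + ["    kind: manual data entry"] * 11
    + ["        kind: manual data entry"] * 2
    + ["    kind: manual data entry"] * 8
    + ["        kind: manual data entry"] * 2
    + ["    kind: manual data entry"]
)

# Width of the leading whitespace of every hit.
indents = [len(line) - len(line.lstrip(" ")) for line in grep_lines]

total = len(indents)
# Drop the module entry and the three ID columns (LID, Glottocode, Language).
fields = total - 1 - 3
# Each consecutive run of 8-space lines is assumed to be one nested field.
containers = sum(1 for depth, _ in groupby(indents) if depth == 8)
# Count the nested variables themselves, but not their containers.
primary = fields - containers

print(total, fields, containers, primary)
```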

xrotwang commented 2 years ago

I might get the 12 primary variables for NP, if I don't count the nested variables, and don't count the 2 variables of type logical, the 2 of type comment and the one of type list-of... :)

tzakharko commented 2 years ago

Here are the manually entered variables for NP (all from NPStructure). Entries prefixed with "-" are flagged and not counted:

- LID (not counted, since it's a language ID)
- Glottocode (not counted, since it's a language ID)
- Language (not counted, since it's a language ID)
NPStructureID
- NPStructureExample (not counted, since it's a comment field)
NPStructureType
NPStructureMarkingAssignmentType
NPStructureMarkingType
- NPStructureIsOvertlyMarked (binned variant of NPStructureMarkingType)
NPStructureMarkerID
NPStructureAgreement$NPStructureAgreementCategory
- NPStructureAgreement$NPStructureAgreementMacrocategory (binned variant of the previous variable)
NPStructureWordOrder
- NPStructureHasAlienability (binned variant of NPStructureAlienabilityType)
NPStructureAlienabilityType
NPStructureDependentPoSConstraints
NPStructureDependentSemConstraints
- NPStructureDependentSemConstraintsBinned (binned version of the previous variable)
NPStructureHeadlessness
NPStructureHeadSemConstraints$NPStructureHeadSemConstraints
- NPStructureHeadSemConstraints$NPStructureHeadSemConstraintsBinned (binned version of the previous variable)
- NPStructureHeadAdditionalConstraintsNotes (not counted, since it's a comment field)

Discounting all the flagged fields (comments, language IDs, field variants), that's 12 variables.
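
The same count as a short sketch, with the list above transcribed into (name, flag) pairs. The flag labels are shorthand for the reasons given in the list, and `None` marks a variable that is counted.

```python
# (variable name, reason it is not counted); None means it is counted.
np_structure = [
    ("LID", "language ID"),
    ("Glottocode", "language ID"),
    ("Language", "language ID"),
    ("NPStructureID", None),
    ("NPStructureExample", "comment"),
    ("NPStructureType", None),
    ("NPStructureMarkingAssignmentType", None),
    ("NPStructureMarkingType", None),
    ("NPStructureIsOvertlyMarked", "binned variant"),
    ("NPStructureMarkerID", None),
    ("NPStructureAgreement$NPStructureAgreementCategory", None),
    ("NPStructureAgreement$NPStructureAgreementMacrocategory", "binned variant"),
    ("NPStructureWordOrder", None),
    ("NPStructureHasAlienability", "binned variant"),
    ("NPStructureAlienabilityType", None),
    ("NPStructureDependentPoSConstraints", None),
    ("NPStructureDependentSemConstraints", None),
    ("NPStructureDependentSemConstraintsBinned", "binned variant"),
    ("NPStructureHeadlessness", None),
    ("NPStructureHeadSemConstraints$NPStructureHeadSemConstraints", None),
    ("NPStructureHeadSemConstraints$NPStructureHeadSemConstraintsBinned", "binned variant"),
    ("NPStructureHeadAdditionalConstraintsNotes", "comment"),
]

# Keep only the unflagged entries.
counted = [name for name, flag in np_structure if flag is None]
print(len(counted), "primary variables")
```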

And about computed variables: NPStructurePerLanguage and NPStructurePresence are two huge aggregated tables of NP properties that count towards these stats.

All of this info should be in variables_overview.csv

xrotwang commented 2 years ago

Ah, I see. Ok, will take variables_overview.csv into account, and recount.

tzakharko commented 2 years ago

If it helps, here is the R code used to do the counts. Note that it relies on some data that is not available in the public release. The variable all_typological_variables is basically a subset of the data you see in variables_overview.csv, with language IDs and comments removed.

library(dplyr)   # filter(), rename(), group_by(), summarize(), bind_rows()
library(tidyr)   # separate_rows()
library(tibble)  # tibble()

# approx_number() and all_typological_variables are not defined in this snippet

# Per-module summary: separate_rows() gives one row per module tag, so a
# variable with several tags counts towards each of its modules
report <- all_typological_variables %>%
  separate_rows(modules) %>%
  rename(Module = modules) %>%
  filter(!Module %in% "VerbInflection") %>%
  group_by(Module) %>%
  summarize(
    `Primary variables` = sum(kind == "manual entry"),
    `Derived variables` = sum(kind == "computed"),
    `Number of languages covered` = length(unique(unlist(LIDs))),
    `Number of primary typological datapoints` = approx_number(sum(
      n_values[kind == "manual entry" & dataset_kind == "primary"
    ])),
    .groups = "drop"
  )

# Append a blank spacer row and a "Total" row computed over all variables
report <- bind_rows(
  report,
  tibble(Module = " "),
  summarize(all_typological_variables,
    Module = "Total",
    `Primary variables` = sum(kind == "manual entry"),
    `Derived variables` = sum(kind == "computed"),
    `Number of languages covered` = length(unique(unlist(LIDs))),
    `Number of primary typological datapoints` = approx_number(sum(
      n_values[kind == "manual entry" & dataset_kind == "primary"
    ])),
    .groups = "drop"
  )
)

tzakharko commented 2 years ago

Fixed in v1.0.1

autotyp / autotyp-data

Improve metadata reporting for `value-list` data #30