Synthesis data output - Githubissues

The list of inflectional categories and the list of agreement categories are two logically distinct variables. In the structured data (R dump, JSON), they are stored as nested tables with different dimensions. For example, for Belhare (LID 35):

> filter(MaximallyInflectedVerbSynthesis, LID == 35)$VerbInflectionCategories[[1L]]
# A tibble: 8 × 8
  VerbInflectionCategory VerbInflectionMacrocategory VerbInflectionMarkerPosition VerbInflectionMarkerPositionB…¹ VerbInflectionMarker…² VerbInflectionMarker…³ VerbInflectionMarker…⁴ VerbInflectionMarker…⁵
  <fct>                  <fct>                       <fct>                        <fct>                           <fct>                  <lgl>                  <lgl>                  <lgl>                 
1 TAM                    TAM+                        post                         post                            post                   FALSE                  TRUE                   FALSE                 
2 Polarity               operators                   circum                       in/simul                        simul                  TRUE                   FALSE                  TRUE                  
3 Voice                  valence                     prae                         prae                            prae                   TRUE                   FALSE                  FALSE                 
4 Mood                   TAM+                        prae/post                    split                           split                  TRUE                   TRUE                   FALSE                 
5 Valence                valence                     post                         post                            post                   FALSE                  TRUE                   FALSE                 
6 Aktionsart             TAM+                        post                         post                            post                   FALSE                  TRUE                   FALSE                 
7 Connective             inter-clausal               post                         post                            post                   FALSE                  TRUE                   FALSE                 
8 Semistem               other                       prae                         prae                            prae                   TRUE                   FALSE                  FALSE                 
# ℹ abbreviated names: ¹VerbInflectionMarkerPositionBinned4, ²VerbInflectionMarkerPositionBinned5, ³VerbInflectionMarkerHasPreposedExponent, ⁴VerbInflectionMarkerHasPostposedExponent,
#   ⁵VerbInflectionMarkerHasMultipleExponents

> filter(MaximallyInflectedVerbSynthesis, LID == 35)$VerbAgreement[[1L]]
# A tibble: 2 × 7
  VerbAgreementMicrorelation VerbAgreementMarkerPosition VerbAgreementMarkerPositionBinned4 VerbAgreementMarkerPositionBinned5 VerbAgreementMarkerHasPreposed…¹ VerbAgreementMarkerH…² VerbAgreementMarkerH…³
  <fct>                      <fct>                       <fct>                              <fct>                              <lgl>                            <lgl>                  <lgl>                 
1 A-default                  prae/post                   split                              split                              TRUE                             TRUE                   FALSE                 
2 U-default                  prae/post                   split                              split                              TRUE                             TRUE                   FALSE                 
# ℹ abbreviated names: ¹VerbAgreementMarkerHasPreposedExponent, ²VerbAgreementMarkerHasPostposedExponent, ³VerbAgreementMarkerHasMultipleExponents

Since the CSV files are flat, they cannot directly represent nested data whose tables have different dimensions. For the export we therefore decided to build the Cartesian product of all nested values, which preserves the relations within each table (this is documented in the README, but could probably be made clearer). The two lists are still independent variables, though.
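To make the expansion concrete, here is a minimal sketch in base R. It uses shortened toy tables, not the actual export code, and the way `LID` is attached is an assumption for illustration. If these are the only two nested tables in the dataset, Belhare's 8 inflection rows and 2 agreement rows would expand to 16 CSV rows in the same way.

```r
# Toy stand-ins for the two nested tables of one language.
# The real tables have more rows and columns.
inflection <- data.frame(
  VerbInflectionCategory = c("TAM", "Polarity", "Voice"),
  VerbInflectionMarkerPosition = c("post", "circum", "prae"),
  stringsAsFactors = FALSE
)
agreement <- data.frame(
  VerbAgreementMicrorelation = c("A-default", "U-default"),
  VerbAgreementMarkerPosition = c("prae/post", "prae/post"),
  stringsAsFactors = FALSE
)

# merge() with by = NULL returns the Cartesian product of the two tables:
# every inflection row is paired with every agreement row.
flat <- merge(inflection, agreement, by = NULL)
flat$LID <- 35L  # language ID repeated on every expanded row

nrow(flat)  # 3 * 2 = 6 rows for a single language
```

Each inflection row appears once per agreement row, so the pairing in the flat file carries no linguistic meaning of its own.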

The bottom line is that one really shouldn't be using these CSV files, especially not for complex data. We include them only for folks who prefer to work with spreadsheets instead of R/Python.

Also, it might be more consistent to include VerbInflectionMaxCategoryCount and VerbInflectionMaxCategorySansAgreementCount in a separate SynthesisPerLanguage.csv file. Currently the information is simply the count of rows per language.

MaximallyInflectedVerbSynthesis is already a per-language dataset. The CSV just doesn't look like it because of the nested data expansion.
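As a rough illustration of why the row count per language is misleading in the expanded CSV, the sketch below builds hypothetical flat rows and compares the number of rows with the number of distinct values per variable. It assumes that category names are unique within each nested table, and it makes no claim about how the count variables are actually computed.

```r
# Hypothetical flat rows for one language, as a Cartesian expansion would
# produce them (3 inflection categories x 2 agreement relations).
flat <- expand.grid(
  VerbInflectionCategory = c("TAM", "Polarity", "Voice"),
  VerbAgreementMicrorelation = c("A-default", "U-default"),
  stringsAsFactors = FALSE
)
flat$LID <- 35L

# Rows per language reflect the product of the nested table sizes ...
rows_per_language <- tapply(flat$VerbInflectionCategory, flat$LID, length)

# ... while distinct values per variable recover each table's own size.
n_inflection <- tapply(flat$VerbInflectionCategory, flat$LID,
                       function(x) length(unique(x)))
n_agreement <- tapply(flat$VerbAgreementMicrorelation, flat$LID,
                      function(x) length(unique(x)))
```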

If you think the current way we handle CSVs is confusing, we can look at alternatives. There are four options that come to my mind:

- Cartesian expansion (what we do now): preserves the internal relations between variables but results in duplicated rows.
- No expansion, just dump the tables (this is what we used first, but we changed it since it was deemed confusing): does not duplicate rows but does not preserve the internal relations between the variables.
- Collapse the nested tables: preserves the top-level structure but produces a huge mess of text for the nested data.
- Output multiple CSVs that preserve the relational structure and create dummy keys.
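The last option could look roughly like the following base R sketch. The file names, the `Language` column, and the reuse of `LID` as the linking key are all assumptions made for illustration; nothing like this exists in the current export.

```r
# One row per language; LID doubles as the key into the child tables.
languages <- data.frame(LID = 35L, Language = "Belhare",
                        stringsAsFactors = FALSE)

# One child table per nested variable, each row tagged with its LID.
inflection <- data.frame(
  LID = 35L,
  VerbInflectionCategory = c("TAM", "Polarity", "Voice"),
  stringsAsFactors = FALSE
)
agreement <- data.frame(
  LID = 35L,
  VerbAgreementMicrorelation = c("A-default", "U-default"),
  stringsAsFactors = FALSE
)

# Write each table to its own CSV (hypothetical file names).
out_dir <- tempfile("synthesis_csv_")
dir.create(out_dir)
write.csv(languages, file.path(out_dir, "SynthesisPerLanguage.csv"),
          row.names = FALSE)
write.csv(inflection, file.path(out_dir, "VerbInflectionCategories.csv"),
          row.names = FALSE)
write.csv(agreement, file.path(out_dir, "VerbAgreement.csv"),
          row.names = FALSE)

# A reader joins only the tables they need, so no cross-table duplication.
inflection_back <- read.csv(file.path(out_dir, "VerbInflectionCategories.csv"),
                            stringsAsFactors = FALSE)
joined <- merge(languages, inflection_back, by = "LID")
```

The trade-off is that spreadsheet users would have to do the join themselves, which may defeat the purpose of shipping CSVs for them in the first place.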

Let me know if this answers your question and whether you think we need to change things. I will leave this issue open for now.

autotyp / autotyp-data

Synthesis data output #54