autotyp / autotyp-data

AUTOTYP data export
Creative Commons Attribution 4.0 International

Synthesis data output #54

Open balthasarbickel opened 7 months ago

balthasarbickel commented 7 months ago

Unless I am missing something, I think there is a problem in how the synthesis data are exported. In MaximallyInflectedVerbSynthesis.csv, the agreement categories (e.g. A-default and U-default) don't appear as rows added to the other categories (e.g. Aktionsart, Mood) but as options within each of the other categories, in the same row (e.g. there is a row Aktionsart .... A-default and a row Aktionsart .... U-default). The summary counts under VerbInflectionMaxCategoryCount and VerbInflectionMaxCategorySansAgreementCount seem correct though.

Also, it might be more consistent to include VerbInflectionMaxCategoryCount and VerbInflectionMaxCategorySansAgreementCount in a separate SynthesisPerLanguage.csv file. Currently the information is simply the count of rows per language.

tzakharko commented 7 months ago

The list of inflectional categories and the list of agreement categories are two logically distinct variables. In the structured data (R dump, JSON), these are nested tables with different dimensions. For example, for Belhare:

> filter(MaximallyInflectedVerbSynthesis, LID == 35)$VerbInflectionCategories[[1L]]
# A tibble: 8 × 8
  VerbInflectionCategory VerbInflectionMacrocategory VerbInflectionMarkerPosition VerbInflectionMarkerPositionB…¹ VerbInflectionMarker…² VerbInflectionMarker…³ VerbInflectionMarker…⁴ VerbInflectionMarker…⁵
  <fct>                  <fct>                       <fct>                        <fct>                           <fct>                  <lgl>                  <lgl>                  <lgl>                 
1 TAM                    TAM+                        post                         post                            post                   FALSE                  TRUE                   FALSE                 
2 Polarity               operators                   circum                       in/simul                        simul                  TRUE                   FALSE                  TRUE                  
3 Voice                  valence                     prae                         prae                            prae                   TRUE                   FALSE                  FALSE                 
4 Mood                   TAM+                        prae/post                    split                           split                  TRUE                   TRUE                   FALSE                 
5 Valence                valence                     post                         post                            post                   FALSE                  TRUE                   FALSE                 
6 Aktionsart             TAM+                        post                         post                            post                   FALSE                  TRUE                   FALSE                 
7 Connective             inter-clausal               post                         post                            post                   FALSE                  TRUE                   FALSE                 
8 Semistem               other                       prae                         prae                            prae                   TRUE                   FALSE                  FALSE                 
# ℹ abbreviated names: ¹​VerbInflectionMarkerPositionBinned4, ²​VerbInflectionMarkerPositionBinned5, ³​VerbInflectionMarkerHasPreposedExponent, ⁴​VerbInflectionMarkerHasPostposedExponent,
#   ⁵​VerbInflectionMarkerHasMultipleExponents

> filter(MaximallyInflectedVerbSynthesis, LID == 35)$VerbAgreement[[1L]]
# A tibble: 2 × 7
  VerbAgreementMicrorelation VerbAgreementMarkerPosition VerbAgreementMarkerPositionBinned4 VerbAgreementMarkerPositionBinned5 VerbAgreementMarkerHasPreposed…¹ VerbAgreementMarkerH…² VerbAgreementMarkerH…³
  <fct>                      <fct>                       <fct>                              <fct>                              <lgl>                            <lgl>                  <lgl>                 
1 A-default                  prae/post                   split                              split                              TRUE                             TRUE                   FALSE                 
2 U-default                  prae/post                   split                              split                              TRUE                             TRUE                   FALSE                 
# ℹ abbreviated names: ¹​VerbAgreementMarkerHasPreposedExponent, ²​VerbAgreementMarkerHasPostposedExponent, ³​VerbAgreementMarkerHasMultipleExponents

Since the CSV files are flat, they cannot directly represent this kind of nested data, where the nested tables have different dimensions. So for the export we decided to build a Cartesian product of all nested values to preserve the relations (this is documented in the README, but could probably be made clearer). That is why every inflection category shows up once per agreement category in the same row. The two are still independent variables.
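To illustrate the expansion, here is a minimal sketch in plain Python. It uses the Belhare values from the tibbles above and keeps only one column per nested table. The `flatten` helper and the three-column row layout are assumptions made for illustration; this is not the actual export code.

```python
from itertools import product

# Values copied from the Belhare (LID 35) tibbles above; all other columns
# are left out to keep the sketch short.
inflection_categories = [
    "TAM", "Polarity", "Voice", "Mood",
    "Valence", "Aktionsart", "Connective", "Semistem",
]
agreement_microrelations = ["A-default", "U-default"]


def flatten(lid, categories, agreements):
    """Hypothetical helper: cross every inflection category with every
    agreement microrelation, yielding one flat row per combination."""
    return [
        {
            "LID": lid,
            "VerbInflectionCategory": category,
            "VerbAgreementMicrorelation": agreement,
        }
        for category, agreement in product(categories, agreements)
    ]


flat_rows = flatten(35, inflection_categories, agreement_microrelations)

print(len(flat_rows))  # 8 categories x 2 agreement rows = 16 flat rows
for row in flat_rows:
    if row["VerbInflectionCategory"] == "Aktionsart":
        print(row)
```

The two printed Aktionsart rows pair the category once with A-default and once with U-default. The pairing does not mean that agreement is an option within Aktionsart; it is only a side effect of crossing the two tables.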

The bottom line is that one really shouldn't be using these CSV files, especially not for complex data. We include them only for folks who prefer to work with spreadsheets instead of R/Python.

> Also, it might be more consistent to include VerbInflectionMaxCategoryCount and VerbInflectionMaxCategorySansAgreementCount in a separate SynthesisPerLanguage.csv file. Currently the information is simply the count of rows per language.

MaximallyInflectedVerbSynthesis is already a per-language dataset. The CSV just doesn't look like it because of the nested data expansion.
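For anyone who only has the flat CSV, the per-language view can be recovered by grouping on LID and de-duplicating each variable separately. The sketch below is plain Python over hypothetical rows shaped like the expanded export (two columns only, three categories for brevity). The `collapse` helper is an assumption for illustration and is not part of the AUTOTYP tooling.

```python
from itertools import product

# Hypothetical flat rows in the shape of the expanded CSV (two columns only).
flat_rows = [
    {"LID": 35, "VerbInflectionCategory": c, "VerbAgreementMicrorelation": a}
    for c, a in product(["TAM", "Polarity", "Mood"], ["A-default", "U-default"])
]


def collapse(rows):
    """Hypothetical helper: group flat rows by LID and de-duplicate the two
    independent variables separately, preserving first-seen order."""
    grouped = {}
    for row in rows:
        entry = grouped.setdefault(
            row["LID"],
            {"VerbInflectionCategories": {}, "VerbAgreement": {}},
        )
        # Dict keys serve as an insertion-ordered set.
        entry["VerbInflectionCategories"][row["VerbInflectionCategory"]] = None
        entry["VerbAgreement"][row["VerbAgreementMicrorelation"]] = None
    return {
        lid: {name: list(values) for name, values in entry.items()}
        for lid, entry in grouped.items()
    }


per_language = collapse(flat_rows)
print(per_language)
```

With the full export, each nested table has several columns, so de-duplication would need to key on the tuple of all columns belonging to that table rather than on a single value.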

If you think the current way we handle CSVs is confusing, we can look at alternatives. There are four options that come to my mind:

Let me know if this answers your question and whether you think we need to change things. I will leave this issue open for now.