Open balthasarbickel opened 7 months ago
The list of inflectional categories and the list of agreement categories are two logically distinct variables. In the structured data (R dump, JSON), these are nested tables with different dimensions. E.g. for Belhare:
> filter(MaximallyInflectedVerbSynthesis, LID == 35)$VerbInflectionCategories[[1L]]
# A tibble: 8 × 8
VerbInflectionCategory VerbInflectionMacrocategory VerbInflectionMarkerPosition VerbInflectionMarkerPositionB…¹ VerbInflectionMarker…² VerbInflectionMarker…³ VerbInflectionMarker…⁴ VerbInflectionMarker…⁵
<fct> <fct> <fct> <fct> <fct> <lgl> <lgl> <lgl>
1 TAM TAM+ post post post FALSE TRUE FALSE
2 Polarity operators circum in/simul simul TRUE FALSE TRUE
3 Voice valence prae prae prae TRUE FALSE FALSE
4 Mood TAM+ prae/post split split TRUE TRUE FALSE
5 Valence valence post post post FALSE TRUE FALSE
6 Aktionsart TAM+ post post post FALSE TRUE FALSE
7 Connective inter-clausal post post post FALSE TRUE FALSE
8 Semistem other prae prae prae TRUE FALSE FALSE
# ℹ abbreviated names: ¹VerbInflectionMarkerPositionBinned4, ²VerbInflectionMarkerPositionBinned5, ³VerbInflectionMarkerHasPreposedExponent, ⁴VerbInflectionMarkerHasPostposedExponent,
# ⁵VerbInflectionMarkerHasMultipleExponents
> filter(MaximallyInflectedVerbSynthesis, LID == 35)$VerbAgreement[[1L]]
# A tibble: 2 × 7
VerbAgreementMicrorelation VerbAgreementMarkerPosition VerbAgreementMarkerPositionBinned4 VerbAgreementMarkerPositionBinned5 VerbAgreementMarkerHasPreposed…¹ VerbAgreementMarkerH…² VerbAgreementMarkerH…³
<fct> <fct> <fct> <fct> <lgl> <lgl> <lgl>
1 A-default prae/post split split TRUE TRUE FALSE
2 U-default prae/post split split TRUE TRUE FALSE
# ℹ abbreviated names: ¹VerbAgreementMarkerHasPreposedExponent, ²VerbAgreementMarkerHasPostposedExponent, ³VerbAgreementMarkerHasMultipleExponents
Since the CSV files are flat, they cannot directly represent this kind of differently nested data. So for the export we decided to build a Cartesian product of all nested values to preserve the relations (this is documented in the README, but could probably be made clearer). But these are still independent variables.
The bottom line here is that one really shouldn't be using these CSV files, especially not for complex data. We include them only for folks who prefer to work with spreadsheets instead of R/Python.
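To illustrate what the flattening does, here is a minimal Python sketch. It is not the actual export code. It uses only the category and microrelation names from the Belhare output above and omits all other columns, and the dictionary keys are assumed to match the CSV column names.

```python
from itertools import product

# Values copied from the Belhare (LID 35) output above; other columns omitted.
inflection_categories = [
    "TAM", "Polarity", "Voice", "Mood",
    "Valence", "Aktionsart", "Connective", "Semistem",
]
agreement_microrelations = ["A-default", "U-default"]

# Flattening: one row per (inflection category, agreement microrelation) pair.
# The key names are assumptions about the CSV header, not confirmed here.
flat_rows = [
    {"LID": 35, "VerbInflectionCategory": cat, "VerbAgreementMicrorelation": agr}
    for cat, agr in product(inflection_categories, agreement_microrelations)
]


def unique_in_order(values):
    """Drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(values))


# Each variable is recovered independently by deduplicating its own column.
recovered_categories = unique_in_order(r["VerbInflectionCategory"] for r in flat_rows)
recovered_agreement = unique_in_order(r["VerbAgreementMicrorelation"] for r in flat_rows)

print(len(flat_rows))             # 8 categories x 2 microrelations = 16 rows
print(len(recovered_categories))  # 8
print(len(recovered_agreement))   # 2
```

A row such as Aktionsart / A-default therefore says nothing about how the two values relate to each other. It is only one cell of the 8 × 2 grid.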
Also, it might be more consistent to include VerbInflectionMaxCategoryCount and VerbInflectionMaxCategorySansAgreementCount in a separate SynthesisPerLanguage.csv file. Currently the information is simply the count of rows per language.
MaximallyInflectedVerbSynthesis is already a per-language dataset. The CSV just doesn't look like it because of the nested data expansion.
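For anyone who still wants to start from the CSV, the expansion can be collapsed back to one record per language. Below is a rough Python sketch using only the standard library. The sample text is a made-up excerpt, not real file contents; the column names are assumed to match the R names shown above; and treating an empty cell as "no value" is also an assumption.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt shaped like the flattened CSV (not the real file).
sample_csv = """LID,VerbInflectionCategory,VerbAgreementMicrorelation
35,TAM,A-default
35,TAM,U-default
35,Mood,A-default
35,Mood,U-default
"""


def collapse_per_language(handle):
    """Collapse the expanded rows into one record per language (LID)."""
    # Dicts are used as ordered sets: keys keep first-seen order.
    categories = defaultdict(dict)
    agreement = defaultdict(dict)
    for row in csv.DictReader(handle):
        lid = int(row["LID"])
        cats, agrs = categories[lid], agreement[lid]
        if row["VerbInflectionCategory"]:      # assumed: empty cell = no value
            cats[row["VerbInflectionCategory"]] = None
        if row["VerbAgreementMicrorelation"]:  # assumed: empty cell = no value
            agrs[row["VerbAgreementMicrorelation"]] = None
    return {
        lid: {"categories": list(categories[lid]), "agreement": list(agreement[lid])}
        for lid in categories
    }


per_language = collapse_per_language(io.StringIO(sample_csv))
print(per_language[35])
# {'categories': ['TAM', 'Mood'], 'agreement': ['A-default', 'U-default']}
```

With a real file, `io.StringIO(sample_csv)` would be replaced by an open file handle. Deduplicating each column separately is what keeps the two variables independent.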
If you think the current way we handle CSVs is confusing, we can look at alternatives. There are four options that come to my mind:
Let me know if this answers your question and whether you think we need to change things. I will leave this issue open for now.
Unless I am missing something, I think there is a problem in how the synthesis data are exported. In MaximallyInflectedVerbSynthesis.csv, the agreement categories (e.g. A-default and U-default) don't appear as rows added to the other categories (e.g. Aktionsart, Mood) but as options within each of the other categories, in the same row (e.g. there is a row Aktionsart .... A-default and a row Aktionsart .... U-default). The summary counts under VerbInflectionMaxCategoryCount and VerbInflectionMaxCategorySansAgreementCount seem correct though.
Also, it might be more consistent to include VerbInflectionMaxCategoryCount and VerbInflectionMaxCategorySansAgreementCount in a separate SynthesisPerLanguage.csv file. Currently the information is simply the count of rows per language.