autotyp / autotyp-data

AUTOTYP data export
Creative Commons Attribution 4.0 International

Wrong data type specified for multiple variables in 1.0.0 #38

Closed: xrotwang closed this issue 2 years ago

xrotwang commented 2 years ago

The metadata specifies data: integer

  CaseMarkerExpressesMultipleCategories:
    description: >-
      Value of `GrammaticalMarkers::MarkerExpressesMultipleCategories` for exemplar "Case"
    kind: computed data (aggregation-scripts/GrammaticalMarkers.R)
    data: integer

but the JSON data is boolean, i.e. logical:

$ grep CaseMarkerExpressesMultipleCategories raw/autotyp-data/data/json/PerLanguageSummaries/GrammaticalMarkersPerLanguage.json -m 5
    "CaseMarkerExpressesMultipleCategories": false,
    "CaseMarkerExpressesMultipleCategories": false,
    "CaseMarkerExpressesMultipleCategories": true,
    "CaseMarkerExpressesMultipleCategories": false,
    "CaseMarkerExpressesMultipleCategories": true,
...
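A mismatch like this can be detected mechanically. The following is a minimal sketch, not the AUTOTYP validation layer: it assumes the metadata has already been reduced to a hypothetical `declared` mapping from variable name to type name, and that the JSON file decodes to a list of records. The helper names `json_kind` and `find_mismatches` and the type names `double`, `string` and `other` are assumptions made for illustration; only `integer` and `logical` appear in this thread.

```python
import json


def json_kind(value):
    """Map a decoded JSON value to a coarse metadata type name (names are assumed)."""
    # bool must be tested before int: in Python, bool is a subclass of int,
    # so isinstance(True, int) is True.
    if isinstance(value, bool):
        return "logical"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    return "other"


def find_mismatches(declared, records):
    """Return {variable: observed kinds} for variables whose values disagree with the declared type."""
    observed = {}
    for record in records:
        for name, value in record.items():
            # Skip missing values and variables without a declared type.
            if value is None or name not in declared:
                continue
            observed.setdefault(name, set()).add(json_kind(value))
    return {
        name: kinds
        for name, kinds in observed.items()
        if kinds != {declared[name]}
    }


# Hypothetical stand-ins for the metadata entry and the JSON data quoted above.
declared = {"CaseMarkerExpressesMultipleCategories": "integer"}
records = json.loads(
    '[{"CaseMarkerExpressesMultipleCategories": false},'
    ' {"CaseMarkerExpressesMultipleCategories": true}]'
)

mismatches = find_mismatches(declared, records)
print(mismatches)  # {'CaseMarkerExpressesMultipleCategories': {'logical'}}
```

The explicit `bool` check comes first because a plain `isinstance(value, int)` test would accept `true`/`false` as integers and hide exactly the error reported here.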
xrotwang commented 2 years ago

Same for three more variables:

xrotwang commented 2 years ago

The metadata declares logical, but the values are strings from a value list:

  MarkerPositionForADefault:
    description: >-
      GrammaticalMarkers::MarkerPosition value for ADefault
    kind: computed data (aggregation-scripts/VerbInflectionPerLanguage.R)
    data: logical

with values like

$ grep MarkerPositionForADefault raw/autotyp-data/data/json/PerLanguageSummaries/* -m 5
raw/autotyp-data/data/json/PerLanguageSummaries/VerbAgreementAggregatedByMarkerPosition.json:    "MarkerPositionForADefault": "prae/post",
raw/autotyp-data/data/json/PerLanguageSummaries/VerbAgreementAggregatedByMarkerPosition.json:    "MarkerPositionForADefault": "post",
raw/autotyp-data/data/json/PerLanguageSummaries/VerbAgreementAggregatedByMarkerPosition.json:    "MarkerPositionForADefault": "post",
raw/autotyp-data/data/json/PerLanguageSummaries/VerbAgreementAggregatedByMarkerPosition.json:    "MarkerPositionForADefault": "Wackernagel",
raw/autotyp-data/data/json/PerLanguageSummaries/VerbAgreementAggregatedByMarkerPosition.json:    "MarkerPositionForADefault": "prae",
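To see which string values a variable declared as logical actually takes, the distinct values can be collected per variable. This is only a sketch under assumptions: the `records` list below is a hypothetical stand-in built from the five grep hits above, the helper name `observed_value_lists` is invented for illustration, and the result shows observed values only, not the authoritative value list.

```python
from collections import defaultdict


def observed_value_lists(records, variables):
    """Collect the distinct string values seen for each of the given variables."""
    values = defaultdict(set)
    for record in records:
        for name in variables:
            value = record.get(name)
            # Only strings are candidate value-list entries; skip missing values.
            if isinstance(value, str):
                values[name].add(value)
    return {name: sorted(vals) for name, vals in values.items()}


# Hypothetical records mirroring the five grep hits above.
records = [
    {"MarkerPositionForADefault": "prae/post"},
    {"MarkerPositionForADefault": "post"},
    {"MarkerPositionForADefault": "post"},
    {"MarkerPositionForADefault": "Wackernagel"},
    {"MarkerPositionForADefault": "prae"},
]

value_lists = observed_value_lists(records, ["MarkerPositionForADefault"])
print(value_lists)
```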
xrotwang commented 2 years ago

logical -> value-list:

            'MarkerPositionForADefault',
            'MarkerPositionForUDefault',
            'MarkerPositionForSDefault',
            'MarkerPositionForBDefault',
            'MarkerPositionForPOSSDefault',
            'MarkerPositionForARGNom',
            'MarkerPositionForOAdp',
            'MarkerPositionForAFF',
            'MarkerPositionForTDefault',
            'MarkerPositionForIDefault',
            'MarkerPositionForGDefault',
            'MarkerPositionForCore',
            'MarkerPositionForPat',
            'MarkerPositionForUBDefault',
xrotwang commented 2 years ago

Same for MarkerPositionBinned4/5.

tzakharko commented 2 years ago

@xrotwang I have pushed a new branch, https://github.com/autotyp/autotyp-data/tree/fixes-1.0.1, that should fix all of these issues, including the missing value-list definitions reported in #30, and probably more. It should also fix #35 and #36, as well as variables that have no values at all (as is the case for GrammaticalRelations::SyntacticDomainCondition, #25).

Could you run things on your side and check whether these errors have disappeared?

bambooforest commented 2 years ago

@tzakharko -- from a potential user's perspective, will AUTOTYP data integrity testing be handled by conversion to CLDF in the future?

tzakharko commented 2 years ago

@bambooforest We have our own data validation layer, but it was only partially enabled for this release so it didn’t catch a class of problems. I plan to make the full pipeline public in due time once I clean things up.

There are probably some checks that the CLDF library does that we don’t do, and vice versa, so validation during conversion will serve as a great sanity check.

bambooforest commented 2 years ago

Sounds good. Once the data validation layer is public, I suppose it'll be easy to add unit tests to the pipeline. Looking forward!

xrotwang commented 2 years ago

Will give it a spin later today.

tzakharko commented 2 years ago

Fixed in v1.0.1