concepticon / norare-data

Cross-Linguistic Norms, Ratings, and Relations for Words and Concepts

uniqueness of columns by description #97

Closed LinguList closed 4 years ago

LinguList commented 4 years ago

I just added a check; it reports the following problems:

$ norare check
INFO    norare at /home/mattis/data/datasets/cldf/concepticon
INFO    checking Bond-2013-OMW
WARNING non-unique value ---- in Bond-2013-OMW / pos
WARNING non-unique value ---- in Bond-2013-OMW / wordnet_id
WARNING non-unique value ---- in Bond-2013-OMW / nltk_name
WARNING non-unique value ---- in Bond-2013-OMW / hypernyms
WARNING non-unique value ---- in Bond-2013-OMW / hyponyms
WARNING non-unique value ---- in Bond-2013-OMW / hypernym_names
WARNING non-unique value ---- in Bond-2013-OMW / hyponym_names
INFO    checking Alonso-2015-AoA
WARNING non-unique value ---- in Alonso-2015-AoA / spanish_pos
INFO    checking Brysbaert-2009-Frequency
INFO    checking Brysbaert-2011-Frequency
INFO    checking Brysbaert-2014-Concreteness
WARNING non-unique value ---- in Brysbaert-2014-Concreteness / english_pos
INFO    checking Brysbaert-2019-Prevalence
INFO    checking Cai-2010-Frequency
INFO    checking Cuetos-2011-Frequency
INFO    checking Desrochers-2009-SubjFrequency
INFO    checking Engelthaler-2018-Humor
INFO    checking Juhasz-2013-SER
INFO    checking Keuleers-2010-Frequency
INFO    checking Kuperman-2012-AoA
INFO    checking Riegel-2015-AffectiveRatings
WARNING non-unique value ---- in Riegel-2015-AffectiveRatings / german
WARNING non-unique value ---- in Riegel-2015-AffectiveRatings / polish_valence_mean_male
INFO    checking Scott-2019-Ratings
INFO    checking StadthagenGonzalez-2017-ValenceArousal
INFO    checking Starostin-2000-Sense
INFO    checking Warriner-2013-AffectiveRatings
INFO    checking Cortese-2008-AoA
INFO    checking Keuleers-2012-LexicalDecision
WARNING non-unique value ---- in Keuleers-2012-LexicalDecision / english_accuracy_mean
INFO    checking Ferrand-2010-LexicalDecision
INFO    checking GonzalezNosti-2014-LexicalDecision
INFO    checking Tsang-2018-LexicalDecision
INFO    checking Keuleers-2015-Prevalence
INFO    checking StadthagenGonzalez-2018-DiscreteEmotions
WARNING non-unique value ---- in StadthagenGonzalez-2018-DiscreteEmotions / spanish_pos
INFO    checking Alonso-2016-AoA
INFO    checking Imbir-2016-Ratings
WARNING non-unique value ---- in Imbir-2016-Ratings / english
INFO    checking Ferre-2017-DiscreteEmotions
WARNING non-unique value ---- in Ferre-2017-DiscreteEmotions / english
INFO    checking Wierzba-2015-DiscreteEmotions
WARNING non-unique value ---- in Wierzba-2015-DiscreteEmotions / german
INFO    checking Alonso-2011-OralFrequency
INFO    checking Lynott-2019-Sensorimotor
INFO    checking Kapucu-2018-EmotionRatings
INFO    checking Briesemeister-2011-DiscreteEmotions
INFO    checking Mandera-2015-Frequency
WARNING non-unique value ---- in Mandera-2015-Frequency / polish_pos
INFO    checking Moors-2013-Ratings
WARNING non-unique value ---- in Moors-2013-Ratings / english
WARNING non-unique value ---- in Moors-2013-Ratings / dutch_dominance_mean
WARNING non-unique value ---- in Moors-2013-Ratings / dutch_dominance_male_mean
WARNING non-unique value ---- in Moors-2013-Ratings / dutch_dominance_female_mean
WARNING non-unique value ---- in Moors-2013-Ratings / dutch_pos
INFO    checking Wu-2020-CoreVocabulary
INFO    checking Mohammad-2018-AffectiveRatings
INFO    checking Mohammad-2018-EmotionIntensity
INFO    checking Clark-2004-ImageryFamiliarity
WARNING non-unique value ---- in Clark-2004-ImageryFamiliarity / pavio_norms
INFO    checking Abdaoui-2017-EmoLex
INFO    checking Matisoff-2015-STEDT
WARNING non-unique value ---- in Matisoff-2015-STEDT / stedt_id
INFO    checking Kiss-1973-EAT
WARNING non-unique value ---- in Kiss-1973-EAT / node
WARNING non-unique value ---- in Kiss-1973-EAT / edges
INFO    checking Wikidata
WARNING non-unique value ---- in Wikidata / wikidata_label
INFO    checking Webster
INFO    checking OmegaWiki
INFO    checking LEGO
INFO    checking Babelnet
WARNING non-unique value ---- in Babelnet / babelnet_id
INFO    checking Numerals
INFO    checking Luniewska-2016-299
WARNING non-unique value ---- in Luniewska-2016-299 / gloss
WARNING non-unique value ---- in Luniewska-2016-299 / afrikaans
WARNING non-unique value ---- in Luniewska-2016-299 / catalan
WARNING non-unique value ---- in Luniewska-2016-299 / danish
WARNING non-unique value ---- in Luniewska-2016-299 / dutch
WARNING non-unique value ---- in Luniewska-2016-299 / english
WARNING non-unique value ---- in Luniewska-2016-299 / finnish
WARNING non-unique value ---- in Luniewska-2016-299 / german
WARNING non-unique value ---- in Luniewska-2016-299 / greek
WARNING non-unique value ---- in Luniewska-2016-299 / hebrew
WARNING non-unique value ---- in Luniewska-2016-299 / hungarian
WARNING non-unique value ---- in Luniewska-2016-299 / icelandic
WARNING non-unique value ---- in Luniewska-2016-299 / irish
WARNING non-unique value ---- in Luniewska-2016-299 / xhosa
WARNING non-unique value ---- in Luniewska-2016-299 / italian
WARNING non-unique value ---- in Luniewska-2016-299 / lithuanian
WARNING non-unique value ---- in Luniewska-2016-299 / luxembourgish
WARNING non-unique value ---- in Luniewska-2016-299 / maltese
WARNING non-unique value ---- in Luniewska-2016-299 / polish
WARNING non-unique value ---- in Luniewska-2016-299 / russian
WARNING non-unique value en-ratings-AoA-mean- in Luniewska-2016-299 / southafricanenglish_aoa
WARNING non-unique value ---- in Luniewska-2016-299 / southafricanenglish
WARNING non-unique value ---- in Luniewska-2016-299 / serbian
WARNING non-unique value ---- in Luniewska-2016-299 / slovak
WARNING non-unique value ---- in Luniewska-2016-299 / spanish
WARNING non-unique value ---- in Luniewska-2016-299 / swedish
WARNING non-unique value ---- in Luniewska-2016-299 / turkish
INFO    checking Baroni-2011-200
WARNING non-unique value ---- in Baroni-2011-200 / english
INFO    checking Luniewska-2019-299
WARNING non-unique value ---- in Luniewska-2019-299 / gloss
WARNING non-unique value ---- in Luniewska-2019-299 / afrikaans
WARNING non-unique value ---- in Luniewska-2019-299 / americanenglish
WARNING non-unique value ---- in Luniewska-2019-299 / catalan
WARNING non-unique value ---- in Luniewska-2019-299 / czech
WARNING non-unique value ---- in Luniewska-2019-299 / danish
WARNING non-unique value ---- in Luniewska-2019-299 / dutch
WARNING non-unique value en-ratings-AoA-mean- in Luniewska-2019-299 / english_aoa
WARNING non-unique value ---- in Luniewska-2019-299 / english
WARNING non-unique value ---- in Luniewska-2019-299 / finnish
WARNING non-unique value ---- in Luniewska-2019-299 / gaelic
WARNING non-unique value ---- in Luniewska-2019-299 / german
WARNING non-unique value ---- in Luniewska-2019-299 / greek
WARNING non-unique value ---- in Luniewska-2019-299 / hebrew
WARNING non-unique value ---- in Luniewska-2019-299 / hungarian
WARNING non-unique value ---- in Luniewska-2019-299 / icelandic
WARNING non-unique value ---- in Luniewska-2019-299 / irish
WARNING non-unique value ---- in Luniewska-2019-299 / xhosa
WARNING non-unique value ---- in Luniewska-2019-299 / italian
WARNING non-unique value ---- in Luniewska-2019-299 / lebanesearabic
WARNING non-unique value ---- in Luniewska-2019-299 / lithuanian
WARNING non-unique value ---- in Luniewska-2019-299 / luxembourgish
WARNING non-unique value ---- in Luniewska-2019-299 / maltese
WARNING non-unique value ---- in Luniewska-2019-299 / malay
WARNING non-unique value ---- in Luniewska-2019-299 / persian
WARNING non-unique value ---- in Luniewska-2019-299 / polish
WARNING non-unique value ---- in Luniewska-2019-299 / russian
WARNING non-unique value en-ratings-AoA-mean- in Luniewska-2019-299 / southafricanenglish_aoa
WARNING non-unique value ---- in Luniewska-2019-299 / southafricanenglish
WARNING non-unique value ---- in Luniewska-2019-299 / serbian
WARNING non-unique value ---- in Luniewska-2019-299 / slovak
WARNING non-unique value ---- in Luniewska-2019-299 / spanish
WARNING non-unique value ---- in Luniewska-2019-299 / swedish
WARNING non-unique value ---- in Luniewska-2019-299 / turkish
WARNING non-unique value ---- in Luniewska-2019-299 / westernarmenian
INFO    checking Monnier-2014-1031
WARNING non-unique value ---- in Monnier-2014-1031 / french
WARNING non-unique value ---- in Monnier-2014-1031 / english
WARNING non-unique value ---- in Monnier-2014-1031 / pos
WARNING non-unique value ---- in Monnier-2014-1031 / picture_source
WARNING non-unique value ---- in Monnier-2014-1031 / picture_number
WARNING non-unique value ---- in Monnier-2014-1031 / number_letters
WARNING non-unique value ---- in Monnier-2014-1031 / number_phonemes
WARNING non-unique value ---- in Monnier-2014-1031 / number_syllables
WARNING non-unique value ---- in Monnier-2014-1031 / subtlex_freq
WARNING non-unique value ---- in Monnier-2014-1031 / boks_freq
WARNING non-unique value ---- in Monnier-2014-1031 / imageability_mean
INFO    checking Winter-2016-300
WARNING non-unique value ---- in Winter-2016-300 / english
WARNING non-unique value ---- in Winter-2016-300 / random_set
WARNING non-unique value ---- in Winter-2016-300 / participants
INFO    checking Yao-2017-1100
WARNING non-unique value ---- in Yao-2017-1100 / english
WARNING non-unique value ---- in Yao-2017-1100 / chinese
INFO    checking Verheyen-2019-1000
WARNING non-unique value ---- in Verheyen-2019-1000 / dutch
WARNING non-unique value ---- in Verheyen-2019-1000 / english
WARNING non-unique value ---- in Verheyen-2019-1000 / nchar
WARNING non-unique value ---- in Verheyen-2019-1000 / syllable_length
WARNING non-unique value ---- in Verheyen-2019-1000 / bigram
WARNING non-unique value ---- in Verheyen-2019-1000 / neighbor
WARNING non-unique value ---- in Verheyen-2019-1000 / freq_celex
WARNING non-unique value ---- in Verheyen-2019-1000 / freq_subtlex
INFO    checking Lynott-2009-423
WARNING non-unique value ---- in Lynott-2009-423 / english
WARNING non-unique value ---- in Lynott-2009-423 / familiarity
WARNING non-unique value ---- in Lynott-2009-423 / visual_sd
WARNING non-unique value ---- in Lynott-2009-423 / haptic_sd
WARNING non-unique value ---- in Lynott-2009-423 / auditory_sd
WARNING non-unique value ---- in Lynott-2009-423 / olfactory_sd
WARNING non-unique value ---- in Lynott-2009-423 / gustatory_sd
WARNING non-unique value ---- in Lynott-2009-423 / bnc_freq
WARNING non-unique value ---- in Lynott-2009-423 / bnc_log_freq
WARNING non-unique value ---- in Lynott-2009-423 / word_length
INFO    checking Maciejewski-2016-100
WARNING non-unique value ---- in Maciejewski-2016-100 / english
WARNING non-unique value ---- in Maciejewski-2016-100 / homophone_meaning_number
WARNING non-unique value ---- in Maciejewski-2016-100 / freq_meaning_1_american_en
WARNING non-unique value ---- in Maciejewski-2016-100 / freq_raw_bnc
WARNING non-unique value ---- in Maciejewski-2016-100 / freq_log
WARNING non-unique value ---- in Maciejewski-2016-100 / bigram_log_bnc
WARNING non-unique value ---- in Maciejewski-2016-100 / number_wordsense_wordnet
WARNING non-unique value ---- in Maciejewski-2016-100 / semantic_diversity
WARNING non-unique value ---- in Maciejewski-2016-100 / imageability
WARNING non-unique value ---- in Maciejewski-2016-100 / concreteness
WARNING non-unique value ---- in Maciejewski-2016-100 / familiarity
INFO    checking Lynott-2013-400
WARNING non-unique value ---- in Lynott-2013-400 / english
WARNING non-unique value ---- in Lynott-2013-400 / auditory_sd
WARNING non-unique value ---- in Lynott-2013-400 / gustatory_sd
WARNING non-unique value ---- in Lynott-2013-400 / haptic_sd
WARNING non-unique value ---- in Lynott-2013-400 / olfactory_sd
WARNING non-unique value ---- in Lynott-2013-400 / visual_sd
INFO    checking Izura-2005-499
WARNING non-unique value ---- in Izura-2005-499 / spanish
WARNING non-unique value ---- in Izura-2005-499 / english
WARNING non-unique value ---- in Izura-2005-499 / word_length
INFO    checking Pagel-2018-200
WARNING non-unique value ---- in Pagel-2018-200 / english
INFO    checking Rzymski-2020-1624
WARNING non-unique value ---- in Rzymski-2020-1624 / english
INFO    checking DiezAlamo-2018-750
WARNING non-unique value ---- in DiezAlamo-2018-750 / spanish
WARNING non-unique value ---- in DiezAlamo-2018-750 / english
WARNING non-unique value ---- in DiezAlamo-2018-750 / body_object_interaction
WARNING non-unique value ---- in DiezAlamo-2018-750 / concreteness
WARNING non-unique value ---- in DiezAlamo-2018-750 / imageability
WARNING non-unique value ---- in DiezAlamo-2018-750 / semantic_category
INFO    checking Xiao-2012-213
WARNING non-unique value ---- in Xiao-2012-213 / chinese
WARNING non-unique value ---- in Xiao-2012-213 / pinyin
WARNING non-unique value ---- in Xiao-2012-213 / english
WARNING non-unique value ---- in Xiao-2012-213 / curriculum_order
WARNING non-unique value ---- in Xiao-2012-213 / freq_per_million
INFO    checking Lewis-2016-499
WARNING non-unique value ---- in Lewis-2016-499 / english
WARNING non-unique value ---- in Lewis-2016-499 / nchars
WARNING non-unique value ---- in Lewis-2016-499 / _confidenceintervalhigh
WARNING non-unique value ---- in Lewis-2016-499 / _confidenceintervallow
INFO    checking Pagel-2007-200
WARNING non-unique value ---- in Pagel-2007-200 / english
WARNING non-unique value ---- in Pagel-2007-200 / part_of_speech
WARNING non-unique value ---- in Pagel-2007-200 / states
WARNING non-unique value ---- in Pagel-2007-200 / mean_rate
WARNING non-unique value ---- in Pagel-2007-200 / rate_sd
WARNING non-unique value ---- in Pagel-2007-200 / english_frequency
WARNING non-unique value ---- in Pagel-2007-200 / spanish_frequency
WARNING non-unique value ---- in Pagel-2007-200 / russian_frequency
WARNING non-unique value ---- in Pagel-2007-200 / greek_frequency
INFO    checking Schroeder-2012-824
WARNING non-unique value ---- in Schroeder-2012-824 / german
WARNING non-unique value ---- in Schroeder-2012-824 / english
WARNING non-unique value ---- in Schroeder-2012-824 / typicality_sd
WARNING non-unique value ---- in Schroeder-2012-824 / aoa_sd
WARNING non-unique value ---- in Schroeder-2012-824 / familiarity_sd
WARNING non-unique value ---- in Schroeder-2012-824 / freq_dlex
WARNING non-unique value ---- in Schroeder-2012-824 / freq_log10_dlex
WARNING non-unique value ---- in Schroeder-2012-824 / word_length_phoneme
WARNING non-unique value ---- in Schroeder-2012-824 / word_length_syllable
INFO    checking Gampe-2017-48
WARNING non-unique value ---- in Gampe-2017-48 / english
WARNING non-unique value ---- in Gampe-2017-48 / code
INFO    checking Hill-2015-999
WARNING non-unique value ---- in Hill-2015-999 / english
WARNING non-unique value ---- in Hill-2015-999 / wordpair
WARNING non-unique value ---- in Hill-2015-999 / pos
INFO    checking Dellert-2018-1016
WARNING non-unique value ---- in Dellert-2018-1016 / gloss
WARNING non-unique value ---- in Dellert-2018-1016 / german
WARNING non-unique value ---- in Dellert-2018-1016 / nel_id
WARNING non-unique value ---- in Dellert-2018-1016 / lgc_sd
WARNING non-unique value ---- in Dellert-2018-1016 / ranking_value
INFO    checking Desrochers-2010-330
WARNING non-unique value ---- in Desrochers-2010-330 / spanish
WARNING non-unique value ---- in Desrochers-2010-330 / english
WARNING non-unique value ---- in Desrochers-2010-330 / number_letters
WARNING non-unique value ---- in Desrochers-2010-330 / freq_log_lexesp
WARNING non-unique value ---- in Desrochers-2010-330 / familiarity_lexesp

LinguList commented 4 years ago

Not all are real problems, but to narrow this down, we need to start by:

  1. marking all language glosses as "gloss" in one of the main fields, e.g., "other", so that I can exclude them from the test. Right now, I have no apparent way of telling from the column description in norare what "english" refers to, but we need to know that it is a gloss for a given language here.

  2. marking ID columns (nel_id, stedt_id) as "identifier", so we can exclude them as well.

  3. resolving the other parts (lgc_sd, for example, is ambiguous with some other aspect there, etc.).
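
A minimal sketch of how such a check could work once glosses and identifiers are marked. This is not the actual pynorare code: the field names (`language`, `structure`, `type`, `rating`, `other`) and the helper names are assumptions for illustration. The hyphen-joined key is inferred from the warning output above, where a fully empty description would show up as `----`.

```python
from collections import defaultdict

# Hypothetical description fields; the real column metadata may differ.
FIELDS = ("language", "structure", "type", "rating", "other")
# Markers in "other" that exclude a column from the uniqueness check.
EXCLUDED = {"gloss", "identifier"}


def description_key(column):
    """Join the description fields with hyphens (empty fields stay empty)."""
    return "-".join(column.get(field, "") for field in FIELDS)


def non_unique_columns(columns):
    """Map (dataset, description key) to column names sharing that key."""
    seen = defaultdict(list)
    for column in columns:
        if column.get("other", "") in EXCLUDED:
            continue  # glosses and identifiers are not expected to be unique
        seen[(column["dataset"], description_key(column))].append(column["name"])
    return {key: names for key, names in seen.items() if len(names) > 1}
```

With this layout, two gloss columns in the same data set no longer trigger a warning, while two rating columns with identical descriptions still do.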

LinguList commented 4 years ago

You have to git-pull the most recent version, @AnnikaTjuka, from pynorare, to be able to use the norare check command. This is a first step towards having more testing.

AnnikaTjuka commented 4 years ago

Not sure if I understand the check correctly. Does it use the datasets as a basis or norare.tsv? For example, in norare.tsv we do not include "SD" columns.

LinguList commented 4 years ago

It uses norare.tsv.

LinguList commented 4 years ago

Wait, this may be different for concepticon concept lists. I'll check.

LinguList commented 4 years ago

My error: I did check all columns. I will try to correct this now.

LinguList commented 4 years ago

Perfect, now it is fixed:

$ norare check
INFO    norare at /home/mattis/data/datasets/cldf/concepticon
INFO    checking Bond-2013-OMW
INFO    checking Alonso-2015-AoA
INFO    checking Brysbaert-2009-Frequency
INFO    checking Brysbaert-2011-Frequency
INFO    checking Brysbaert-2014-Concreteness
INFO    checking Brysbaert-2019-Prevalence
INFO    checking Cai-2010-Frequency
INFO    checking Cuetos-2011-Frequency
INFO    checking Desrochers-2009-SubjFrequency
INFO    checking Engelthaler-2018-Humor
INFO    checking Juhasz-2013-SER
INFO    checking Keuleers-2010-Frequency
INFO    checking Kuperman-2012-AoA
INFO    checking Riegel-2015-AffectiveRatings
INFO    checking Scott-2019-Ratings
INFO    checking StadthagenGonzalez-2017-ValenceArousal
INFO    checking Starostin-2000-Sense
INFO    checking Warriner-2013-AffectiveRatings
INFO    checking Cortese-2008-AoA
INFO    checking Keuleers-2012-LexicalDecision
INFO    checking Ferrand-2010-LexicalDecision
INFO    checking GonzalezNosti-2014-LexicalDecision
INFO    checking Tsang-2018-LexicalDecision
INFO    checking Keuleers-2015-Prevalence
INFO    checking StadthagenGonzalez-2018-DiscreteEmotions
INFO    checking Alonso-2016-AoA
INFO    checking Imbir-2016-Ratings
INFO    checking Ferre-2017-DiscreteEmotions
INFO    checking Wierzba-2015-DiscreteEmotions
INFO    checking Alonso-2011-OralFrequency
INFO    checking Lynott-2019-Sensorimotor
INFO    checking Kapucu-2018-EmotionRatings
INFO    checking Briesemeister-2011-DiscreteEmotions
INFO    checking Mandera-2015-Frequency
INFO    checking Moors-2013-Ratings
INFO    checking Wu-2020-CoreVocabulary
INFO    checking Mohammad-2018-AffectiveRatings
INFO    checking Mohammad-2018-EmotionIntensity
INFO    checking Clark-2004-ImageryFamiliarity
INFO    checking Abdaoui-2017-EmoLex
INFO    checking Matisoff-2015-STEDT
INFO    checking Kiss-1973-EAT
INFO    checking Wikidata
INFO    checking Webster
INFO    checking OmegaWiki
INFO    checking LEGO
INFO    checking Babelnet
INFO    checking Numerals
INFO    checking Pagel-2007-200
INFO    checking Winter-2016-300
INFO    checking DiezAlamo-2018-750
INFO    checking Hill-2015-999
INFO    checking Izura-2005-499
INFO    checking Yao-2017-1100
INFO    checking Rzymski-2020-1624
INFO    checking Verheyen-2019-1000
INFO    checking Maciejewski-2016-100
INFO    checking Baroni-2011-200
INFO    checking Dellert-2018-1016
INFO    checking Luniewska-2019-299
WARNING non-unique value en-ratings-AoA-mean- in Luniewska-2019-299 / english_aoa
WARNING non-unique value en-ratings-AoA-mean- in Luniewska-2019-299 / southafricanenglish_aoa
INFO    checking Lewis-2016-499
INFO    checking Schroeder-2012-824
INFO    checking Lynott-2009-423
INFO    checking Pagel-2018-200
INFO    checking Desrochers-2010-330
INFO    checking Monnier-2014-1031
INFO    checking Gampe-2017-48
INFO    checking Lynott-2013-400
INFO    checking Xiao-2012-213
INFO    checking Luniewska-2016-299
WARNING non-unique value en-ratings-AoA-mean- in Luniewska-2016-299 / southafricanenglish_aoa

AnnikaTjuka commented 4 years ago

Yes, this looks much better :)

Concerning the warnings in the Luniewska data set: I couldn't find an ISO gloss for southafricanenglish. Should I just shorten it to "seen" or "ae"?

LinguList commented 4 years ago

No, I suggest you add "South-African English" to "other", since I use the two-letter codes to create the flags in the app...
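
As a rough illustration of why this resolves the warning (the field names and the hyphen-joined key are assumptions inferred from the `en-ratings-AoA-mean-` value in the output above, not the actual pynorare implementation): keeping the language code at "en" and putting the variety into "other" makes the two descriptions differ.

```python
# Hypothetical description fields; the real column metadata may differ.
FIELDS = ("language", "structure", "type", "rating", "other")


def description_key(column):
    """Join the description fields with hyphens (empty fields stay empty)."""
    return "-".join(column.get(field, "") for field in FIELDS)


english_aoa = {
    "language": "en", "structure": "ratings",
    "type": "AoA", "rating": "mean", "other": "",
}
# Same two-letter code, but the variety is recorded in "other".
southafricanenglish_aoa = dict(english_aoa, other="South-African English")

print(description_key(english_aoa))
print(description_key(southafricanenglish_aoa))
```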

AnnikaTjuka commented 4 years ago

Ok!