clics / clicsbp

CLDF dataset on Body Part Colexifications
Creative Commons Attribution 4.0 International

Emotion Concepts: Check with CLICS for Coverage #16

Closed LinguList closed 2 years ago

LinguList commented 2 years ago

@AnnikaTjuka, we have low coverage on emotion concepts since there are some datasets not in phonetic transcription, such as the Tanzania Language Survey, etc. I think it is important to check the CLICS website for all 20 emotion concepts to see if there are more datasets without phonetic transcription which we may need. If this is the case, we'd add the phonetic transcriptions in a rudimentary manner (the Tanzania Language Survey, for example, can be more or less mastered).

AnnikaTjuka commented 2 years ago

I checked the CLICS website for emotion concepts. Most emotion concepts occur in IDS (sometimes only there). Concepts with an ID > 3468 aren't listed in CLICS since they were added recently. I noted the datasets which we don't have in clicsbp/etc/datasets.tsv yet.

Not sure if phonetic transcriptions are feasible for all of them:

LinguList commented 2 years ago

Lexi-Rumah can be added, it is just a different provider and we cannot update the data ourselves, but rely on them.

halenepal is available in phonetic transcriptions.

transnewguineaorg is also available, but you need to identify the language families of interest (ideally with > 10 members), and they can have smaller amounts of concepts, but this is something we should check.

TLS is a project that we MIGHT be able to turn to phonetic transcriptions by adding orthography profiles. A hard task, but I could start giving it a try end of January after my talks.
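To illustrate what an orthography profile does, here is a minimal sketch: a greedy longest-match lookup that maps orthographic graphemes to transcription symbols. The `PROFILE` mapping and the helper `segment` are hypothetical and for illustration only; they are not the actual TLS orthography and not the API of any existing tool.

```python
def segment(word, profile):
    """Greedy longest-match segmentation of `word` with `profile`.

    `profile` maps graphemes (possibly several characters long) to
    transcription symbols. Characters missing from the profile are kept
    and wrapped in angle brackets so they are easy to spot.
    """
    # Try longer graphemes first so that "ng'" wins over "n".
    graphemes = sorted(profile, key=len, reverse=True)
    segments, i = [], 0
    while i < len(word):
        for grapheme in graphemes:
            if word.startswith(grapheme, i):
                segments.append(profile[grapheme])
                i += len(grapheme)
                break
        else:
            # No grapheme matched at position i: flag the character.
            segments.append("<" + word[i] + ">")
            i += 1
    return segments


# Hypothetical profile, NOT taken from the Tanzania Language Survey.
PROFILE = {
    "ng'": "ŋ", "ch": "tʃ", "sh": "ʃ", "ny": "ɲ",
    "a": "a", "e": "e", "i": "i", "o": "o", "u": "u",
    "b": "b", "g": "g", "k": "k", "m": "m", "n": "n",
}

print(segment("ng'ombe", PROFILE))  # ['ŋ', 'o', 'm', 'b', 'e']
```

Flagging unmatched characters instead of dropping them makes it easy to see how much of a wordlist a rudimentary profile already covers.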

LinguList commented 2 years ago

For several parts of IDS, we can work with subsets, e.g., we have Panoan languages coded by John Miller. Here, we would have to ask John for permission to use the data before they have published something on them, but we may be able to publish something sooner.

LinguList commented 2 years ago

For Caucasian languages, we may be able to pull them from IDS in phonological transcription that MIGHT be easy to modify. One should ask Ilia about this, but emphasize that this is a pragmatic goal and that no perfection is intended (as he does not like the data).

AnnikaTjuka commented 2 years ago

Ok, I'll add halenepal first since it seems straightforward. I'm guessing that the others can't be added with one line in datasets.tsv, but I'll have a look at transnewguineaorg to see which language families could be added.

AnnikaTjuka commented 2 years ago

Looks like the number of language family members in "transnewguineaorg" is easy to find out thanks to Simon's overview: http://transnewguinea.org/family/?sort=-count

| Family | Count |
|---|---|
| Trans-New Guinea | 687 |
| Austronesian | 40 |
| Sepik | 37 |
| Lower Sepik-Ramu | 36 |
| Lakes Plain | 32 |
| Torricelli | 28 |
| Tor-Kwerba | 14 |

I excluded the ones where there were more than 10 entries but most of them were dialects, for example, South-Central Papuan.
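The selection described here (more than 10 members, minus manually excluded families) can be sketched in a few lines. The counts are copied from the overview above; the helper names `parse_counts` and `large_families` are made up for this sketch, and the exclusion of dialect-heavy families still has to be decided by hand.

```python
RAW = """\
Trans-New Guinea 687
Austronesian 40
Sepik 37
Lower Sepik-Ramu 36
Lakes Plain 32
Torricelli 28
Tor-Kwerba 14
"""


def parse_counts(text):
    """Parse lines of the form '<family name> <count>' into a dict."""
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Family names may contain spaces, so split on the last one only.
        name, count = line.rsplit(" ", 1)
        counts[name] = int(count)
    return counts


def large_families(counts, min_members=10, exclude=()):
    """Families with more than `min_members` entries, minus `exclude`."""
    return [
        name for name, n in counts.items()
        if n > min_members and name not in exclude
    ]


counts = parse_counts(RAW)
print(large_families(counts))
```

Families that consist mostly of dialects can be passed via `exclude` once they have been identified manually.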

LinguList commented 2 years ago

You can see that a) the Trans-New Guinea entries are all languages, and b) the counts are quite low for individual language families. So one would have to dig specifically for language families with larger wordlists...

AnnikaTjuka commented 2 years ago

@LinguList I checked several lists that might contain emotion concepts, especially from the transnewguinea.org database, but could not find additional lists with at least 2-5 emotion concepts.

So out of 20 language families, 16 families don't have enough coverage of emotion concepts (and no colexifications were found):

The following lists don't have emotion concepts, but had a few color concepts:

Not sure what the best way forward is. Should we collect IDS data sets for each family?

LinguList commented 2 years ago

Means we need to go for IDS then.

LinguList commented 2 years ago

We'd need to start this as follows:

  1. check emotion concepts in the IDS list
  2. make a very quick commandline check of how well one individual language family covers the emotion concepts in IDS
  3. decide how difficult it is to make orthoprofiles

We don't need perfection for orthoprofiles, just some consistency, as in the Bantu database (I cannot think of the abbreviation now).
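The quick coverage check from point 2 could look roughly like the sketch below. It assumes the forms are available as a flat table with the columns `Family`, `Language`, `Concepticon_Gloss`, and `Form`; these column names, the sample rows, and the helper `coverage_by_family` are assumptions for illustration, not the actual layout of the IDS data.

```python
import csv
import io
from collections import defaultdict


def coverage_by_family(rows, concepts):
    """Share of languages per family that attest each concept.

    `rows` is an iterable of dicts with the (assumed) keys
    Family, Language, Concepticon_Gloss, and Form.
    """
    languages = defaultdict(set)
    attested = defaultdict(lambda: defaultdict(set))
    for row in rows:
        family, language = row["Family"], row["Language"]
        languages[family].add(language)
        # Count a concept only if there is an actual word form.
        if row["Concepticon_Gloss"] in concepts and row["Form"].strip():
            attested[family][row["Concepticon_Gloss"]].add(language)
    return {
        family: {
            concept: len(attested[family][concept]) / len(langs)
            for concept in sorted(concepts)
        }
        for family, langs in languages.items()
    }


# Placeholder rows, not real data.
SAMPLE = """\
Family,Language,Concepticon_Gloss,Form
FamilyX,lang1,FEAR,form1
FamilyX,lang1,HAND,form2
FamilyX,lang2,FEAR,form3
FamilyX,lang2,ANGER,form4
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for family, shares in coverage_by_family(rows, {"FEAR", "ANGER"}).items():
    print(family, shares)
```

Reporting a share per concept rather than a single number makes it visible when a concept is in the list but missing from most languages of a family.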

LinguList commented 2 years ago

I'll see if I find time to help with point 2. You, @AnnikaTjuka, might just quickly want to check the expected emotion concepts in IDS according to 1. There's no guarantee that they are reflected in all languages, which is why we need to check carefully here.

AnnikaTjuka commented 2 years ago

Sounds like a good plan! I'll make a list of emotion concepts in the IDS list.

AnnikaTjuka commented 2 years ago

One more thing: I just found a discrepancy in the emotion concepts themselves. My list, Tjuka-2021-192, included only nouns. But Jackson et al. used nouns, adjectives, and verbs. So, for example, Tjuka-2021-192 includes HAPPINESS whereas Jackson-2019 includes HAPPY.

When I add the 17 additional verb/adj concepts from Jackson et al., the coverage becomes better for 6 language families. But 10 language families still don't show colexifications for emotion concepts.

Should I search for more verb/adj concepts in the "Emotions and values" category that were added in the last Concepticon version? This could further improve the coverage but also makes the comparison a bit fuzzier, because colexifications between HAPPY and HAPPINESS would then be possible.

LinguList commented 2 years ago

Sorry, I missed that, but yes, this was the strategy to make sure that we'd find enough concepts with actual word forms: including verbs and adjectives, like WANT and HOPE, etc. So what we could do is use the emotion concepts by Jackson et al. as our reference, and the color and the bp terms by Tjuka 2021. Implementing this in clicsbp is easy.

LinguList commented 2 years ago

@AnnikaTjuka, the overlap in concepts can be computed via the concepticon command line: concepticon intersection Concept-List-1 Concept-List-2. You know the command, right? If you use that with Jackson's list and the IDS list (Key-2016-XXX), it should give all overlapping concepts.
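For readers without the command at hand, the idea behind the overlap can be sketched with plain sets. This is not the concepticon implementation; it assumes tab-separated concept lists with `CONCEPTICON_ID` and `CONCEPTICON_GLOSS` columns, and the IDs and glosses in the sample are placeholders, not real Concepticon entries.

```python
import csv
import io


def linked_concepts(tsv_text):
    """Map Concepticon ID to gloss for all linked rows of a concept list."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {
        row["CONCEPTICON_ID"]: row["CONCEPTICON_GLOSS"]
        for row in reader
        if row["CONCEPTICON_ID"]  # skip rows that are not linked
    }


def overlap(list_a, list_b):
    """Concepts present in both lists, sorted by numeric ID."""
    shared = list_a.keys() & list_b.keys()
    return sorted((int(cid), list_a[cid]) for cid in shared)


# Placeholder lists; the IDs and glosses are invented for the sketch.
LIST_A = (
    "ID\tCONCEPTICON_ID\tCONCEPTICON_GLOSS\n"
    "A-1\t9001\tCONCEPT_A\n"
    "A-2\t9002\tCONCEPT_B\n"
    "A-3\t\t\n"
)
LIST_B = (
    "ID\tCONCEPTICON_ID\tCONCEPTICON_GLOSS\n"
    "B-1\t9002\tCONCEPT_B\n"
    "B-2\t9003\tCONCEPT_C\n"
)

print(overlap(linked_concepts(LIST_A), linked_concepts(LIST_B)))
# [(9002, 'CONCEPT_B')]
```

Comparing on the Concepticon ID rather than on the English gloss avoids missing matches when two lists label the same concept differently.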

AnnikaTjuka commented 2 years ago

I see. The Jackson list has more overlap with IDS (17 emotion concepts) than Tjuka 2021 (8 emotion concepts). But if we use the Jackson list as a reference for the emotion concepts and add IDS, we are repeating the Jackson analysis instead of extending it. Another strategy would be to add verbs and adjectives to Tjuka 2021 (including new concepts from Concepticon version 2.5) and then test whether or not the coverage for emotion concepts is good enough without adding IDS. Would it be okay if I test this strategy first before we start adding IDS?

LinguList commented 2 years ago

Yes, by all means, as this may allow us to advance faster!

AnnikaTjuka commented 2 years ago

Ok, great. Will do it tomorrow.

AnnikaTjuka commented 2 years ago

I added 28 new emotion concepts to Tjuka 2021 (including the missing Jackson concepts). The coverage improves and now 12 out of 20 language families show colexifications in emotion concepts.
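As a rough sketch of how such a count can be obtained: a colexification is taken here to mean that one language uses the same form for two different concepts, and a family counts as soon as one of its languages colexifies two emotion concepts. Restricting both members of the pair to emotion concepts is an assumption, as are the helper names and the placeholder data; the actual clicsbp workflow may differ.

```python
from collections import defaultdict
from itertools import combinations


def colexifications(forms):
    """Collect colexified concept pairs per family.

    `forms` is an iterable of (family, language, concept, form) tuples.
    """
    concepts_by_form = defaultdict(set)
    family_of = {}
    for family, language, concept, form in forms:
        concepts_by_form[(language, form)].add(concept)
        family_of[language] = family
    found = defaultdict(set)
    for (language, _form), concepts in concepts_by_form.items():
        # Every pair of distinct concepts sharing a form is a colexification.
        for pair in combinations(sorted(concepts), 2):
            found[family_of[language]].add(pair)
    return found


def families_with_colexifications(forms, concepts):
    """Families with at least one colexification among `concepts`."""
    selected = [entry for entry in forms if entry[2] in concepts]
    return sorted(colexifications(selected))


# Placeholder data, not taken from any dataset.
FORMS = [
    ("FamilyX", "lang1", "FEAR", "form1"),
    ("FamilyX", "lang1", "SURPRISED", "form1"),
    ("FamilyY", "lang2", "FEAR", "form2"),
    ("FamilyY", "lang2", "ANGER", "form3"),
    ("FamilyY", "lang2", "HAND", "form2"),
]
EMOTIONS = {"FEAR", "SURPRISED", "ANGER"}

print(families_with_colexifications(FORMS, EMOTIONS))  # ['FamilyX']
```

In the placeholder data, FamilyY colexifies FEAR with HAND, which does not count because HAND is not an emotion concept.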

I also performed the ARI analysis and the graph looks very similar to Jackson:

[Image: ari]
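For reference, the adjusted Rand index (ARI) measures how well two clusterings of the same items agree, corrected for chance. The sketch below is a plain implementation of the standard formula; how the analysis in this thread computed it, and what exactly was clustered, is not shown here, so the function name and the toy labels are illustrative only.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index of two clusterings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    n = len(labels_a)
    if n < 2:
        return 1.0  # trivial case: no pairs to disagree on
    # Pairs of items that share a cluster in both clusterings.
    together_both = sum(
        comb(count, 2) for count in Counter(zip(labels_a, labels_b)).values()
    )
    together_a = sum(comb(count, 2) for count in Counter(labels_a).values())
    together_b = sum(comb(count, 2) for count in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both put everything in one cluster
    return (together_both - expected) / (maximum - expected)


# Toy labels: identical partitions up to renaming give 1.0.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

Values near 0 indicate agreement at chance level, and negative values indicate less agreement than expected by chance.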

I will finalize the PRs for the changes in the list and the analysis by evening. I will also update Tjuka 2021 in Concepticon. I guess then we can tackle the power analysis with Damián next week to see if the data is sufficient.

LinguList commented 2 years ago

If you update Tjuka 2021 in Concepticon, will it be out of sync in any way with your blog post?

AnnikaTjuka commented 2 years ago

Good point! Yes, it will be out of sync, because the number of emotion concepts discussed in the post won't match the list. Should I create a new version "Tjuka 2022" and add it to Concepticon instead?

LinguList commented 2 years ago

Yes, seems like the best plan. You could also consider writing a one-page update for our blog, so we can again have an updated reference? Really only one page, for documentation.

AnnikaTjuka commented 2 years ago

Sure! I'll write a draft next week.