Compute proper ratings from "raw" conceptlists

concepticon / norare-data

Cross-Linguistic Norms, Ratings, and Relations for Words and Concepts

Compute proper ratings from "raw" conceptlists #184

Closed xrotwang closed 1 year ago

xrotwang commented 2 years ago

E.g. for Scheible 2014 it doesn't make a lot of sense to list the three columns IDS_IN_SOURCE,RELATION_TYPE,SCORES as three variables, when in reality they must be composed properly to convey any meaning. So maybe we should have a way to synthesize variables (with the help of a bit of python code) from such lists.

For anything in concept_set_meta we already have a place for such code. Maybe we could have something similar for conceptlists from Concepticon. E.g. have a module concept_set_meta/Scheible-2014-1755/__init__.py, which is imported when creating the CLDF data. This module could have a function per variable specified in norare.tsv, e.g.

def SCORED_RELATIONS(conceptlist):
    # compute the scored relations as a list of dicts (one dict per concept)
    return []

In the Scheible 2014 case, values of SCORED_RELATIONS might be dicts like

{
    "abziehen": {"ANT": 5.5, "HYP": 0, "SYN": 0.8333},
    "anhäufen": {"ANT": 1.3333, "HYP": 0.6667, "SYN": 2.5 }, 
    ...
}

computed from the "raw" column values (IDS_IN_SOURCE, RELATION_TYPE and SCORES, in that order):

3378:überziehen-abziehen:0 3382:überziehen-anhäufen:0 3598:überziehen-ziehen:0 3601:überziehen-anziehen:0 3786:überziehen-umhüllen:0 3789:überziehen-verschulden:0 4357:überziehen-abziehen:0 4358:überziehen-anhäufen:0 4359:überziehen-anziehen:0 4360:überziehen-umhüllen:0 4361:überziehen-verschulden:0 4362:überziehen-ziehen:0 4506:überziehen-abziehen:0 4510:überziehen-anhäufen:0 4726:überziehen-ziehen:0 4729:überziehen-anziehen:0 4914:überziehen-umhüllen:0 4917:überziehen-verschulden:0

ANT ANT ANT ANT ANT ANT HYP HYP HYP HYP HYP HYP SYN SYN SYN SYN SYN SYN

5.5 1.3333 0.3333 0.5 1.1667 0.3333 0 0.6667 6.1667 6.6667 4.8333 2.5 0.8333 2.5 1.6667 6 6.8333 4.5
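A minimal sketch of how the three raw columns above could be composed into the proposed structure. The function name `scored_relations` is hypothetical, and the code assumes that each ID has the form `<number>:<target>-<related word>:<index>` and that the target word itself contains no hyphen (true for "überziehen", but not guaranteed for other lists).

```python
import collections

# Raw column values for "überziehen", copied from the comment above.
IDS_IN_SOURCE = (
    "3378:überziehen-abziehen:0 3382:überziehen-anhäufen:0 "
    "3598:überziehen-ziehen:0 3601:überziehen-anziehen:0 "
    "3786:überziehen-umhüllen:0 3789:überziehen-verschulden:0 "
    "4357:überziehen-abziehen:0 4358:überziehen-anhäufen:0 "
    "4359:überziehen-anziehen:0 4360:überziehen-umhüllen:0 "
    "4361:überziehen-verschulden:0 4362:überziehen-ziehen:0 "
    "4506:überziehen-abziehen:0 4510:überziehen-anhäufen:0 "
    "4726:überziehen-ziehen:0 4729:überziehen-anziehen:0 "
    "4914:überziehen-umhüllen:0 4917:überziehen-verschulden:0"
).split()
RELATION_TYPE = ("ANT " * 6 + "HYP " * 6 + "SYN " * 6).split()
SCORES = (
    "5.5 1.3333 0.3333 0.5 1.1667 0.3333 "
    "0 0.6667 6.1667 6.6667 4.8333 2.5 "
    "0.8333 2.5 1.6667 6 6.8333 4.5"
).split()


def scored_relations(ids, reltypes, scores):
    """Combine three parallel lists into {related_word: {relation_type: score}}."""
    if not (len(ids) == len(reltypes) == len(scores)):
        raise ValueError("the three columns must have the same length")
    result = collections.defaultdict(dict)
    for item, reltype, score in zip(ids, reltypes, scores):
        # "3378:überziehen-abziehen:0" -> "überziehen-abziehen"
        pair = item.split(":")[1]
        # Assumption: "<target>-<related word>", with no hyphen inside the target.
        related = pair.split("-", 1)[1]
        result[related][reltype] = float(score)
    return dict(result)


RELATIONS = scored_relations(IDS_IN_SOURCE, RELATION_TYPE, SCORES)
print(RELATIONS["abziehen"])  # {'ANT': 5.5, 'HYP': 0.0, 'SYN': 0.8333}
```

The length check matters because the columns only carry meaning when read position by position; a silent `zip` over columns of unequal length would drop values without warning.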

xrotwang commented 2 years ago

What do you think @LinguList @AnnikaTjuka ?

xrotwang commented 2 years ago

Lapesa 2014 seems to be a similar case, where lists of values for multiple columns need to be combined to create a meaningful variable:

253:to-accustom-to-know:0 254:to-accustom-to-alienate:0 255:to-accustom-to-familiarize:0
→381 →29 →252
poli1 poli1 poli1
freqMed freqMed freqMed
LOW NOT HIGH
SYN SYN SYN
2.7 0 4.2
-2.6 0.1 4.3
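As a rough illustration of what "combining" means here, the parallel columns can be transposed into one record per relation. The thread shows the values but not the column headers, so every key below (`LINKS`, `COL_3`, `SCORES_A`, ...) is a placeholder rather than the real column name, and `records_from_columns` is a hypothetical helper.

```python
# Values copied from the comment above; the keys are placeholders,
# not the actual column names of the Lapesa 2014 list.
COLUMNS = {
    "IDS_IN_SOURCE": [
        "253:to-accustom-to-know:0",
        "254:to-accustom-to-alienate:0",
        "255:to-accustom-to-familiarize:0",
    ],
    "LINKS": ["→381", "→29", "→252"],
    "COL_3": ["poli1", "poli1", "poli1"],
    "COL_4": ["freqMed", "freqMed", "freqMed"],
    "COL_5": ["LOW", "NOT", "HIGH"],
    "RELATION_TYPE": ["SYN", "SYN", "SYN"],
    "SCORES_A": [2.7, 0, 4.2],
    "SCORES_B": [-2.6, 0.1, 4.3],
}


def records_from_columns(columns):
    """Turn {column: [v1, v2, ...]} into [{column: v1, ...}, {column: v2, ...}, ...]."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same number of values")
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


RECORDS = records_from_columns(COLUMNS)
print(RECORDS[2]["IDS_IN_SOURCE"], RECORDS[2]["SCORES_A"])
```

Each record then carries everything needed to interpret a single score, which a bare list of numbers does not.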

AnnikaTjuka commented 2 years ago

Yes, there are several cases where we have this kind of network data. I guess @LinguList knows best how to proceed with these.

LinguList commented 2 years ago

Yes, sorry for answering late: for network data, it would be extremely useful to have a graph class for internal representation instead of having to construct graphs in each comparison again.
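To make the idea concrete, such a graph class might look roughly like the sketch below. `RelationGraph` and its methods are hypothetical and not part of pynorare or any existing library; edges are stored per source word, labelled with a relation type and weighted with a score.

```python
from collections import defaultdict


class RelationGraph:
    """Labelled, weighted edges: source word -> target word -> {relation: weight}."""

    def __init__(self):
        self._edges = defaultdict(lambda: defaultdict(dict))

    def add_edge(self, source, target, relation, weight):
        self._edges[source][target][relation] = float(weight)

    def neighbors(self, source):
        """Return {target: {relation: weight}} for one source word."""
        # .get() avoids creating empty entries for unknown words.
        return {t: dict(r) for t, r in self._edges.get(source, {}).items()}

    def weight(self, source, target, relation, default=None):
        return self._edges.get(source, {}).get(target, {}).get(relation, default)


# Usage with a few values from the Scheible 2014 example in this thread.
graph = RelationGraph()
graph.add_edge("überziehen", "abziehen", "ANT", 5.5)
graph.add_edge("überziehen", "abziehen", "HYP", 0)
graph.add_edge("überziehen", "abziehen", "SYN", 0.8333)
graph.add_edge("überziehen", "anhäufen", "ANT", 1.3333)
print(graph.neighbors("überziehen")["abziehen"])
```

Building the graph once and querying it through `neighbors` or `weight` avoids reconstructing the same structure in every comparison.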

LinguList commented 2 years ago

We discuss networks in two blogposts. https://calc.hypotheses.org/2684

LinguList commented 2 years ago

https://calc.hypotheses.org/2684

xrotwang commented 2 years ago

@LinguList Your "graph class" seems to be roughly compatible with my proposal for the Scheible data above:

{
    "abziehen": {"ANT": 5.5, "HYP": 0, "SYN": 0.8333},
    "anhäufen": {"ANT": 1.3333, "HYP": 0.6667, "SYN": 2.5 }, 
    ...
}

right?

LinguList commented 2 years ago

Yes.

LinguList commented 2 years ago

I think it is in the same spirit: when creating the current representations, we had in mind an extended representation that could be derived by a script.

LinguList commented 2 years ago

Ideally, your proposal adds one additional test to the data.

xrotwang commented 2 years ago

I think cases like this also highlight the usefulness of NoRaRe as a place where conceptsets are mapped to typed data - rather than just a couple of columns in a CSV - as in Concepticon.

LinguList commented 2 years ago

Yes. Also a beautiful point to be made in a paper.

xrotwang commented 2 years ago

Ok, so I'll add support for this to pynorare then?

xrotwang commented 2 years ago

I have a working pynorare now, which does the right thing, given a Scheible-2014-1755/map.py looking like

import collections

from pynorare.dataset import NormDataSet
from csvw import Column

def compute_scored_relations(row):
    # Combine the three parallel list-valued columns into one nested dict.
    row['SCORED_RELATIONS'] = collections.defaultdict(dict)
    for name, reltype, score in zip(row['IDS_IN_SOURCE'], row['RELATION_TYPE'], row['SCORES']):
        # name looks like "3378:überziehen-abziehen:0"; key by the middle part (the word pair).
        row['SCORED_RELATIONS'][name.split(':')[1]][reltype] = float(score)
    return row

class Dataset(NormDataSet):
    id = "Scheible-2014-1755"

    def map(self, write_file=True):
        # Table group describing the concept list as shipped with Concepticon.
        tg = self.concepticon.conceptlists[self.id].tg
        items = [compute_scored_relations(row) for row in tg.tables[0]]
        # Declare the synthesized column as JSON-typed in the table schema.
        tg.tables[0].tableSchema.columns.append(
            Column.fromvalue(dict(name="SCORED_RELATIONS", datatype="json")))
        tg.write(self.meta.norare_dsdir / self.mdname, **{self.fname: items})

In terms of the amount of custom code, I think that's acceptable.

AnnikaTjuka commented 1 year ago

@xrotwang @LinguList Is this already implemented? Or should we keep it for the next version 1.1?

xrotwang commented 1 year ago

Implemented, so this can be closed. Somewhat related is https://github.com/concepticon/concepticon-data/issues/1217 - and I'd suggest moving that issue to a later milestone (in concepticon-data).

AnnikaTjuka commented 1 year ago

Ok, great. Will move the other one.

xrotwang commented 1 year ago

Vulic-2020-2244 seems to be a similar case.

xrotwang commented 1 year ago

It seems that the SimLex list in Concepticon is missing the translations of the second word for each word-pair similarity rating. E.g. https://concepticon.clld.org/values/Vulic-2020-2244-3 should not list the translation of the main concept three times, but rather the translations of the SIMLEX_GLOSSES. Fixing this would be a prerequisite to then aggregating the similarity ratings per concept and language into something like

{"arm": 1.5, "bone": 3.4}

...

LinguList commented 1 year ago

I am not sure I understand this. The word in question is English muscle, linked to the Concepticon concept set MUSCLE.

Simlex Glosses are ['arm', 'tongue', 'bone']. The IDS give us information on how to construct the original SIMLEX word pairs:

['1:2', '690:1', '696:1']

muscle:arm tongue:muscle bone:muscle

In this case, the English word form is the same, but since this is not always the case, we have the info on

English_in_source: ['muscle', 'muscle', 'muscle']

etc.

This only relates to the gloss in question.

If one wants to retrieve the counterpart, that is, the concept to which the base gloss is linked, one needs to check in the links:

Links: ['306', '62', '172']

Entry Vulic-2020-2244-306

is linked to concept set ARM here.

So the information is there, but yes, one could add ALL the other glosses that are linked to as well for each concept, but it was not considered important when creating the file, as our primary interest is the reflection of the key word form we look at here.

And note that the repetition of ['muscle', 'muscle', 'muscle'] is important, since there are cases where a language has different words here, which we manually judged to be synonyms when creating the database.
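The lookup described in this comment can be sketched as follows. Both helper names are hypothetical. The reading of the IDs is an assumption inferred only from the single example above: in "690:1", the number after the colon is taken to be the 1-based slot that the SIMLEX gloss occupies in the original pair. Entry IDs are assumed to be the list ID plus the link, as in Vulic-2020-2244-306.

```python
def reconstruct_pairs(base, glosses, ids):
    """Rebuild the original word pairs for one base word.

    Assumption (inferred from the 'muscle' example only): the part after
    the colon gives the position of the SIMLEX gloss within the pair.
    """
    pairs = []
    for gloss, ident in zip(glosses, ids):
        _pair_number, position = ident.split(":")
        pairs.append((gloss, base) if position == "1" else (base, gloss))
    return pairs


def linked_entries(list_id, links):
    """Build the IDs of the entries that the links point to."""
    return ["{}-{}".format(list_id, link) for link in links]


PAIRS = reconstruct_pairs("muscle", ["arm", "tongue", "bone"], ["1:2", "690:1", "696:1"])
ENTRIES = linked_entries("Vulic-2020-2244", ["306", "62", "172"])
print(PAIRS)    # [('muscle', 'arm'), ('tongue', 'muscle'), ('bone', 'muscle')]
print(ENTRIES)  # ['Vulic-2020-2244-306', 'Vulic-2020-2244-62', 'Vulic-2020-2244-172']
```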

xrotwang commented 1 year ago

Ah, I see. Ok, will use this structure to compute a useful representation for NoRaRe. I think in particular in the NoRaRe case, it is important to have meaningful individual values for variables. So extracting just a list of numbers here (e.g. ENGLISH_SCORE = [0.45, 0.76, 0.34]) would seem nonsensical. Maybe

{"arm": 1.5, "bone": 3.4}

is a bit too terse, but something like

{"arm": [1.5, "Vulic-2020-2244-306"], ...}

could be reasonable.
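A minimal sketch of how that representation could be assembled from the parallel lists. The function name `similarity_by_counterpart` is hypothetical, and the example is restricted to the two glosses for which this thread mentions an illustrative score (1.5 and 3.4 are not actual SimLex ratings). The links for "arm" and "bone" are the ones quoted earlier in the thread.

```python
def similarity_by_counterpart(list_id, glosses, links, scores):
    """Map each counterpart gloss to [score, ID of the linked entry]."""
    if not (len(glosses) == len(links) == len(scores)):
        raise ValueError("glosses, links and scores must have the same length")
    return {
        gloss: [float(score), "{}-{}".format(list_id, link)]
        for gloss, link, score in zip(glosses, links, scores)
    }


# Illustrative values only, taken from the examples in this thread.
VALUE = similarity_by_counterpart(
    "Vulic-2020-2244",
    glosses=["arm", "bone"],
    links=["306", "172"],
    scores=[1.5, 3.4],
)
print(VALUE)
# {'arm': [1.5, 'Vulic-2020-2244-306'], 'bone': [3.4, 'Vulic-2020-2244-172']}
```

Keeping the entry ID next to the score means the counterpart can still be resolved to its concept set, even where the gloss alone would be ambiguous.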

LinguList commented 1 year ago

Yes, I see, and I agree that a better representation is surely needed, given the confusion it causes and the headache it gives me every time I try to re-understand the data we created there...