xrotwang closed 1 year ago
What do you think @LinguList @AnnikaTjuka ?
Lapesa 2014 seems to be a similar case, where lists of values for multiple columns need to be combined to create a meaningful variable:
253:to-accustom-to-know:0 254:to-accustom-to-alienate:0 255:to-accustom-to-familiarize:0
→381 →29 →252
poli1 poli1 poli1
freqMed freqMed freqMed
LOW NOT HIGH
SYN SYN SYN
2.7 0 4.2
-2.6 0.1 4.3
Yes, there are several cases where we have this kind of network data. I guess @LinguList knows best how to proceed with these.
Yes, sorry for answering late: for network data, it would be extremely useful to have a graph class for internal representation instead of having to construct graphs in each comparison again.
We discuss networks in two blogposts. https://calc.hypotheses.org/2684
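To make the "graph class" idea a bit more concrete, here is a minimal sketch of what such an internal representation might look like. The class name `ConceptGraph`, its methods, and the example weights are all hypothetical illustrations, not part of pynorare or any existing library.

```python
from collections import defaultdict


class ConceptGraph:
    """Hypothetical minimal graph: nodes are words or concepts, edges carry typed weights."""

    def __init__(self):
        # adjacency[source][target] -> {relation_type: weight}
        self.adjacency = defaultdict(lambda: defaultdict(dict))

    def add_edge(self, source, target, relation, weight):
        """Store one weighted, typed edge from `source` to `target`."""
        self.adjacency[source][target][relation] = float(weight)

    def neighbors(self, source):
        """Return the targets linked to `source`, sorted for stable output."""
        return sorted(self.adjacency[source]) if source in self.adjacency else []

    def weight(self, source, target, relation):
        """Return the stored weight, or None if there is no such edge."""
        return self.adjacency.get(source, {}).get(target, {}).get(relation)


# Illustrative values only, not taken from any dataset.
graph = ConceptGraph()
graph.add_edge("muscle", "arm", "similarity", 1.5)
graph.add_edge("muscle", "bone", "similarity", 3.4)
print(graph.neighbors("muscle"))
print(graph.weight("muscle", "bone", "similarity"))
```

Building the graph once per dataset would avoid reconstructing it in every comparison, which is the motivation given above.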
@LinguList Your "graph class" seems to be roughly compatible with my proposal for the Scheible data above:
{
    "abziehen": {"ANT": 5.5, "HYP": 0, "SYN": 0.8333},
    "anhäufen": {"ANT": 1.3333, "HYP": 0.6667, "SYN": 2.5},
    ...
}
right?
Yes.
I think it is the same spirit: when creating these current representations we thought of some extended representation that could be done via some script.
Ideally, your proposal adds one additional test to the data.
I think cases like this also highlight the usefulness of NoRaRe as a place where conceptsets are mapped to typed data - rather than just a couple of columns in a CSV - as in Concepticon.
Yes. Also a beautiful point to be made in a paper.
Ok, so I'll add support for this to pynorare then?
I have a working pynorare now, which does the right thing, given a Scheible-2014-1755/map.py looking like
import collections

from pynorare.dataset import NormDataSet
from csvw import Column


def compute_scored_relations(row):
    # Combine the three parallel list columns into one nested mapping:
    # {word: {relation_type: score}}
    row['SCORED_RELATIONS'] = collections.defaultdict(dict)
    for name, reltype, score in zip(row['IDS_IN_SOURCE'], row['RELATION_TYPE'], row['SCORES']):
        # The word is the second colon-separated part of the source ID.
        row['SCORED_RELATIONS'][name.split(':')[1]][reltype] = float(score)
    return row


class Dataset(NormDataSet):
    id = "Scheible-2014-1755"

    def map(self, write_file=True):
        tg = self.concepticon.conceptlists[self.id].tg
        items = [compute_scored_relations(row) for row in tg.tables[0]]
        # Declare the synthesized column as JSON in the table schema.
        tg.tables[0].tableSchema.columns.append(
            Column.fromvalue(dict(name="SCORED_RELATIONS", datatype="json")))
        tg.write(self.meta.norare_dsdir / self.mdname, **{self.fname: items})
In terms of the amount of custom code, I think that's acceptable.
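To illustrate what the row-level function produces, here is a small standalone run on a made-up row. The function body is copied from the map.py above so the example runs without pynorare; the "number:word" shape of the IDs and the concrete values are assumptions chosen to match the "abziehen" example earlier in this thread.

```python
import collections
import json


def compute_scored_relations(row):
    # Same logic as in map.py, repeated so the example runs on its own.
    row['SCORED_RELATIONS'] = collections.defaultdict(dict)
    for name, reltype, score in zip(
            row['IDS_IN_SOURCE'], row['RELATION_TYPE'], row['SCORES']):
        row['SCORED_RELATIONS'][name.split(':')[1]][reltype] = float(score)
    return row


# Made-up row; the ID format "number:word" is an assumption.
row = {
    'IDS_IN_SOURCE': ['1:abziehen', '2:abziehen', '3:abziehen'],
    'RELATION_TYPE': ['ANT', 'HYP', 'SYN'],
    'SCORES': ['5.5', '0', '0.8333'],
}
result = compute_scored_relations(row)['SCORED_RELATIONS']
print(json.dumps(result))
# → {"abziehen": {"ANT": 5.5, "HYP": 0.0, "SYN": 0.8333}}
```

Since the result is a plain nested dict, it serializes directly to the JSON datatype declared for the new column.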
@xrotwang @LinguList Is this already implemented? Or should we keep it for the next version 1.1?
Implemented, so can be closed. Somewhat related is https://github.com/concepticon/concepticon-data/issues/1217 - and I'd suggest moving that issue to a later milestone (in concepticon-data).
Ok, great. Will move the other one.
Vulic-2020-2244 seems to be a similar case.
It seems that the SimLex list in Concepticon is missing the translations of the second word for each word-pair similarity rating. E.g. https://concepticon.clld.org/values/Vulic-2020-2244-3 should not have these lists of three times the translation of the main concept, but the translations of the SIMLEX_GLOSSES. Fixing this would be a prerequisite to then aggregating the similarity ratings per concept and language into something like
{"arm": 1.5, "bone": 3.4}
...
I am not sure I understand this. The word in question is English muscle, linked to the Concepticon concept set MUSCLE.
SimLex glosses are ['arm', 'tongue', 'bone']. The IDS give us information on how to construct the original SIMLEX word pairs:
['1:2', '690:1', '696:1']
muscle:arm tongue:muscle bone:muscle
In this case, the English word form is the same, but since this is not always the case, we have the info on
English_in_source: ['muscle', 'muscle', 'muscle']
etc.
This only relates to the gloss in question.
If one wants to retrieve the counterpart, that is, the concept to which the base gloss is linked, one needs to check in the links:
Links: ['306', '62', '172']
Entry Vulic-2020-2244-306 is linked to concept set ARM here.
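As a rough sketch of the lookup described above, the following reconstructs the word pairs and the linked entry IDs for the muscle example. It assumes that the number after the colon in each ID gives the position of the SimLex gloss within the pair (this matches the three examples shown here, but is only an inference), and the function and key names are hypothetical.

```python
def reconstruct_pairs(base_words, glosses, ids, links, list_id):
    """Rebuild the source word pairs and the IDs of the linked entries."""
    pairs = []
    for base, gloss, id_, link in zip(base_words, glosses, ids, links):
        pair_id, gloss_position = id_.split(':')
        # Assumption: the suffix is the gloss's position in the pair,
        # so '1:2' means the gloss comes second (muscle:arm).
        pair = (base, gloss) if gloss_position == '2' else (gloss, base)
        pairs.append({
            'pair_id': pair_id,
            'pair': ':'.join(pair),
            'linked_entry': '{}-{}'.format(list_id, link),
        })
    return pairs


pairs = reconstruct_pairs(
    base_words=['muscle', 'muscle', 'muscle'],
    glosses=['arm', 'tongue', 'bone'],
    ids=['1:2', '690:1', '696:1'],
    links=['306', '62', '172'],
    list_id='Vulic-2020-2244',
)
for p in pairs:
    print(p['pair'], p['linked_entry'])
```

The first pair resolves to Vulic-2020-2244-306, the entry linked to ARM as noted above.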
So the information is there, but yes, one could add ALL the other glosses that are linked to as well for each concept, but it was not considered important when creating the file, as our primary interest is the reflection of the key word form we look at here.
And note that the repetition of ['muscle', 'muscle', 'muscle'] is important, since there are cases where a language has different words here, which we manually judged to be synonyms when creating the database.
Ah, I see. Ok, will use this structure to compute a useful representation for NoRaRe. I think in particular in the NoRaRe case, it is important to have meaningful individual values for variables. So extracting just a list of numbers here (e.g. ENGLISH_SCORE = [0.45, 0.76, 0.34]) would seem nonsensical. Maybe
{"arm": 1.5, "bone": 3.4}
is a bit too terse, but something like
{"arm": [1.5, "Vulic-2020-2244-306"], ...}
could be reasonable.
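A minimal sketch of how such a value could be assembled from the parallel lists, assuming the glosses, scores and links are aligned by position. The function name is hypothetical, and the scores are illustrative (the value for "tongue" is invented to complete the example).

```python
import json


def aggregate_scores(glosses, scores, links, list_id):
    """Map each gloss to [score, ID of the linked entry]."""
    return {
        gloss: [float(score), '{}-{}'.format(list_id, link)]
        for gloss, score, link in zip(glosses, scores, links)
    }


aggregated = aggregate_scores(
    glosses=['arm', 'tongue', 'bone'],
    scores=[1.5, 2.0, 3.4],  # illustrative values only
    links=['306', '62', '172'],
    list_id='Vulic-2020-2244',
)
print(json.dumps(aggregated))
```

Keeping the entry ID next to the score means each value stays interpretable on its own, without a lookup in a second column.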
Yes, I see, and I agree that a better representation is surely needed, given the confusion it causes and the headache it gives me every time I try to re-understand the data we created there...
E.g. for Scheible 2014 it doesn't make a lot of sense to list the three columns IDS_IN_SOURCE, RELATION_TYPE, SCORES as three variables, when in reality they must be composed properly to convey any meaning. So maybe we should have a way to synthesize variables (with the help of a bit of Python code) from such lists.
For anything in concept_set_meta we already have a place for such code. Maybe we could have something similar for conceptlists from Concepticon. E.g. have a module concept_set_meta/Scheible-2014-1755/__init__.py, which is imported when creating the CLDF data. This module could have a function per variable specified in norare.tsv, e.g.
In the Scheible 2014 case, values of SCORED_RELATIONS might be dicts like
computed from the "raw" column values
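The "one function per variable" idea could be sketched roughly as follows. This is only an illustration under stated assumptions: `synthesize`, the `SimpleNamespace` stand-in for the imported per-dataset module, and the sample row are all hypothetical, and the real variable names would come from norare.tsv.

```python
import collections
import types


def scored_relations(row):
    """Compose the three raw list columns into {word: {relation: score}}."""
    result = collections.defaultdict(dict)
    for name, reltype, score in zip(
            row['IDS_IN_SOURCE'], row['RELATION_TYPE'], row['SCORES']):
        result[name.split(':')[1]][reltype] = float(score)
    return dict(result)


# Stand-in for the per-dataset module: one attribute per variable name.
dataset_module = types.SimpleNamespace(SCORED_RELATIONS=scored_relations)


def synthesize(row, variables, module):
    """For each variable, call the same-named function in `module`, if any."""
    for name in variables:
        func = getattr(module, name, None)
        if callable(func):
            row[name] = func(row)
    return row


# Made-up row; the "number:word" ID format is an assumption.
row = {
    'IDS_IN_SOURCE': ['1:abziehen', '2:abziehen'],
    'RELATION_TYPE': ['ANT', 'SYN'],
    'SCORES': ['5.5', '0.8333'],
}
row = synthesize(row, ['SCORED_RELATIONS', 'UNDEFINED_VARIABLE'], dataset_module)
print(row['SCORED_RELATIONS'])
```

Variables without a matching function are simply left alone, so plain columns would keep working unchanged.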