calc-project / partial-body-object

Partial colexifications between body and object concepts
MIT License

Analysis of Directionality #1

Closed AnnikaTjuka closed 7 months ago

AnnikaTjuka commented 8 months ago

The first analysis should answer the question: Is there a directional tendency from body to object?

Data: I would use the 100 body-object colexifications that I extracted for my study and stored as lexical features here. Another starting point would be the full list of 134 body concepts and 650 object concepts from Tjuka-2022-784, but I assume that would create a very noisy network.

Directionality: To test the directionality, we could use:
a) the frequencies of affix colexifications for the 100 body-object colexifications from List (2023). Probably not all of the 100 colexifications occur in List-2023-1308, but I think that's okay.
b) the frequencies of directions for the 100 body-object colexifications from DatSemShift (Zalizniak-2024-4583). I found almost all of the 78 most frequent body-object colexifications analysed in my previous study in DatSemShift, so I assume that most of the 100 colexifications would also be matched here.

@LinguList Does this seem reasonable and possible to implement?

LinguList commented 8 months ago

Yes. Very reasonable.

LinguList commented 8 months ago

I imagine it is easy to first derive the networks and then check link frequencies. Networks can be easily derived with pycldf, where you could feed in the concept list from your study (which is in Concepticon, right?). A hand-crafted CSV file containing A / B links (Concepticon ID or gloss) would also suffice. I could share some Python snippets for some toy concept pairs, and you could build on those for further analyses?
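
A minimal sketch of the hand-crafted file option, assuming a tab-separated file with one body-object pair per row. The column names `Body` and `Object`, the toy rows, and the helper names `read_pairs` and `init_graph` are hypothetical and not taken from the study. Both directions of each pair are registered so that link frequencies can later be compared per direction.

```python
import csv
import io

# Hypothetical file content: one body-object pair per row. The column
# names "Body" and "Object" are assumptions, not the study's real headers.
toy_tsv = "Body\tObject\nSKIN\tBARK\nARM\tBRANCH\n"


def read_pairs(handle):
    """Return (body, object) gloss pairs from a tab-separated file handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [(row["Body"], row["Object"]) for row in reader]


def init_graph(pairs):
    """Create a directed graph with a zero count for both directions of each pair."""
    graph = {}
    for body, obj in pairs:
        graph[body, obj] = 0
        graph[obj, body] = 0
    return graph


pairs = read_pairs(io.StringIO(toy_tsv))
graph = init_graph(pairs)
```

For a real file, `io.StringIO(toy_tsv)` would be replaced by `open("yourfile.tsv", newline="", encoding="utf-8")`.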

AnnikaTjuka commented 7 months ago

Python snippets would be great. I have a list with the 78 colexifications analysed in my previous study and can extend it with the 22 less frequent colexifications. So far, only the seed list is included in Concepticon, because I'm still waiting for the proofs and don't know when the study will be published online.

LinguList commented 7 months ago

from pyconcepticon import Concepticon
from csvw import UnicodeDictReader, UnicodeWriter

# toy example: each entry is a [source, target] pair of Concepticon glosses
concepts = [
    ["ARM", "HAND"],
    ["FOOT", "LEG"],
]

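# replace with (the column names "Source" and "Target" are placeholders
# for whatever headers your file uses):
# with UnicodeDictReader("yourfile.tsv", delimiter="\t") as reader:
#     concepts = []
#     for row in reader:
#         concepts += [[row["Source"], row["Target"]]]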

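# directed graph: (source, target) -> count
# add the reversed pair [b, a] to concepts to query the other direction
graph = {}
for a, b in concepts:
    graph[a, b] = 0

# load the concept list with the affix colexifications from Concepticon
con = Concepticon()
cl = con.conceptlists["List-2023-1308"]
# map the concept IDs of the list to their Concepticon glosses
id2cgl = {c.id: c.concepticon_gloss for c in cl.concepts.values()}
for c in cl.concepts.values():
    gloss_a = c.concepticon_gloss
    for itm in c.attributes["target_concepts"]:
        gloss_b = id2cgl[itm["ID"]]
        # keep only the links that were requested above
        if (gloss_a, gloss_b) in graph:
            graph[gloss_a, gloss_b] = itm["AffixFams"]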
# write the counts to a tab-separated file
table = [["Source", "Target", "Count"]] + [[a, b, c] for (a, b), c in graph.items()]
with UnicodeWriter("out.tsv", delimiter="\t") as writer:
    for row in table:
        writer.writerow(row)

LinguList commented 7 months ago

For DatSemShift it is similar; only the attribute name in the network part is different.

LinguList commented 7 months ago

There, use PolysemyByFamily instead of AffixFams.
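
One way to reuse the same loop for both lists is to pass the count attribute as a parameter. The sketch below is an illustration under assumptions: the helper name `collect_counts` is hypothetical, the toy records are invented stand-ins for Concepticon concept objects, and it is assumed (not confirmed here) that the links in Zalizniak-2024-4583 are also stored under `target_concepts`.

```python
from types import SimpleNamespace


def collect_counts(concepts, graph, count_key, links_key="target_concepts"):
    """Fill graph[(source, target)] with the value stored under count_key.

    links_key is the attribute that holds the outgoing links; that it has
    the same name in both concept lists is an assumption.
    """
    id2gloss = {c.id: c.concepticon_gloss for c in concepts}
    for c in concepts:
        for itm in c.attributes.get(links_key, []):
            gloss_b = id2gloss.get(itm["ID"])
            if (c.concepticon_gloss, gloss_b) in graph:
                graph[c.concepticon_gloss, gloss_b] = itm[count_key]
    return graph


# Invented stand-ins that mimic the fields used in the script above.
toy = [
    SimpleNamespace(
        id="x-1",
        concepticon_gloss="SKIN",
        attributes={"target_concepts": [{"ID": "x-2", "PolysemyByFamily": 4}]},
    ),
    SimpleNamespace(
        id="x-2",
        concepticon_gloss="BARK",
        attributes={"target_concepts": []},
    ),
]

graph = collect_counts(
    toy, {("SKIN", "BARK"): 0, ("BARK", "SKIN"): 0}, "PolysemyByFamily"
)
```

With real data, the first argument would presumably be `list(cl.concepts.values())` for the respective concept list, with `"AffixFams"` or `"PolysemyByFamily"` as `count_key`.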

AnnikaTjuka commented 7 months ago

Thanks for the script, @LinguList! I extracted the relevant data from List-2023-1308. I used languages instead of families and compared the number of AffixLngs for both directions (SKIN-BARK and BARK-SKIN). Data are available for 40 of the 100 colexifications. Interestingly, in only half of the cases do the majority of the affix colexifications contain the body-part concept. Thus, the direction is BODY->OBJECT in 20 out of 40 colexifications.
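
The comparison of both directions can be sketched roughly as follows. This is not the script that was actually used: the function name `classify_directions` and the toy counts are invented, and it is an assumption that a count keyed by (body, object) stands for the BODY->OBJECT direction. Pairs without data in either direction are skipped, which corresponds to the colexifications for which no data are available.

```python
from collections import Counter


def classify_directions(pairs, counts):
    """Compare the counts of the two directions for each (body, object) pair."""
    result = {}
    for body, obj in pairs:
        forward = counts.get((body, obj), 0)
        backward = counts.get((obj, body), 0)
        if forward == 0 and backward == 0:
            continue  # no data for this pair
        if forward > backward:
            result[body, obj] = "BODY->OBJECT"
        elif backward > forward:
            result[body, obj] = "OBJECT->BODY"
        else:
            result[body, obj] = "TIE"
    return result


# Invented toy counts, for illustration only.
pairs = [("SKIN", "BARK"), ("ARM", "BRANCH"), ("EYE", "HOLE")]
counts = {("SKIN", "BARK"): 5, ("BARK", "SKIN"): 2, ("BRANCH", "ARM"): 3}

directions = classify_directions(pairs, counts)
tally = Counter(directions.values())
```

The `tally` then gives the number of pairs per direction, from which a summary such as "20 out of 40" can be read off.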

I tried to do the same analysis with the DatSemShift data, but there is very little overlap with my 100 body-object colexifications. When I searched for the body-object colexifications by hand, I was able to use a more coarse-grained approach because the glosses in DatSemShift are not standardized. For example, we mapped "skin (of a person)" to SKIN (HUMAN), which is correct, but this mapping means that no cases of the SKIN-BARK colexification were found. Therefore, I'd refrain from including the comparison with DatSemShift, because it would mean either going through the mappings again or taking a more coarse-grained approach in the selection process.

LinguList commented 7 months ago

That is a problem, yes. We have a similar problem in Glottolog when a dialect, like Beijing Chinese, stands against Mandarin. You remember that we also handled this explicitly in CLICS-4, even going as far as listing OR-concepts as colexifications.

LinguList commented 7 months ago

But it is better to take only what has exact mappings and not do this unless one has a principled procedure. One could, maybe, refer to DatSemShift in the intro, showing an example with numbers, with a footnote pointing to the sparsity of the data?

AnnikaTjuka commented 7 months ago

Yes, that's a general issue and I didn't want to come up with a half-baked solution. I think it's a good idea to give an example and point out the scarcity of data in the intro.