
LoanDB / ronataswestoldturkic

CLDF dataset derived from 'West Old Turkic' by András Róna-Tas and Árpád Berta from 2011

https://www.harrassowitz-verlag.de/title_4002.ahtml

Creative Commons Attribution 4.0 International


col Frequency #9

Closed martino-vic closed 2 years ago

martino-vic commented 2 years ago

I'm trying to add a column to forms.csv that counts how often each prosodic structure occurs in the "ProsodicStructure" column of forms.csv, and I can't figure out how to add this to the lexibank script. It's similar to this issue: I can't add the info from within the loop because the number of occurrences can only be counted after the loop has ended. But somehow I can't get a second loop working at the bottom of the script where I would insert this info. Is there some kind of workaround for this, @LinguList?

LinguList commented 2 years ago

I need more context: what format should the extra column in forms.csv have? Can you provide a minimal example with the CSV header and one line that shows what you want to count and how?

martino-vic commented 2 years ago

Ah, yes of course, sorry:

The rows of the column should contain integers that indicate how often each prosodic structure occurs in the entire column. It should look something like this for example:

Segments      ProsodicStructure          Frequency
k i k i        CVCV                         2
b u b a        CVCV                         2
w u g          CVC                          1

meaning that "CVCV" occurs 2 times in our data and "CVC" 1 time.
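The table above can be reproduced with a two-pass approach: count first, then annotate. The following is only a minimal sketch, assuming the rows are available as plain dictionaries; the `rows` list is illustrative data taken from the example, not the actual lexibank data structure.

```python
from collections import Counter

# Illustrative rows mirroring the example table (not the real dataset).
rows = [
    {"Segments": "k i k i", "ProsodicStructure": "CVCV"},
    {"Segments": "b u b a", "ProsodicStructure": "CVCV"},
    {"Segments": "w u g", "ProsodicStructure": "CVC"},
]

# First pass: count how often each prosodic structure occurs overall.
counts = Counter(row["ProsodicStructure"] for row in rows)

# Second pass: attach the total count to every row.
for row in rows:
    row["Frequency"] = counts[row["ProsodicStructure"]]
```

The counts are only complete once every row has been seen, which is why the annotation has to happen in a second pass.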

martino-vic commented 2 years ago

Some background on why I need this: I want to know which prosodic structures are documented (i.e. certainly allowed) in the recipient language, so that when predicting loanword adaptation there's an option to filter out words with an undocumented prosodic structure. The catch is that the data sometimes contain atypical structures that occur in only a few words but are otherwise not allowed. If we know their frequencies, it's possible to inspect the data manually and decide how frequent a structure has to be to count as part of the recipient language's inventory of prosodic structures. I hope this explanation makes some sense.
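The filtering idea described here could look roughly like the sketch below. The function name `documented_structures` and the `min_frequency` parameter are hypothetical names chosen for illustration; the actual cutoff would be decided after inspecting the data manually.

```python
from collections import Counter


def documented_structures(structures, min_frequency=2):
    """Return the set of structures that occur at least min_frequency times.

    Hypothetical helper: the threshold is an assumption, to be chosen
    after looking at the frequencies by hand.
    """
    counts = Counter(structures)
    return {s for s, n in counts.items() if n >= min_frequency}


# Illustrative data from the example above.
inventory = documented_structures(["CVCV", "CVCV", "CVC"], min_frequency=2)
```

With the example data and a threshold of 2, only "CVCV" would be treated as part of the inventory, and a predicted form with the structure "CVC" would be filtered out.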

LinguList commented 2 years ago

I see your point. This requires a tweak that I am not really happy to add there: you would have to segment the data outside of the loop that adds the forms (args.writer.add_form...) in order to get these numbers. In my opinion, although this is something one could already count here, it can be done perfectly well from within loanpy. LingPy also counts all segments and checks how often they occur (with a defaultdict, not even a Counter) and uses this to smooth the data later on (words occurring twice do not contribute to the correspondence patterns). So my suggestion: do it in loanpy.
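Counting downstream, after forms.csv has been written, might look like the following. This is a sketch under stated assumptions, not loanpy's or LingPy's actual code: the helper `add_frequency_column` is hypothetical, the input is assumed to be comma-separated CSV text with a "ProsodicStructure" column, and a plain defaultdict is used for counting as mentioned above.

```python
import csv
import io
from collections import defaultdict


def add_frequency_column(csv_text, column="ProsodicStructure"):
    """Return csv_text with an added 'Frequency' column (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)

    # Count occurrences of each value with a plain defaultdict.
    counts = defaultdict(int)
    for row in rows:
        counts[row[column]] += 1

    # Write the rows back out with the extra column appended.
    out = io.StringIO()
    fieldnames = list(reader.fieldnames or []) + ["Frequency"]
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        row["Frequency"] = counts[row[column]]
        writer.writerow(row)
    return out.getvalue()


# Illustrative input mirroring the example table.
sample = "Segments,ProsodicStructure\nk i k i,CVCV\nb u b a,CVCV\nw u g,CVC\n"
result = add_frequency_column(sample)
```

Keeping this step outside the lexibank script leaves the dataset conversion untouched and lets the consumer decide how to use the frequencies.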

martino-vic commented 2 years ago

Ah I see, yes that does make sense