digling / intelligibility


What is the workflow for learning? #3

Closed LinguList closed 7 months ago

LinguList commented 11 months ago

@justalingwist, I think we can quickly address the data problem now, although I do not have time for it before next week. But I would like to ask you now to please specify how you define the learning / classification problem here. The question is: what is the vector that you input, and what is the output? What is it that you predict in the end? Can you describe this in human-readable terms here, ideally with examples?

justalingwist commented 11 months ago

@LinguList

I hope this makes it a bit more clear:

Linguistic background: Building on previous research, the hypothesis for this study is that being able to speak Dutch allows speakers to also understand German, while being able to speak German allows understanding Dutch only to a lesser extent (Gooskens et al. 2015, Ház 2005).

This study: In the Linear Discriminative Learning (LDL) framework, speaking and understanding languages are described as word production and word comprehension. Word comprehension is understood as a mapping of phonology onto semantics, while word production is understood as a mapping of semantics onto a phonological form. With regard to the hypothesis described above, an LDL model trained on Dutch word forms should be able to predict (= comprehend and produce) German word forms with reasonably high accuracy, while a model trained on German and tested on Dutch should result in lower classification accuracies.

We are using LDL as implemented in the JudiLing package for the Julia environment.

Input: For its predictions, LDL needs a matrix of phonological form vectors (the C matrix) and a matrix of distributional meaning vectors (the S matrix) to represent semantics. The phonological form vectors are calculated automatically by LDL from a provided dataset, usually in .csv format; any transcription or syllabification needs to be done in the .csv file before feeding it into the model. Distributional representations are fed into the model as a matrix (I usually create the matrix in R and save it as an .rda file).
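For concreteness, this is roughly how such an .rda matrix can be pulled into Julia via RCall (a sketch; the file name and the R object name `meaning_matrix` are placeholders):

```julia
using RCall

# `load` restores the objects stored in the .rda file into R's global
# environment; @rget then copies one of them over into Julia.
R"load('meanings.rda')"
@rget meaning_matrix       # arrives as a Julia Matrix{Float64}
S = meaning_matrix         # rows = word forms, columns = semantic dimensions
```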

Output: The output of the model is a classification accuracy, based on the correlation of the gold-standard vectors (the C or S matrix) with the predicted vectors (Chat for the phonological matrix, Shat for the semantic matrix), plus a .csv list of correctly and incorrectly comprehended/produced forms that allows an error analysis.
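To make the evaluation concrete, here is a minimal sketch of the idea in plain Julia (a simplified stand-in for JudiLing's built-in evaluation, not its actual code): a word counts as correctly comprehended if its predicted semantic vector correlates more strongly with its own gold-standard vector than with any other word's vector.

```julia
using Statistics

# Sketch: proportion of rows whose predicted vector (in Shat) has its
# highest correlation with the matching gold-standard row (in S).
function correlation_accuracy(Shat::AbstractMatrix, S::AbstractMatrix)
    n = size(S, 1)
    hits = 0
    for i in 1:n
        r = [cor(Shat[i, :], S[j, :]) for j in 1:n]
        hits += argmax(r) == i
    end
    return hits / n
end
```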

References: Gooskens, C., van Bezooijen, R. & van Heuven, V. (2015). Mutual intelligibility of Dutch-German cognates by children: The devil is in the detail. Linguistics, 53(2), 255-283. https://doi.org/10.1515/ling-2015-0002

Ház, E. (2005). Deutsche und Niederländer: Untersuchungen zur Möglichkeit einer unmittelbaren Verständigung (Philologia 68). Hamburg: Dr. Kovač.

justalingwist commented 11 months ago

The workflow then is:

  1. Create .csv datasets with the word forms the model will be trained and tested on. If necessary, transcribe and syllabify the word forms.
  2. Find distributional representations of meanings for these word forms. Since we are doing cross-language modeling, multilingual representations would be ideal. For that, I usually merge the word form list with the distributional representations of meanings to get a .csv that includes the forms and their respective vectors.
  3. Save the distributional meanings in a matrix format that can be used in LDL (I usually do that in R and save the matrix as an .rda file that I can access via RCall in Julia).
  4. Load the .csv file with the word forms in JudiLing, and load the matrix of distributional meaning representations in JudiLing.
  5. Run the model (see the sketch below).
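Putting these steps together, a minimal end-to-end sketch might look as follows (hedged: the file names, column names, and the R object name are placeholders, and the JudiLing keyword arguments follow the package documentation as I remember it, so please double-check against https://github.com/MegamindHenry/JudiLing.jl):

```julia
using JudiLing, CSV, DataFrames, RCall

# Steps 1 + 4: load the prepared word-form dataset
# (hypothetical file; column :Syllables holds syllabified forms like "fu-ße")
data = DataFrame(CSV.File("dutch_forms.csv"))

# Build the C matrix; with tokenized = true and sep_token = "-",
# the chunks of analysis are the pre-coded syllables
cue_obj = JudiLing.make_cue_matrix(
    data,
    grams = 1,
    target_col = :Syllables,
    tokenized = true,
    sep_token = "-",
)

# Steps 2-4: load the distributional meaning matrix saved from R
R"load('meanings.rda')"   # assume the file contains a matrix `meaning_matrix`
@rget meaning_matrix
S = meaning_matrix

# Step 5: estimate the comprehension (F) and production (G) mappings
F = JudiLing.make_transform_matrix(cue_obj.C, S)
G = JudiLing.make_transform_matrix(S, cue_obj.C)
Shat = cue_obj.C * F
Chat = S * G

println("comprehension accuracy: ", JudiLing.eval_SC(Shat, S))
println("production accuracy:    ", JudiLing.eval_SC(Chat, cue_obj.C))
```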
LinguList commented 11 months ago

So you get a word like "Fuß" and internally, this is then represented as a vector, based on n-grams, and from this vector, you predict another vector, representing semantics in a vector space model?

So internally, you have something like [1, 0, -0.5] and predict [0.2, 0.3, 0.4], right?

Then my question is: how are the vectors for phonology constructed, and can we circumvent their automatic creation?

And how is the classification from one vector to another carried out internally? Is the prediction done for each individual vector and trained accordingly, that is, [a, b, c] -> A, [a, b, c] -> B, [a, b, c] -> C? (This question is out of interest.)

LinguList commented 11 months ago

If you predict the meaning of a word like Fuß in Dutch, this means you need COGNATES underlyingly. So my suggestion to check to what degree cognates predict each other is a useful complement to the experiment. This is nice to see.

justalingwist commented 11 months ago

I definitely agree that we should have a look at the cognates, and when writing up the paper I'd even do that BEFORE presenting the model, since I think this gives us important background information that we need to interpret the model results.

The phonological form vector basically marks the presence or absence of a given chunk of analysis (n-grams, syllables) within each word form of the dataset we feed in. So the rows of the C matrix are the word forms, and the columns of the C matrix represent all possible chunks of analysis for the underlying dataset:

```
      f   u   ß   h
Fuß   1   1   1   0
Huf   1   1   0   1
```

We cannot work without it, BUT we are free to decide on the chunk size (n-grams, phones, syllables). My suggestion would be to go for syllables here, since syllabification is already coded in the data and the syllable is a phonologically and cognitively valid unit.
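As a toy illustration of this presence/absence coding in plain Julia (single phones as chunks, no word-boundary symbols, so a simplification of what LDL actually builds):

```julia
# Binary cue matrix for the two example forms above:
# rows = word forms, columns = chunks of analysis
words = lowercase.(["Fuß", "Huf"])
cues  = ["f", "u", "ß", "h"]
C = [occursin(c, w) ? 1 : 0 for w in words, c in cues]
# C == [1 1 1 0;    # Fuß
#       1 1 0 1]    # Huf
```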

The mapping is done with multivariate multiple regression on the C matrix (phonological forms) and the S matrix (distributional meanings). A comprehension weight matrix F is obtained by solving:

S = C * F

And a production weight matrix is obtained by solving:

C = S * G

We then use F and G to get the predicted semantic matrix (Shat) and the predicted form matrix (Chat): Shat = C * F and Chat = S * G. Each value in Shat and Chat is then a linear combination of the values in the corresponding C and S vectors.
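As a bare linear-algebra sketch of these two mappings in plain Julia (toy numbers; JudiLing solves the same equations with its own numerical routines):

```julia
using LinearAlgebra

# Toy data: 2 words, 4 form chunks, 3 semantic dimensions
C = [1.0 1.0 1.0 0.0;
     1.0 1.0 0.0 1.0]     # form matrix (Fuß, Huf)
S = [0.2 0.3 0.4;
     0.1 0.5 0.2]         # hypothetical semantic vectors

F = C \ S     # least-squares solution of S = C * F (comprehension)
G = S \ C     # least-squares solution of C = S * G (production)

Shat = C * F  # predicted semantic matrix
Chat = S * G  # predicted form matrix
```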

I hope I’m making sense?

If you’re interested in the actual code used in JudiLing I’d refer you to the package: https://github.com/MegamindHenry/JudiLing.jl

LinguList commented 11 months ago

Can't we rather create the vectors ourselves, for the phonetic representations? I ask because I think I can provide much more interesting representations with LingPy.

LinguList commented 11 months ago

Regarding the procedure, I have now understood. The equation is less important than the information that it is multivariate multiple regression, for which I found a definition right away. So we predict one vector from another vector.

justalingwist commented 11 months ago

I have never tried it, but I guess as long as we have a form matrix to work with, it should be possible. I'll inform myself about this, and then we can give it a try and maybe even compare it to the LDL-internal one.