PhonologicalCorpusTools / CorpusTools

Phonological CorpusTools
http://phonologicalcorpustools.github.io/CorpusTools/
GNU General Public License v3.0

Neighbourhood density with text file #770

Closed by kchall 2 years ago

kchall commented 3 years ago

What should happen:

  1. Have a .txt file with individual words that exist in a corpus, with the file being either in spelling or transcription.
  2. Tell PCT to calculate ND from the wordlist, with the calculations based on either the spelling tier or the transcription tier (does not have to match the .txt file).
  3. PCT uses the .txt file simply as an index to 'look up' the words and then does the calculations based on whichever option (orthography or transcription) is specified.
  4. The results window should list BOTH the string type of the .txt file and the tier on which the scores were calculated.
  5. If the list has a word that isn't in the corpus, PCT should show a warning and skip that word in the calculation.
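
The intended behaviour in the five points above could be sketched roughly as follows. This is only an illustration under assumptions, not PCT's actual code: the toy corpus, the dot-delimited transcription format in the .txt file, and the names `nd_from_wordlist`, `file_type` and `tier` are all hypothetical.

```python
import warnings

# Hypothetical toy corpus (not PCT's example corpus): each entry has a
# spelling (str) and a transcription (tuple of segments).
CORPUS = [
    {"spelling": "ta", "transcription": ("t", "ɑ")},
    {"spelling": "da", "transcription": ("d", "ɑ")},
    {"spelling": "tusa", "transcription": ("t", "u", "s", "a")},
    {"spelling": "tusha", "transcription": ("ʃ", "u", "ʃ", "ɑ")},
]


def edit_distance(a, b):
    """Levenshtein distance over any two sequences (str or tuple of segments)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]


def nd_from_wordlist(words, corpus, file_type, tier):
    """Look words up by `file_type`, then count neighbours on `tier`.

    `file_type` is the string type of the .txt file and `tier` is the tier
    used for the calculation; the two do not have to match.
    """
    index = {entry[file_type]: entry for entry in corpus}
    rows = []
    for word in words:
        # Assumption: transcriptions in the file are dot-delimited, e.g. "t.u.s.a".
        key = tuple(word.split(".")) if file_type == "transcription" else word
        entry = index.get(key)
        if entry is None:
            warnings.warn(f"'{word}' is not in the corpus; skipping it")
            continue
        target = entry[tier]
        density = sum(
            1 for other in corpus
            if other is not entry and edit_distance(target, other[tier]) == 1
        )
        # Report BOTH the file's string type and the tier used.
        rows.append({"word": word, "file_type": file_type, "tier": tier, "nd": density})
    return rows


if __name__ == "__main__":
    # File written in transcription, scores calculated on the spelling tier.
    for row in nd_from_wordlist(["t.u.s.a"], CORPUS, "transcription", "spelling"):
        print(row)
```

The .txt file is used only as an index here: words from the file are never added to the corpus, and a word that cannot be found triggers a warning and is left out of the results.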

Some issues that are currently happening:

  1. PCT doesn't seem to pay attention to the 'tier' specification in the dialogue box at all.
  2. PCT doesn't list the tier and the string type in the results window -- just one of them (which one? I think it's string type, but it doesn't seem to correspond to what it actually calculates).
  3. If a word isn't in the corpus, PCT apparently treats the input .txt file as either transcription or spelling, depending on the string type specification, and then does the calculation based on that tier.
  4. If a word is in the corpus, PCT apparently treats the input .txt file only as spelling -- it will calculate the same results regardless of whether the word's spelling and transcription actually align.

stannam commented 3 years ago

  1. PCT doesn't seem to pay attention to the 'tier' specification in the dialogue box at all.

    • The 'tier' specification counts, but the internal algorithm complicates the issue.
    • In the example corpus, the spelling and transcription options should always return the same numbers, because there is no pair of words that are spelling neighbours but not transcription neighbours (or vice versa).
    • I added a new word 'tusha' /ʃ.u.ʃ.ɑ/ to the example corpus. It seems PCT counts transcription and spelling neighbours differently, as in the red box below.
    • image
    • 'tusha' /ʃ.u.ʃ.ɑ/ has one spelling neighbour 'tusa' /t.u.s.a/. However, they are not transcription neighbours.
    • The blue box needs an explanation: the ND_spell cell for tusa should be 1. The discrepancy comes from the way "neighbour candidates" are generated.
  2. PCT doesn't list the tier and the string type in the results window -- just one of them (which one? I think it's string type, but it doesn't seem to correspond to what it actually calculates).

    • The 'string type' column corresponds to the tier option. tusha has one spelling neighbour and no transcription neighbour (same as the table above).
    • image
    • I think the column name is misleading and it should correspond to the type of the external file (i.e., 'File contains Spelling' or 'File contains Transcription')
    • I'll rename 'String type' to 'Tier' and add an additional column for the file type.
  3. If a word isn't in the corpus, PCT apparently treats the input .txt file as either transcription or spelling, depending on the string type specification, and then does the calculation based on that tier.

  4. If a word is in the corpus, PCT apparently treats the input .txt file only as spelling -- it will calculate the same results regardless of whether the word's spelling and transcription actually align.

    • When the file contains transcriptions, the result differs depending on the setting. However, the 'transcription' result seems a bit off.

image
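
To make the 'tusha' / 'tusa' case concrete, here is a small sketch of a one-edit neighbour check applied to both tiers. The function name `is_neighbour` is hypothetical and this is not how PCT decides neighbourhood internally; it only illustrates the definition (one substitution, insertion or deletion) using the forms quoted above.

```python
def is_neighbour(a, b):
    """True if sequences a and b differ by exactly one substitution,
    insertion or deletion. Works on strings and on tuples of segments."""
    if a == b:
        return False
    if len(a) == len(b):
        # Same length: exactly one position may differ (substitution).
        return sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) != 1:
        return False
    longer, shorter = (a, b) if len(a) > len(b) else (b, a)
    # Lengths differ by one: deleting one element of the longer form
    # must give the shorter form (insertion/deletion).
    return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))


# Spelling tier: deleting 'h' from 'tusha' gives 'tusa'.
spelling_neighbours = is_neighbour("tusha", "tusa")

# Transcription tier: three of the four segments differ.
transcription_neighbours = is_neighbour(("ʃ", "u", "ʃ", "ɑ"), ("t", "u", "s", "a"))

print(spelling_neighbours, transcription_neighbours)
```

Under this definition the pair counts as neighbours on the spelling tier but not on the transcription tier, which is why the two ND columns are expected to differ for these words.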

stannam commented 2 years ago

Still needs to be done

kchall commented 2 years ago

^ as you've noted above! :) A couple of additional comments:

“For all words that are in your file, ND will be calculated based on the words that do exist in the corpus. Note that words in the .txt file will not be added to the corpus, nor does PCT include any of the words in the .txt file itself when calculating the neighbourhood densities of each word.”

One other note: Although all ‘a’-like symbols in the .txt file I used were IPA [ɑ], the results window presents some as [a] and some as [ɑ] in the ‘Word’ column. Is this because if there is a matching word in the corpus, it displays the word’s spelling, and if not, it displays the transcription from the text file? Seems like in this case, it would be clearer to just display all the original elements from the .txt file…

stannam commented 2 years ago

Notes to myself:

kchall commented 2 years ago

Notes to our future selves:

The reason that the string similarity and neighbourhood density algorithms are different when it comes to being able to use the spelling tier (SS can; ND can't at the moment) is as follows (thanks @stannam for the explanation!):

"We decided to remove [the spelling tier] from ND because the current algorithm does not give us correct results for 'spelling' neighbours. For example, a word spelled 'za' cannot be a spelling neighbour of 'ta' within the example corpus. When calculating ND, PCT does not compare all possible pairs in the corpus, but instead takes a shortcut. However, this does not work for the spelling tier. The issue is separate from string similarity as PCT does compare every pair for string similarity.

When ND is calculated for all words (PCT provides this option), the number of computations grows quickly with the number of words in the corpus. A naive approach needs to compare every pair of words in the corpus, so the work grows quadratically rather than linearly. However, most pairs are already expected not to be neighbours. For example, if we were to hand-calculate ND for [ta], we wouldn't consider words like [gaga] or [enuta]. Therefore, instead of considering all cases, PCT generates a list of candidates and loops over this list to check whether each candidate exists in the corpus. For [ta], PCT refers to the inventory chart and generates candidates by changing/adding/deleting one segment in [ta]. The candidates would include [t], [a], [da], [sa], ... but never [gaga] or [enuta].

However, this logic does not apply to spellings, because PCT doesn't have the inventory of all letters, and strangely, it referred to the phoneme inventory in previous versions (see the example of 'tusa' above)."
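
For future reference, the candidate-generation shortcut described in the quote can be sketched as below. This is a minimal illustration, not PCT's implementation: the inventories, the toy spelling corpus, and the names `generate_candidates` and `neighbourhood_density` are assumptions made for the example.

```python
def generate_candidates(word, inventory):
    """All forms one edit away from `word` (a tuple of symbols), built by
    deleting, substituting or inserting one symbol from `inventory`."""
    candidates = set()
    for i in range(len(word)):
        candidates.add(word[:i] + word[i + 1:])                  # deletion
        for symbol in inventory:
            candidates.add(word[:i] + (symbol,) + word[i + 1:])  # substitution
    for i in range(len(word) + 1):
        for symbol in inventory:
            candidates.add(word[:i] + (symbol,) + word[i:])      # insertion
    candidates.discard(word)  # a substitution may reproduce the word itself
    return candidates


def neighbourhood_density(word, corpus_words, inventory):
    """Count generated candidates that actually exist in the corpus."""
    return len(generate_candidates(word, inventory) & corpus_words)


# Hypothetical phoneme inventory; note that it contains no 'z'.
PHONEME_INVENTORY = {"t", "a", "d", "s", "g", "e", "n", "u"}

# Hypothetical spelling tier with two words, 'ta' and 'za'.
SPELLINGS = {tuple("ta"), tuple("za")}

# An inventory of letters collected from the spellings themselves.
LETTER_INVENTORY = {letter for word in SPELLINGS for letter in word}

# With the phoneme inventory, 'za' is never generated as a candidate for 'ta'.
print(neighbourhood_density(tuple("ta"), SPELLINGS, PHONEME_INVENTORY))

# With a letter inventory, 'za' is generated and found in the corpus.
print(neighbourhood_density(tuple("ta"), SPELLINGS, LETTER_INVENTORY))
```

The point of the sketch is that the shortcut is only as good as the inventory it draws from. If spelling neighbours were ever reinstated for ND, one option (an assumption, not a decided design) would be to build a letter inventory from the spelling tier, as `LETTER_INVENTORY` does here.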