Neighbourhood density with text file

kchall commented 3 years ago

What should happen:

Have a .txt file with individual words that exist in a corpus, with the file being either in spelling or transcription.
Tell PCT to calculate ND from the wordlist, with the calculations based on either the spelling tier or the transcription tier (does not have to match the .txt file).
PCT uses the .txt file simply as an index to 'look up' the words and then does the calculations based on whichever option (orthography or transcription) is specified.
The results window should list BOTH the string type of the .txt file and the tier on which the scores were calculated.
If the list has a word that isn't in the corpus, PCT should show a warning and skip that word in the calculation.
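
A minimal sketch of the intended lookup behaviour, not PCT's actual code. The corpus structure (a list of dicts), the function name `lookup_wordlist`, and the use of `.` as the segment delimiter in the .txt file are all assumptions made for illustration.

```python
import warnings

def lookup_wordlist(lines, corpus, file_type, tier):
    """Use the .txt file only as an index into the corpus.

    lines     -- iterable of raw lines from the .txt file
    corpus    -- hypothetical list of dicts with 'spelling' (str) and
                 'transcription' (tuple of segments)
    file_type -- what the .txt file contains: 'spelling' or 'transcription'
    tier      -- the tier to calculate ND on; need not match file_type
    """
    found, missing = [], []
    for line in lines:
        query = line.strip()
        if not query:
            continue
        # Assumption: transcriptions in the file are dot-delimited, e.g. t.u.s.a
        key = tuple(query.split('.')) if file_type == 'transcription' else query
        match = next((w for w in corpus if w[file_type] == key), None)
        if match is None:
            missing.append(query)              # skipped in the calculation
        else:
            found.append((query, match[tier]))  # form to score on the chosen tier
    if missing:
        warnings.warn(f"{len(missing)} word(s) not in the corpus were skipped: {missing}")
    return found, missing
```

With this shape, the results window can report `file_type` and `tier` separately, since both are explicit arguments.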

Some issues that are currently happening:

PCT doesn't seem to pay attention to the 'tier' specification in the dialogue box at all.
PCT doesn't list the tier and the string type in the results window -- just one of them (which one? I think it's string type, but it doesn't seem to correspond to what it actually calculates).
If a word isn't in the corpus, PCT apparently treats the input .txt file as either transcription or spelling, depending on the string type specification, and then does the calculation based on that tier.
If a word is in the corpus, PCT apparently treats the input .txt file only as spelling -- it will calculate the same results regardless of whether the word's spelling and transcription actually align.

stannam commented 3 years ago

PCT doesn't seem to pay attention to the 'tier' specification in the dialogue box at all.
- The 'tier' specification counts, but the internal algorithm complicates the issue.
- In the example corpus, the spelling and transcription options should always return the same numbers, because there is no word pair which is spelling neighbours but not transcription neighbours (or vice versa).
- I added a new word 'tusha' /ʃ.u.ʃ.ɑ/ to the example corpus. It seems PCT counts transcription and spelling neighbours differently, as in the red box below.
- 'tusha' /ʃ.u.ʃ.ɑ/ has one spelling neighbour 'tusa' /t.u.s.a/. However, they are not transcription neighbours.
- The blue box needs an explanation: the ND_spell cell for tusa should be 1. That is because of the way "neighbour candidates" are generated.
PCT doesn't list the tier and the string type in the results window -- just one of them (which one? I think it's string type, but it doesn't seem to correspond to what it actually calculates).
- The 'string type' column corresponds to the tier option. tusha has one spelling neighbour and no transcription neighbour (same as the table above).
- I think the column name is misleading and it should correspond to the type of the external file (i.e., 'File contains Spelling' or 'File contains Transcription')
- I'll rename 'String type' into 'Tier', and add additional column for the file type.
If a word isn't in the corpus, PCT apparently treats the input .txt file as either transcription or spelling, depending on the string type specification, and then does the calculation based on that tier.
If a word is in the corpus, PCT apparently treats the input .txt file only as spelling -- it will calculate the same results regardless of whether the word's spelling and transcription actually align.
- When the file contains transcriptions, the result differs by the setting. However, the 'transcription' result seems a bit off.

What is missing is calculating spelling ND from a transcription .txt file. (e.g., for 'tusha' /ʃ.u.ʃ.ɑ/, PCT should be able to get the spelling ND with the file depicted above.)
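
The 'tusha'/'tusa' case can be checked by hand with a plain edit distance. This is only an illustrative sketch: it assumes that two words are neighbours when they are exactly one edit apart, and it uses a generic Levenshtein function rather than PCT's own implementation.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or tuples of segments)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (x != y)))    # substitution or match
        prev = curr
    return prev[-1]

tusha = {'spelling': 'tusha', 'transcription': ('ʃ', 'u', 'ʃ', 'ɑ')}
tusa = {'spelling': 'tusa', 'transcription': ('t', 'u', 's', 'a')}

spelling_dist = edit_distance(tusha['spelling'], tusa['spelling'])
transcription_dist = edit_distance(tusha['transcription'], tusa['transcription'])
print(spelling_dist, transcription_dist)  # 1 3
```

The spellings differ by one deleted letter, so the pair counts as spelling neighbours, while the transcriptions share only the segment [u] and are three substitutions apart.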

stannam commented 2 years ago

Still needs to be done

[x] Make sure other algorithms (i.e., 'phonological edit distance,' etc.), similarity threshold, and the freq filter work as expected.
- NB: Keep Khorsi for now.
[x] 'Output list of neighbors to a file' option needs to be implemented.
- [x] Message box that lets the user know the neighbour list is exported. And while on this, add the same to:
  - [x] minimal pair export (FL),
  - [x] contexts export (MI),
  - [x] 'save to file' (all results window), and
  - [x] 'export feature...' (under the 'file' menu).
[x] In a .txt file with spellings, "N/A" for words not in the corpus.
[x] Warning message
- The message should be more informative.
- Detailed message should be exported to the ERRORS folder.

kchall commented 2 years ago

^ as you've noted above! :) A couple of additional comments:

If the .txt file contains spelling, then the warning message should basically be like the current string similarity warning message, and PCT should return "N/A" for words not in the corpus.
If the .txt file contains transcription, then the warning message can say “X words are not in the corpus. For details, please refer to file neigh_dens_error.txt in the ERRORS directory or click on “Show Details” below.

For all words that are in your file, ND will be calculated based on the words that do exist in the corpus. Note that words in the .txt file will not be added to the corpus, nor does PCT include any of the words in the .txt file itself when calculating the neighbourhood densities of each word.”
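
A rough sketch of how the short warning and the detailed error file could be produced together. The function name `report_missing_words` and its parameters are hypothetical; only the file name neigh_dens_error.txt and the message wording come from the comment above.

```python
from pathlib import Path

def report_missing_words(missing, errors_dir, filename="neigh_dens_error.txt"):
    """Write the words that are not in the corpus to a details file and
    return the short message for the warning box (None if nothing is missing)."""
    if not missing:
        return None
    errors_dir = Path(errors_dir)
    errors_dir.mkdir(parents=True, exist_ok=True)
    # One missing word per line, so the file is easy to scan or re-import.
    (errors_dir / filename).write_text("\n".join(missing) + "\n", encoding="utf-8")
    return (f"{len(missing)} words are not in the corpus. For details, please refer "
            f"to file {filename} in the ERRORS directory or click on "
            f"“Show Details” below.")
```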

One other note: Although all ‘a’-like symbols in the .txt file I used were IPA [ɑ], the results window presents some as [a] and some as [ɑ] in the ‘Word’ column. Is this because if there is a matching word in the corpus, it displays the word’s spelling, and if not, it displays the transcription from the text file? Seems like in this case, it would be clearer to just display all the original elements from the .txt file…

stannam commented 2 years ago

Notes to myself:

Tried to calculate ND on iPHOD and it turned out PCT considers homophones as neighbours (e.g., the neighbours of 'cat' include 'Kat').
- Including 'Kat' is not expected (for more see #785)
- Apart from that, everything seems correct.
~~Do we need Khorsi?~~ Keep Khorsi for now.
Done: different messages depending on the .txt contents (spellings or transcriptions)
To do: display original elements from the .txt file in the ND result window.
- ~~Q: what if in the regular use case, i.e., without non-words in .txt? Will it be more intuitive to show the spelling, as all words are in the corpus?~~ ND currently has the option to export spellings.

kchall commented 2 years ago

Notes to our future selves:

The reason that the string similarity and neighbourhood density algorithms are different when it comes to being able to use the spelling tier (SS can; ND can't at the moment) is as follows (thanks @stannam for the explanation!):

"We decided to remove [the spelling tier] from ND because the current algorithm does not give us correct results for 'spelling' neighbours. For example, a word spelled 'za' cannot be identified as a spelling neighbour of 'ta' within the example corpus. When calculating ND, PCT does not compare all possible pairs in the corpus, but instead takes a shortcut. However, this does not work for the spelling tier. The issue is separate from string similarity as PCT does compare every pair for string similarity.

When ND is calculated for all words (PCT provides this option), the number of comparisons grows quadratically with the number of words in the corpus. A naive approach has to compare every pair of words in the corpus. However, most pairs are already expected not to be neighbours. For example, if we were to hand-calculate ND for [ta], we wouldn't consider words like [gaga] or [enuta]. Therefore, instead of considering all cases, PCT generates a list of candidates and loops over this list to check whether each candidate exists in the corpus. For [ta], PCT refers to the inventory chart and generates candidates by changing/adding/deleting one segment in [ta]. The candidates would include [t], [a], [da], [sa], ... but never [gaga] or [enuta].

However, this logic does not apply to spellings, because PCT doesn't have the inventory of all letters, and strangely, it referred to the phoneme inventory in previous versions (see the example of 'tusa' above)."
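
The candidate-generation shortcut described above can be sketched as follows. This is an illustration under stated assumptions, not PCT's code: the function names, the tiny inventories, and the toy corpora are all made up, and the phoneme inventory is assumed not to contain 'z'.

```python
def generate_candidates(word, inventory):
    """All sequences one substitution, deletion, or insertion away from `word`."""
    word = tuple(word)
    candidates = set()
    for i in range(len(word)):
        candidates.add(word[:i] + word[i + 1:])                # deletion
        for seg in inventory:
            candidates.add(word[:i] + (seg,) + word[i + 1:])   # substitution
    for i in range(len(word) + 1):
        for seg in inventory:
            candidates.add(word[:i] + (seg,) + word[i:])       # insertion
    candidates.discard(word)  # a word is not its own neighbour
    return candidates

def neighbours(word, corpus_forms, inventory):
    """Keep only the candidates that actually exist in the corpus."""
    return generate_candidates(word, inventory) & set(corpus_forms)

# Hypothetical phoneme inventory and transcription tier.
phonemes = {'t', 'a', 'd', 's'}
transcriptions = {('t', 'a'), ('d', 'a'), ('s', 'a'), ('g', 'a', 'g', 'a')}
trans_neighbours = neighbours(('t', 'a'), transcriptions, phonemes)

# Hypothetical spelling tier: 'za' is one letter away from 'ta'.
spellings = {tuple('ta'), tuple('za')}
with_phonemes = neighbours('ta', spellings, phonemes)        # 'z' is never proposed
with_letters = neighbours('ta', spellings, {'t', 'a', 'z'})  # needs a letter inventory
```

Because candidates are built only from symbols in the inventory, 'za' is never generated from 'ta' when the phoneme inventory is used for spellings, which is the failure the quote describes. Supplying an inventory of letters (here `{'t', 'a', 'z'}`) recovers it.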

PhonologicalCorpusTools / CorpusTools

Neighbourhood density with text file #770