[BUG] Homophones in the neighbourhood density calculation

stannam commented 2 years ago

Currently, the neighbourhood density algorithm considers homophones of the query as neighbours even when the 'collapse homophones' option is on. For example, when calculating ND for 'cat' in the iphod corpus, 'Kat' (another entry in the corpus) is considered as a neighbour.

IPhOD (the benchmark) disregards homophones of the search term itself. For example, when searching for the neighbours of 'cat,' 'Kat' is not included in the results.

It seems IPhOD is correct, since homophones are not each other's neighbours....? But I'm just wondering if I'm thinking about this correctly.

stannam commented 2 years ago

No need to return results identical to the IPhOD webpage, but we need to say why in the documentation.

kchall commented 2 years ago

Still having some issues with this:

Use August_lexicon.txt

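Expected results:

| Spelling | Transcription | Frequency | ND_with_homophs | ND_collapse_homophs |
|----------|---------------|-----------|-----------------|---------------------|
| August   | ɔɡʌst         | 1         | 2               | 2                   |
| Aug      | ɔɡʌst         | 2         | 2               | 2                   |
| Auguste  | ɔɡʌst         | 3         | 2               | 2                   |
| August   | ɑɡʌst         | 5         | 4               | 2                   |
| gust     | ɡʌst          | 2         | 5               | 3                   |
| gus      | ɡʌs           | 6         | 1               | 1                   |
| aghast   | əɡæst         | 12        | 0               | 0                   |
| peace    | pis           | 23        | 0               | 0                   |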

Problem #1: If you calculate ND for all words in corpus, without collapsing homophones and export neighbours to .txt file (August_neighbours_no_collapsing_homophones.txt):

The numeric results in the corpus in PCT are correct (i.e., they match "ND_with_homophs" above)!
But only one entry “August” is listed in the output file — the one with 4 neighbours (so the one whose own pronunciation is [ɑɡʌst]). The other “August” ([ɔɡʌst]), which should have 2 neighbours, isn’t in the output file.

Problem #2: If you calculate ND for all words in corpus, this time collapsing homophones and export neighbours to .txt file (August_neighbours_with_collapsing_homophones.txt), PCT throws an error:

Traceback (most recent call last):
  File "/Users/KCH/Desktop/CorpusTools/corpustools/gui/ndgui.py", line 115, in run
    collapse_homophones = kwargs['collapse_homophones']
  File "/Users/KCH/Desktop/CorpusTools/corpustools/neighdens/neighborhood_density.py", line 87, in neighborhood_density_all_words
    collapse_homophones = collapse_homophones)
  File "/Users/KCH/Desktop/CorpusTools/corpustools/neighdens/neighborhood_density.py", line 183, in neighborhood_density
    w_sequence = getattr(w, sequence_type)
TypeError: getattr(): attribute name must be string
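For reference, Python raises this kind of TypeError whenever the second argument to getattr() is not a string. Here is a minimal reproduction, assuming the attribute name arrives as None. FakeWord and the None value are hypothetical stand-ins for illustration, not PCT's actual Word class or call.

```python
class FakeWord:
    """Hypothetical stand-in for a corpus word (not PCT's Word class)."""
    transcription = "ɔɡʌst"


w = FakeWord()
sequence_type = None  # assumption: the attribute name was never filled in

try:
    getattr(w, sequence_type)  # non-string attribute name
    raised = False
except TypeError as err:
    raised = True
    print("TypeError:", err)

# With a string attribute name, the lookup works as intended.
print(getattr(w, "transcription"))
```

The exact wording of the error message differs between Python versions, but the exception type is the same.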

Problem #3: If you have already added the neighbourhood density column to the corpus, and then try to re-do it (i.e., repeat the first one, without collapsing homophones), PCT crashes entirely, even if you have given the ND column a new name.

stannam commented 2 years ago

Note to myself:

find the text files under "Phonological_CorpusTools_Public > Internal_Documentation"
Problem 1: solved
- The problem came from how PCT stores the results temporarily before exporting them as a txt file.
- Previously, the neighbour lists were saved in a python dictionary named 'results.' Inside this dictionary, each corpus word was referenced by its spelling only. So, after calculating the ND of a second instance of 'August,' the result overwrote the existing result for 'August'.
- (cf. results[str(w)] = [getattr(r, output_format) for r in res[1]] in neighborhood_density_all_words() ...where str(w) is the spelling of the search term)
- Now, the key looks like 'spelling [s.p.ɛ.l.i.ŋ]' and every word is in the output file.
- questions: transcription for all words or just homographs? separate column for transcription in the output txt?
Problem 2: solved
- The error came from leaving the parameter 'sequence_type' undefined in neighborhood_density_all_words().
- Now, a value is provided for this parameter and no more crashing!
Problem 3: can't replicate -- pct does not crash with different column names
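
A minimal sketch of the Problem 1 overwrite, assuming a plain dict keyed the way the notes above describe. The tuples and neighbour lists here are illustrative stand-ins (chosen to match the expected counts of 2 and 4 for the two 'August' entries), not PCT's real data structures.

```python
# Illustrative stand-ins: (spelling, transcription, neighbour spellings).
words = [
    ("August", "ɔ.ɡ.ʌ.s.t", ["August", "gust"]),
    ("August", "ɑ.ɡ.ʌ.s.t", ["August", "Aug", "Auguste", "gust"]),
]

# Keyed by spelling only: the second 'August' overwrites the first.
by_spelling = {}
for spelling, transcription, neighbours in words:
    by_spelling[spelling] = neighbours

# Keyed by spelling plus transcription: both entries survive.
by_spelling_and_transcription = {}
for spelling, transcription, neighbours in words:
    key = f"{spelling} [{transcription}]"
    by_spelling_and_transcription[key] = neighbours

print(len(by_spelling))                    # 1
print(len(by_spelling_and_transcription))  # 2
```

Note that a 'spelling [transcription]' key still collides if two entries share both spelling and transcription.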

kchall commented 2 years ago

(I had posted an error, but it was because my Github hadn't properly synced...I can confirm that I'm not getting crashing for either Prob. 2 or 3 now.)

kchall commented 2 years ago

Oh dear. I can confirm that all of the above problems seem to be solved -- yay! But the documentation (https://github.com/PhonologicalCorpusTools/CorpusTools/blob/master/docs/source/neighborhood_density.rst) brings up another case, where things don't match!

The corpus as described in the documentation is in the file in the PCT Dropbox folder called "ND_doc_corpus.txt." (Description: "if the word 'nata' [nɑtɑ] were in the corpus, along with the words 'mata' [mɑtɑ], 'mata' [mɑtɑ], 'sata' [sɑtɑ], and 'satha' [sɑtɑ]"...)

Let me step through what the documentation says, what actually happens, and what I think should happen:

a. Documentation says: "If the neighbourhood density of 'nata' is calculated without collapsing homophones, then it has a density of 4 ([mɑtɑ], [mɑtɑ], [sɑtɑ], and [sɑtɑ])"

--> What actually happens is that the result is 3 (whether you calculate for just 'nata' or for all words in the corpus): specifically, you get mata, sata, and satha but not the second version of mata.
--> I think the documentation is correct. The result should be 4 in this case, because homophones are not collapsed.

b. Documentation says: "If the neighbourhood density of 'nata' is calculated after first collapsing homophones, then it has a density of 2 ([mɑtɑ] and [sɑtɑ])."

--> This is indeed what currently happens, whether you calculate for just 'nata' or for all words in the corpus, and I think it is correct as is.

c. Documentation says: "Note that if homophones are collapsed before calculating neighbourhood density, this will also affect any words that are homophones of the word in question. E.g., if the neighbourhood density of 'sata' is calculated in the above example, it will have a density of 4 if homophones are not collapsed ([mɑtɑ], [mɑtɑ], [nɑtɑ], and [sɑtɑ], with [sɑtɑ] coming only from 'satha')"...

--> What actually happens is that the result is 2 (whether you calculate for just 'sata' or for all words in the corpus): specifically, you get mata and nata but not the second version of mata or sata from 'satha.'
--> I think the documentation is correct. The result should be 4 in this case, because homophones are not collapsed.

d. Documentation says: ..."while it will have a density of 2 if homophones are collapsed ([mɑtɑ] and [nɑtɑ]; [sɑtɑ] no longer counts as a neighbour because homophones are collapsed before any calculations are made)."

--> This is indeed what currently happens, whether you calculate for just 'sata' or for all words in the corpus, and I think it is correct as is.

e. Documentation says: "#NB: THIS IS CURRENTLY ONLY TRUE IF CALCULATING ND FOR ALL WORDS IN THE CORPUS; YOU GET DIFFERENT BEHAVIOUR IF IT'S ONE WORD AT A TIME! FIX THIS."

--> Ha ha. I think it is fixed insofar as I do get the same behaviour currently for all words vs. one at a time. But, the actual results are not correct in either case if homophones are NOT collapsed.

stannam commented 2 years ago

So, unexpected results when homophones are not collapsed....

a. neighbourhood density of 'nata' without collapsing homophones

Two 'words' that share both spelling and transcription seem to cause this problem. When I edited the spelling in one of the two [mɑtɑ] words, I got the correct results.
This is because the internal algorithm recognizes words by spelling and transcription.
Adding frequency to the attributes that identify a word could be a solution for now, but what happens if two words are eventually identical in spelling, transcription, and frequency (i.e., all required word attributes)?

c. neighbourhood density of 'sata' without collapsing homophones

I think the result should be 3 (mata, mata and nata). Its own homophone, 'satha' should not be included.
- 3 is what we get if the two [mɑtɑ] words have different spellings.
Whether the result should be 3 or 4 depends on the definition of phonological neighbourhood. I think homophones should not be neighbours of each other, as we briefly discussed with the 'August' words in IPhOD (below). If a neighbour is defined as a word at an edit distance of 1, homophones are not neighbours, since the edit distance between two identical transcriptions is 0.

↔ for neighbours / ↮ for not neighbours
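
To make the 3-vs-4 reasoning concrete, here is a small sketch under the definition above (neighbour = edit distance of at least 1 and at most 1). It is not PCT's implementation: edit_distance and neighbours are hypothetical helpers, and each character of the transcription is treated as one segment.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of segments."""
    previous = list(range(len(b) + 1))
    for i, seg_a in enumerate(a, start=1):
        current = [i]
        for j, seg_b in enumerate(b, start=1):
            cost = 0 if seg_a == seg_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Toy corpus from the documentation example: (spelling, transcription).
corpus = [("nata", "nɑtɑ"), ("mata", "mɑtɑ"), ("mata", "mɑtɑ"),
          ("sata", "sɑtɑ"), ("satha", "sɑtɑ")]


def neighbours(query_transcription, max_distance=1):
    # Distance 0 (the query itself and its homophones) is excluded.
    return [spelling for spelling, transcription in corpus
            if 0 < edit_distance(query_transcription, transcription) <= max_distance]


print(neighbours("sɑtɑ"))  # ['nata', 'mata', 'mata']
print(neighbours("nɑtɑ"))  # ['mata', 'mata', 'sata', 'satha']
```

With homophones not collapsed, this gives 3 for 'sata' (its homophone 'satha' is at distance 0) and 4 for 'nata'.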

kchall commented 2 years ago

Ah, good point -- I'm good with 3, as long as we clarify that in the documentation.

Is there no other unique identifier for a 'word' in the corpus? E.g., if we add a pair of words that are identical on all three elements (spelling, transcription, and frequency) to the original text file (ND_corpus_with_duplicate.txt), the corpus does load with two separate identical rows. How are these distinguished? Or are they just two identical entries in a dictionary?

stannam commented 2 years ago

Each word is a separate python object, so words are distinguished even when they have the same values for all three elements. The problem arises when the ND results are processed. Just like Problem 1 above (where a word with the same spelling overwrote existing results), Word A overwrites the existing results of Word B if the two words have the same values for the elements.

I'll try to figure out how to solve this..

stannam commented 2 years ago

... and i think i figured it out.

the culprit was this one line in neighborhood_density()

https://github.com/PhonologicalCorpusTools/CorpusTools/blob/83bdf7e9ac0bbb7fa3bfa2fcf37532fb4b9969a2/corpustools/neighdens/neighborhood_density.py#L353

'matches' is the list of Word objects, containing all Words that are considered as neighbours. e.g., [mata, mata, sata, satha] in the 'nata' case. 'query' is the Word object that the user enters. e.g., [nata]

Long story short: we don't need this line anymore because of commit 4ab07cf6a830b93a39b2f185d6fdbd64192383d8, so I removed it.

The line mentioned above is for doing a set operation, removing intersection between matches and query. It removes [nata] in the set of neighbours -- and it was critical because of the way phonological neighbourhood was previously implemented.

In the previous implementation, two words were counted as neighbours if the edit distance between them was n or less (usually, n = 1). Note that this always included an edit distance of 0.

When calculating ND, PCT loops over each word, including the query itself. Since a word is identical to itself and the edit distance is 0 in this case, PCT wrongly included the query itself in 'matches.'

But we don't want the query word in the neighbour list! So, the above line of code comes in and tries to remove it from the list. One of the easiest ways to compare duplicates between two lists is to make them sets and do a set operation. That is what the above code is doing. set(matches)-set([query]) removes duplicates between 'matches' and 'query,' which is the query word inside 'matches.'

The problem is that set() only allows unique elements, and two Word objects with the same Spelling and Transcription are 'the same' for deciding the uniqueness. Therefore, when converting the list into a set, only one of all 'mata' words can survive.
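
A rough illustration of that collapse, assuming a Word class whose equality and hash depend only on spelling and transcription. This Word is a hypothetical stand-in that models the behaviour described above, not PCT's actual class, and the identity-based filter is shown only for contrast.

```python
class Word:
    """Hypothetical stand-in: equality and hash use spelling and transcription."""

    def __init__(self, spelling, transcription):
        self.spelling = spelling
        self.transcription = transcription

    def __eq__(self, other):
        if not isinstance(other, Word):
            return NotImplemented
        return (self.spelling, self.transcription) == (other.spelling, other.transcription)

    def __hash__(self):
        return hash((self.spelling, self.transcription))


query = Word("nata", "nɑtɑ")
matches = [query, Word("mata", "mɑtɑ"), Word("mata", "mɑtɑ"),
           Word("sata", "sɑtɑ"), Word("satha", "sɑtɑ")]

# Set difference removes the query, but also merges the two 'mata' entries.
via_set = set(matches) - set([query])

# Filtering by object identity removes only the query object itself.
via_identity = [w for w in matches if w is not query]

print(len(via_set))       # 3
print(len(via_identity))  # 4
```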

However, the set operation is not needed anymore, because one of the previous commits (4ab07cf6a830b93a39b2f185d6fdbd64192383d8) prevents [nata] from sneaking into the neighbour list in the first place: as of that commit, two words with an edit distance of 0 are not neighbours. So I simply removed that line and double-checked that everything worked.

kchall commented 2 years ago

Yay! Confirming that everything in the August corpus and the ND_doc corpus looks correct, and I have updated the documentation (and also clarified that there is an inherent minimum distance of '1' to count as a neighbour -- i.e., homophones of the target don't count as neighbours, regardless of the setting about collapsing homophones).

PhonologicalCorpusTools / CorpusTools

[BUG] Homophones in the neighbourhood density calculation #785