PhonologicalCorpusTools / CorpusTools

Phonological CorpusTools
http://phonologicalcorpustools.github.io/CorpusTools/
GNU General Public License v3.0
[BUG] Homophones in the neighbourhood density calculation #785

Closed stannam closed 2 years ago

stannam commented 2 years ago

Currently, the neighbourhood density algorithm considers homophones of the query as neighbours even when the 'collapse homophones' option is on. For example, when calculating ND for 'cat' in the iphod corpus, 'Kat' (another entry in the corpus) is considered as a neighbour.

IPhOD (the benchmark) disregards homophones of the search term itself. For example, when searching for the neighbours of 'cat,' 'Kat' is not included in the results.

It seems IPhOD is correct, since homophones are not each other's neighbours. But I'd like to check that I'm thinking about this correctly.

stannam commented 2 years ago

We don't need to return results identical to those on the IPhOD webpage, but we do need to explain why in the documentation.

kchall commented 2 years ago

Still having some issues with this:

Use August_lexicon.txt

Expected results:

Spelling   Transcription   Frequency   ND_with_homophs   ND_collapse_homophs
August     ɔɡʌst           1           2                 2
Aug        ɔɡʌst           2           2                 2
Auguste    ɔɡʌst           3           2                 2
August     ɑɡʌst           5           4                 2
gust       ɡʌst            2           5                 3
gus        ɡʌs             6           1                 1
aghast     əɡæst           12          0                 0
peace      pis             23          0                 0
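The expected values above can be reproduced with a minimal sketch, assuming that each character of a transcription is one segment, that neighbours are words at an edit distance of exactly 1 (so homophones of the query never count), and that collapsing homophones means counting each distinct transcription once. The names `edit_distance`, `neighbourhood_density`, `nd_table`, and `LEXICON` are made up for this illustration and are not PCT's API.

```python
def edit_distance(a, b):
    """Levenshtein distance between two segment sequences (here: strings)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


# (spelling, transcription) pairs from the expected-results table.
LEXICON = [
    ("August", "ɔɡʌst"), ("Aug", "ɔɡʌst"), ("Auguste", "ɔɡʌst"),
    ("August", "ɑɡʌst"), ("gust", "ɡʌst"), ("gus", "ɡʌs"),
    ("aghast", "əɡæst"), ("peace", "pis"),
]


def neighbourhood_density(query, transcriptions, max_distance=1):
    """Count entries whose distance d to the query satisfies 0 < d <= max_distance."""
    return sum(1 for t in transcriptions
               if 0 < edit_distance(query, t) <= max_distance)


def nd_table(lexicon):
    """Return (spelling, transcription, ND without collapsing, ND with collapsing)."""
    every_entry = [t for _, t in lexicon]
    distinct = list(dict.fromkeys(every_entry))  # one entry per transcription
    return [(s, t,
             neighbourhood_density(t, every_entry),
             neighbourhood_density(t, distinct))
            for s, t in lexicon]


for row in nd_table(LEXICON):
    print(*row)
```

Under these assumptions, the two counts printed for each row agree with the ND_with_homophs and ND_collapse_homophs columns.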

Problem #1: If you calculate ND for all words in corpus, without collapsing homophones and export neighbours to .txt file (August_neighbours_no_collapsing_homophones.txt):

  1. The numeric results in the corpus in PCT are correct (i.e., they match "ND_with_homophs" above)!
  2. But only one entry “August” is listed in the output file — the one with 4 neighbours (so the one whose own pronunciation is [ɑɡʌst]). The other “August” ([ɔɡʌst]), which should have 2 neighbours, isn’t in the output file.

Problem #2: If you calculate ND for all words in corpus, this time collapsing homophones and export neighbours to .txt file (August_neighbours_with_collapsing_homophones.txt), PCT throws an error:

Traceback (most recent call last):
  File "/Users/KCH/Desktop/CorpusTools/corpustools/gui/ndgui.py", line 115, in run
    collapse_homophones = kwargs['collapse_homophones']
  File "/Users/KCH/Desktop/CorpusTools/corpustools/neighdens/neighborhood_density.py", line 87, in neighborhood_density_all_words
    collapse_homophones = collapse_homophones)
  File "/Users/KCH/Desktop/CorpusTools/corpustools/neighdens/neighborhood_density.py", line 183, in neighborhood_density
    w_sequence = getattr(w, sequence_type)
TypeError: getattr(): attribute name must be string

Problem #3: If you have already added the neighbourhood density column to the corpus, and then try to re-do it (i.e., repeat the first one, without collapsing homophones), PCT crashes entirely, even if you have given the ND column a new name.

stannam commented 2 years ago

Note to myself:

kchall commented 2 years ago

(I had posted an error, but it was because my Github hadn't properly synced...I can confirm that I'm not getting crashing for either Prob. 2 or 3 now.)

kchall commented 2 years ago

Oh dear. I can confirm that all of the above problems seem to be solved -- yay! But the documentation (https://github.com/PhonologicalCorpusTools/CorpusTools/blob/master/docs/source/neighborhood_density.rst) brings up another case, where things don't match!

The corpus as described in the documentation is in the file in the PCT Dropbox folder called "ND_doc_corpus.txt." (Description: "if the word 'nata' [nɑtɑ] were in the corpus, along with the words 'mata' [mɑtɑ], 'mata' [mɑtɑ], 'sata' [sɑtɑ], and 'satha' [sɑtɑ]"...)

Let me step through what the documentation says, what actually happens, and what I think should happen:

a. Documentation says: "If the neighbourhood density of 'nata' is calculated without collapsing homophones, then it has a density of 4 ([mɑtɑ], [mɑtɑ], [sɑtɑ], and [sɑtɑ])"

--> What actually happens is that the result is 3 (whether you calculate for just 'nata' or for all words in the corpus): specifically, you get mata, sata, and satha but not the second version of mata.

--> I think the documentation is correct. The result should be 4 in this case, because homophones are not collapsed.

b. Documentation says: "If the neighbourhood density of 'nata' is calculated after first collapsing homophones, then it has a density of 2 ([mɑtɑ] and [sɑtɑ])."

--> This is indeed what currently happens, whether you calculate for just 'nata' or for all words in the corpus, and I think it is correct as is.

c. Documentation says: "Note that if homophones are collapsed before calculating neighbourhood density, this will also affect any words that are homophones of the word in question. E.g., if the neighbourhood density of 'sata' is calculated in the above example, it will have a density of 4 if homophones are not collapsed ([mɑtɑ], [mɑtɑ], [nɑtɑ], and [sɑtɑ], with [sɑtɑ] coming only from 'satha')"...

--> What actually happens is that the result is 2 (whether you calculate for just 'sata' or for all words in the corpus): specifically, you get mata and nata but not the second version of mata or sata from 'satha.'

--> I think the documentation is correct. The result should be 4 in this case, because homophones are not collapsed.

d. Documentation says: ..."while it will have a density of 2 if homophones are collapsed ([mɑtɑ] and [nɑtɑ]; [sɑtɑ] no longer counts as a neighbour because homophones are collapsed before any calculations are made)."

--> This is indeed what currently happens, whether you calculate for just 'sata' or for all words in the corpus, and I think it is correct as is.

e. Documentation says: "#NB: THIS IS CURRENTLY ONLY TRUE IF CALCULATING ND FOR ALL WORDS IN THE CORPUS; YOU GET DIFFERENT BEHAVIOUR IF IT'S ONE WORD AT A TIME! FIX THIS."

--> Ha ha. I think it is fixed insofar as I do get the same behaviour currently for all words vs. one at a time. But, the actual results are not correct in either case if homophones are NOT collapsed.

stannam commented 2 years ago

So, unexpected results when homophones are not collapsed....

a. neighbourhood density of 'nata' without collapsing homophones

c. neighbourhood density of 'sata' without collapsing homophones

[Image not preserved. Legend: ↔ for neighbours / ↮ for not neighbours]

kchall commented 2 years ago

Ah, good point -- I'm good with 3, as long as we clarify that in the documentation.

Is there no other unique identifier for a 'word' in the corpus? E.g., if we add a pair of words that are identical on all three elements (spelling, transcription, and frequency) to the original text file (ND_corpus_with_duplicate.txt), the corpus does load with two separate identical rows. How are these distinguished? Or are they just two identical entries in a dictionary?

stannam commented 2 years ago

Each word is a separate Python object, so words are distinguished even when they have the same values for all three elements. The problem arises when the ND results are processed. Just as in Problem 1 above (where a word with the same spelling overwrote existing results), the results for Word A overwrite the existing results for Word B if the two words have the same values for those elements.
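To illustrate the kind of overwriting described here, a minimal sketch follows. It is not PCT's actual result-processing code: the `Word` class and the `by_spelling` / `by_position` containers are hypothetical, and the densities are taken from the August example above. It only shows how results keyed by a word's values (rather than by the object or its position) lose an entry.

```python
class Word:
    """Hypothetical stand-in for a corpus entry; not PCT's Word class."""

    def __init__(self, spelling, transcription, frequency):
        self.spelling = spelling
        self.transcription = transcription
        self.frequency = frequency


# The two 'August' entries, with their ND values without collapsing homophones.
words = [Word("August", "ɔɡʌst", 1), Word("August", "ɑɡʌst", 5)]
densities = [2, 4]

# Keying results by spelling: the second 'August' replaces the first.
by_spelling = {}
for word, nd in zip(words, densities):
    by_spelling[word.spelling] = nd

# Keeping results alongside the objects themselves preserves both entries.
by_position = list(zip(words, densities))

print(by_spelling)       # only one 'August' entry is left
print(len(by_position))  # both entries are kept
```

In this sketch the surviving entry is the one written last, which is consistent with the output file listing only the 'August' that has 4 neighbours.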

I'll try to figure out how to solve this..

stannam commented 2 years ago

... and I think I figured it out.

The culprit was this one line in neighborhood_density():

https://github.com/PhonologicalCorpusTools/CorpusTools/blob/83bdf7e9ac0bbb7fa3bfa2fcf37532fb4b9969a2/corpustools/neighdens/neighborhood_density.py#L353

'matches' is the list of Word objects, containing all Words that are considered as neighbours. e.g., [mata, mata, sata, satha] in the 'nata' case. 'query' is the Word object that the user enters. e.g., [nata]

Long story short: we don't need this line anymore because of commit 4ab07cf6a830b93a39b2f185d6fdbd64192383d8, so I removed it.

The line mentioned above is for doing a set operation, removing intersection between matches and query. It removes [nata] in the set of neighbours -- and it was critical because of the way phonological neighbourhood was previously implemented.

In the previous implementation, two words were counted as neighbours if their edit distance was n or less (usually, n = 1). Note that this always included an edit distance of 0.

When calculating ND, PCT loops over every word in the corpus, including the query itself. Since a word is identical to itself, its edit distance to itself is 0, so PCT wrongly included the query itself in 'matches.'

But we don't want the query word in the neighbour list! So, the above line of code came in and tried to remove it from the list. One of the easiest ways to remove the elements that two lists have in common is to make them sets and do a set operation. That is what the above code was doing: set(matches)-set([query]) removes whatever 'matches' and 'query' share, which is the query word inside 'matches.'

The problem is that set() only keeps unique elements, and two Word objects with the same Spelling and Transcription count as 'the same' when uniqueness is decided. Therefore, when the list is converted into a set, only one of the 'mata' words can survive.

However, the set operation is no longer needed, because one of the previous commits (4ab07cf6a830b93a39b2f185d6fdbd64192383d8) prevents [nata] from sneaking into the neighbour list in the first place. As of that commit, two words with an edit distance of 0 are not neighbours. So I simply removed that line and double-checked that everything worked.
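The set() behaviour described above can be reproduced with a small sketch. The `Word` class here is a hypothetical stand-in that assumes equality and hashing are based on spelling and transcription, as described in this comment; it is not PCT's actual class.

```python
class Word:
    """Hypothetical stand-in; equality is assumed to use spelling + transcription."""

    def __init__(self, spelling, transcription):
        self.spelling = spelling
        self.transcription = transcription

    def __eq__(self, other):
        return (isinstance(other, Word)
                and (self.spelling, self.transcription)
                == (other.spelling, other.transcription))

    def __hash__(self):
        return hash((self.spelling, self.transcription))

    def __repr__(self):
        return self.spelling


query = Word("nata", "nɑtɑ")
matches = [Word("mata", "mɑtɑ"), Word("mata", "mɑtɑ"),
           Word("sata", "sɑtɑ"), Word("satha", "sɑtɑ")]

# The removed approach: converting to sets merges the two equal 'mata' entries.
via_set = set(matches) - set([query])

# Leaving the list untouched keeps every entry, duplicates included.
via_list = list(matches)

print(len(via_set))   # 3
print(len(via_list))  # 4
```

The set-based version yields 3 neighbours for 'nata' (one 'mata' is lost), while the plain list keeps all 4, which matches the difference between the earlier and the corrected results for the ND_doc corpus.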

kchall commented 2 years ago

Yay! Confirming that everything in the August corpus and the ND_doc corpus looks correct, and I have updated the documentation (and also clarified that there is an inherent minimum distance of '1' to count as a neighbour -- i.e., homophones of the target don't count as neighbours, regardless of the setting about collapsing homophones).