PhonologicalCorpusTools / CorpusTools

Phonological CorpusTools
http://phonologicalcorpustools.github.io/CorpusTools/
GNU General Public License v3.0

Enhancement to string similarity (with small bug) #769

Open kchall opened 3 years ago

kchall commented 3 years ago

Currently, when calculating string similarity for a .txt file that contains pairs of words, if there is a word that is not in the corpus, PCT simply crashes with no explanation.

Layers of fixes:

  1. Simple immediate bug fix: If there's a word not in the corpus, warn the user and then calculate the rest of the pairs normally, marking the problematic one(s) as being "NA".
  2. Potentially straightforward workaround: Allow a user to simply calculate the similarity of word pairs in a .txt file independently of a corpus. That is, if the word pairs are orthographic, the similarity calculation is orthographic; if the word pairs are transcription, the calculation is transcription.
  3. More complicated ideal fix: If there's a word not in the corpus, have PCT give the user an option: (a) skip the pair and calculate the rest normally; (b) calculate the similarity of the problematic pair based on whatever is in the .txt file, ignoring the corpus; or (c) provide an interface for inputting the transcription for the missing word, using the same transcription symbol inventory as the rest of the corpus.
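
Fix 1 above could look roughly like the following. This is a minimal sketch, not PCT's actual code: `score_pairs`, `toy_similarity`, and the plain set of corpus words are hypothetical names chosen for illustration, and the `difflib` ratio is only a stand-in for whichever string similarity algorithm is selected.

```python
import warnings
from difflib import SequenceMatcher


def toy_similarity(w1, w2):
    # Stand-in metric (assumption): NOT one of PCT's similarity algorithms.
    return SequenceMatcher(None, w1, w2).ratio()


def score_pairs(pairs, corpus_words, similarity):
    """Score each word pair; pairs with a word missing from the corpus get "NA".

    Returns (rows, missing): rows are (word1, word2, score) tuples and
    missing lists every out-of-corpus word in the order encountered.
    """
    rows = []
    missing = []
    for w1, w2 in pairs:
        absent = [w for w in (w1, w2) if w not in corpus_words]
        if absent:
            # Do not crash: record the problem and keep going.
            missing.extend(absent)
            rows.append((w1, w2, "NA"))
        else:
            rows.append((w1, w2, similarity(w1, w2)))
    if missing:
        # Warn once, after all pairs have been processed.
        warnings.warn(
            f"{len(missing)} word(s) not in the corpus: {' '.join(missing)}"
        )
    return rows, missing
```

Collecting the missing words instead of raising on the first one means the remaining pairs are still calculated, and the full list is available for whatever warning the user sees.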

stannam commented 3 years ago

How about doing fix 1 (i.e., warn the user and then return NA) for now, and then 2 or 3 after the release? Fix 1 seems quite easy.

kchall commented 3 years ago

Yep, that's exactly what I was thinking! :)

Though see also #770 for related issues...

stannam commented 3 years ago

Notes to myself: follow the example of 'exhaustivity' in ProD (predictability of distribution). The error message there contains "Show details...", which shows the whole list of words. To do: (i) automatically export the list as a .txt file in the ERROR folder; (ii) add a 'Show details...' option.

(Image: an error message in predictability of distribution)
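
Step (i) of the note could be sketched as below. This is only an illustration under assumptions: `export_missing_words` is a hypothetical helper, the error directory is passed in by the caller rather than looked up from PCT's settings, and the default file name is the one quoted later in this thread.

```python
from pathlib import Path


def export_missing_words(missing, error_dir, filename="str_similarity_error.txt"):
    """Write one missing word per line to a .txt file and return its path."""
    error_dir = Path(error_dir)
    # Create the error folder if it does not exist yet.
    error_dir.mkdir(parents=True, exist_ok=True)
    path = error_dir / filename
    path.write_text("\n".join(missing) + "\n", encoding="utf-8")
    return path
```

Returning the path lets the caller mention the exact file location in the error dialog.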

kchall commented 2 years ago

This is looking good! Change wording in error report slightly -- bolded places are edits (spacing, phrasing, etc.):

“3 words are not in the corpus. For details, please refer to file str_similarity_error.txt in the ERRORS directory or click on Show Details below.

Currently, the calculation is only available with the words in the corpus. Results for the words that are not in the corpus will be listed as N/A.”

and then in the “Details”:

“The following words are not found in the corpus 'example.' Currently, the string similarity calculation is only available for words already in the corpus.

The text file you loaded: /Users/KCH/Desktop/temp/test_word_pairs.txt
Corpus: example
Words not in the corpus: musa musa babi”

…and change button labelled ‘Close’ to ‘Continue.’
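
The requested wording could be assembled from the list of missing words along these lines. This is a hedged sketch: `build_error_report` and its parameters are hypothetical, the singular/plural handling is an addition not discussed in the thread, and how the two strings are attached to the dialog and its 'Show Details' button is left out.

```python
def build_error_report(missing, corpus_name, pairs_path,
                       error_file="str_similarity_error.txt"):
    """Return (summary, details) strings for the error dialog."""
    n = len(missing)
    # Assumption: handle the singular case, which the thread does not cover.
    count = "1 word is" if n == 1 else f"{n} words are"
    summary = (
        f"{count} not in the corpus. For details, please refer to file "
        f"{error_file} in the ERRORS directory or click on Show Details below.\n\n"
        "Currently, the calculation is only available with the words in the "
        "corpus. Results for the words that are not in the corpus will be "
        "listed as N/A."
    )
    details = (
        f"The following words are not found in the corpus '{corpus_name}.' "
        "Currently, the string similarity calculation is only available for "
        "words already in the corpus.\n\n"
        f"The text file you loaded: {pairs_path}\n"
        f"Corpus: {corpus_name}\n"
        f"Words not in the corpus: {' '.join(missing)}"
    )
    return summary, details
```

Keeping the summary and the details as separate strings matches the two-part layout described above, with the details shown only on request.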

kchall commented 2 years ago

(Were we going to add the option for the user to specify if the .txt file is spelling or transcription? I can't remember.)

stannam commented 2 years ago

note to myself: