Phonological CorpusTools
http://phonologicalcorpustools.github.io/CorpusTools/
GNU General Public License v3.0

ND on list of words not in the corpus #543

Closed kchall closed 6 years ago

kchall commented 8 years ago

Allow users to have a list of word transcriptions that aren't in the corpus and calculate ND on them (currently, the "list" option only works on words in the corpus). (Maybe this already works in the Command Line?)

bhallen commented 8 years ago

This does work on the command line. The neighborhood density functions themselves currently assume that they're passed a query that is a Word object, and so I think the most straightforward approach would be to add something like pct_neighdens's ensure_query_is_word function into an appropriate place in the GUI code. (If the GUI can use the function exactly as-is, then we could just put it into neighdens.py and import.) I've copied the function below.

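from corpustools.corpus.classes.lexicon import Word


def ensure_query_is_word(query, corpus, sequence_type, trans_delimiter):
    # A Word instance can be used as-is.
    if isinstance(query, Word):
        query_word = query
    else:
        try:
            # Prefer the existing corpus entry when the query is in the corpus.
            query_word = corpus.corpus.find(query)
        except KeyError:
            # Otherwise build a new Word from the query string.
            if trans_delimiter == '':
                # No delimiter: treat each character as one segment.
                query_word = Word(**{sequence_type: list(query)})
            else:
                query_word = Word(**{sequence_type: query.split(trans_delimiter)})
    return query_word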

If this doesn't work for whatever reason, we could also change the neighborhood density functions to do this coercion themselves.

jsmackie commented 8 years ago

That's useful code, thanks! I've worked out a partial fix for this in the GUI, but I'm not quite sure what to do with the transcription delimiter parameter. It's not possible to retrieve the delimiter used when the corpus was loaded because (a) that information is not saved anywhere and (b) some corpora don't have delimiters supplied at any point (like the ones you can download). I guess the best thing would be for me to add in a text box where a user can specify any delimiter in use.

bhallen commented 8 years ago

That sounds like a good plan. Thanks, Scott.

Another option for getting delimiters would be to try doing something like inspect_csv does in corpus/io/csv.py, namely having a list of common delimiters that probably won't be used as segments like commas and tabs and just checking to see whether the provided string includes any of them (and if so, setting that as the delimiter).
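
A rough sketch of that kind of delimiter check is below. The function name `guess_delimiter` and the candidate list are assumptions made for illustration; they are not taken from `inspect_csv` or any other PCT code.

```python
# Hypothetical helper, not part of PCT.
# Candidates are characters assumed unlikely to be used as segment symbols.
COMMON_DELIMITERS = ('\t', ',', ';', ' ')


def guess_delimiter(text, candidates=COMMON_DELIMITERS):
    """Return the first candidate delimiter that occurs in text, or None."""
    for delim in candidates:
        if delim in text:
            return delim
    return None
```

A `None` result would mean "no delimiter found", in which case the caller could fall back to treating each character as one segment.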

jsmackie commented 8 years ago

Here is the solution I decided to go with, which I think will require the least work on the part of the user. By default, assume that the input list contains words without any multi-character sequences. Users will be required to use periods (specifically) as delimiters, and only in words that do contain multi-character sequences. For example, if a language has /ts/ as an affricate, then you have to write the word /atsa/ as "a.ts.a" in the file, but the word /blat/ can be delimited or written simply as "blat".

I've added a tooltip to the dialog window that explains this (using these same examples).
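
The rule described above can be sketched as follows. This is only an illustration of the convention; `parse_segments` is a hypothetical name and not necessarily how the GUI code implements it.

```python
def parse_segments(entry, delimiter='.'):
    """Split one word-list entry into a list of segments.

    Entries containing the delimiter are split on it; all other entries
    are treated as one character per segment.
    """
    entry = entry.strip()
    if delimiter in entry:
        # Drop empty pieces, e.g. from a stray trailing period.
        return [seg for seg in entry.split(delimiter) if seg]
    return list(entry)
```

With this rule, "a.ts.a" gives the three segments a, ts, a, while "blat" and "b.l.a.t" both give b, l, a, t.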

kchall commented 8 years ago

This all sounds good to me. In trying it out for updating the docs, I ran into the following errors:

First, while running spelling ND on a list of words not in the example corpus, themselves written orthographically (the file is available: CorpusTools/tests/data/wordlists/orth_words.txt):

Traceback (most recent call last):
  File "/Users/KCH/Desktop/CorpusTools/corpustools/neighdens/neighborhood_density.py", line 264, in ensure_query_is_word
    query_word = corpus.corpus.find(query)
  File "/Users/KCH/Desktop/CorpusTools/corpustools/corpus/classes/lexicon.py", line 2487, in find
    raise KeyError('The word \"{}\" is not in the corpus'.format(word))
KeyError: 'The word "mito" is not in the corpus'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/KCH/Desktop/CorpusTools/corpustools/gui/ndgui.py", line 43, in run
    call_back = kwargs['call_back'])
  File "/Users/KCH/Desktop/CorpusTools/corpustools/neighdens/neighborhood_density.py", line 114, in neighborhood_density
    return fast_neighborhood_density(corpus_context, query, corpus_context.sequence_type)
  File "/Users/KCH/Desktop/CorpusTools/corpustools/neighdens/neighborhood_density.py", line 162, in fast_neighborhood_density
    query = ensure_query_is_word(query, corpus_context, sequence_type)
  File "/Users/KCH/Desktop/CorpusTools/corpustools/neighdens/neighborhood_density.py", line 268, in ensure_query_is_word
    query_word = Word(**{sequence_type: list(query)})
  File "/Users/KCH/Desktop/CorpusTools/corpustools/corpus/classes/lexicon.py", line 918, in __init__
    if att.is_default:
UnboundLocalError: local variable 'att' referenced before assignment

Second, while running transcription ND on a list of words not in the example corpus, themselves written in IPA (the file is available: CorpusTools/tests/data/wordlists/trans_words.txt): [I note that the same problem arises if a list of transcribed words that ARE already in the corpus is used.]

Traceback (most recent call last):
  File "/Users/KCH/Desktop/CorpusTools/corpustools/gui/windows.py", line 254, in newTable
    self.calc()
  File "/Users/KCH/Desktop/CorpusTools/corpustools/gui/windows.py", line 240, in calc
    kwargs = self.generateKwargs()
  File "/Users/KCH/Desktop/CorpusTools/corpustools/gui/ndgui.py", line 402, in generateKwargs
    text = load_words_neighden(path)
  File "/Users/KCH/Desktop/CorpusTools/corpustools/neighdens/io.py", line 7, in load_words_neighden
    for line in f:
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xca in position 5: ordinal not in range(128)
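
The `UnicodeDecodeError` above is raised from `encodings/ascii.py`, which suggests the word-list file was opened with the platform default encoding (ASCII on this machine) rather than UTF-8. Bytes such as 0xc9 and 0xca are lead bytes of UTF-8 encoded characters in the IPA Extensions block (U+0250 to U+02AF), which is consistent with that reading. A minimal sketch of reading such a file with an explicit encoding follows; `read_word_list` is a hypothetical stand-in rather than the actual `load_words_neighden` code, and it assumes the files are saved as UTF-8.

```python
import os
import tempfile


def read_word_list(path, encoding='utf-8'):
    """Return the non-empty lines of a one-word-per-line file."""
    # Passing encoding explicitly avoids depending on the platform default.
    with open(path, encoding=encoding) as f:
        return [line.strip() for line in f if line.strip()]


# Demonstration with a temporary file that contains IPA symbols.
with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.txt',
                                 delete=False) as tmp:
    tmp.write('ʃ.i\nɑ.t.a\n\n')
    demo_path = tmp.name

demo_words = read_word_list(demo_path)
os.remove(demo_path)
```

If some files were saved with a byte-order mark, `encoding='utf-8-sig'` would strip it on read.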

kchall commented 8 years ago

The transcription ND analysis does work if a list of words is used that ARE in the corpus, and are orthographically transcribed.

Using the same list of words, though, and asking for orthographic ND, doesn't return an error message, but does seem to be incorrect. Specifically, the word "nata" returns an ND of 1 if transcription is used (which is correct, as there is the word "mata" in the corpus), but returns an ND of 0 if orthography is used (which is incorrect for the same reason).
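
For reference, the expected value for "nata" can be checked with a small stand-alone sketch that counts lexicon entries within edit distance 1 of the query. The toy lexicon and the function names are assumptions for illustration (only "mata" comes from the example above), and this is not PCT's implementation.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or segment lists)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (x != y),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def neighborhood_density(query, lexicon, max_distance=1):
    """Count lexicon entries within max_distance of query, excluding the query itself."""
    return sum(1 for word in lexicon
               if word != query and edit_distance(query, word) <= max_distance)


# Hypothetical toy lexicon.
toy_lexicon = ['mata', 'tusa', 'sasi']
density = neighborhood_density('nata', toy_lexicon)
```

Because the comparison only depends on the sequence of symbols, the same function should give the same count for spelling and transcription whenever the two forms are identical, as they are for "nata" and "mata".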

kchall commented 8 years ago

existing_orth_words.txt existing_trans_words.txt orth_words.txt trans_words.txt

kchall commented 7 years ago

As of 31 October 2016:

  1. Example corpus -- calculate ND -- spelling -- using spelled words NOT IN corpus -- from orth_words.txt: [This should work.]

Traceback (most recent call last):
  File "/Users/KCH/Desktop/CorpusTools/corpustools/gui/ndgui.py", line 43, in run
    call_back = kwargs['call_back'])
  File "/Users/KCH/Desktop/CorpusTools/corpustools/neighdens/neighborhood_density.py", line 118, in neighborhood_density
    return fast_neighborhood_density(corpus_context, query, corpus_context.sequence_type)
  File "/Users/KCH/Desktop/CorpusTools/corpustools/neighdens/neighborhood_density.py", line 167, in fast_neighborhood_density
    for candidate in generate_neighbor_candidates(corpus_context, query, sequence_type):
  File "/Users/KCH/Desktop/CorpusTools/corpustools/neighdens/neighborhood_density.py", line 176, in generate_neighbor_candidates
    sequence = getattr(query, sequence_type)
AttributeError: 'Word' object has no attribute 'Spelling'

  1. Example corpus -- calculate ND -- transcription -- using spelled words NOT IN corpus -- from orth_words.txt: [This should instead return an error saying "It's not possible to calculate transcription-based neighbourhood density using a list of words that only has orthographic values." Though this might be complicated if PCT has no way of knowing that the list of words is orthographic?]

Traceback (most recent call last):
  File "/Users/KCH/Desktop/CorpusTools/corpustools/gui/ndgui.py", line 43, in run
    call_back = kwargs['call_back'])
  File "/Users/KCH/Desktop/CorpusTools/corpustools/neighdens/neighborhood_density.py", line 118, in neighborhood_density
    return fast_neighborhood_density(corpus_context, query, corpus_context.sequence_type)
  File "/Users/KCH/Desktop/CorpusTools/corpustools/neighdens/neighborhood_density.py", line 167, in fast_neighborhood_density
    for candidate in generate_neighbor_candidates(corpus_context, query, sequence_type):
  File "/Users/KCH/Desktop/CorpusTools/corpustools/neighdens/neighborhood_density.py", line 176, in generate_neighbor_candidates
    sequence = getattr(query, sequence_type)
AttributeError: 'Word' object has no attribute 'Transcription'

  1. Example corpus -- calculate ND -- spelling OR transcription -- using transcribed words NOT IN corpus -- from trans_words.txt: [The spelling version should return an error saying "It's not possible to calculate spelling-based neighbourhood density using a list of words that only has transcription values." Though this might be complicated if PCT has no way of knowing that the list of words is transcribed? The transcription one should work.] See comment below for update.

Traceback (most recent call last):
  File "/Users/KCH/Desktop/CorpusTools/corpustools/gui/windows.py", line 255, in newTable
    self.calc()
  File "/Users/KCH/Desktop/CorpusTools/corpustools/gui/windows.py", line 241, in calc
    kwargs = self.generateKwargs()
  File "/Users/KCH/Desktop/CorpusTools/corpustools/gui/ndgui.py", line 409, in generateKwargs
    text = load_words_neighden(path)
  File "/Users/KCH/Desktop/CorpusTools/corpustools/neighdens/io.py", line 7, in load_words_neighden
    for line in f:
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xca in position 5: ordinal not in range(128)

  1. Example corpus -- calculate ND -- spelling -- using spelled words IN corpus -- from existing_orth_words.txt:

No error is given, but the results are incorrect. The ND values are all 0, even when there is at least one neighbour (as in the case of [nata]).

  1. Example corpus -- calculate ND -- transcription -- using spelled words IN corpus -- from existing_orth_words.txt:

No error is given, but the results are incorrect. The ND values are all 0, even when there is at least one neighbour (as in the case of [nata]).

  1. Example corpus -- calculate ND -- spelling -- using transcribed words IN corpus -- from existing_trans_words.txt: [This should work; PCT should look up the words' spelling from their transcription. Though this might be complicated if PCT has no way of knowing that the list of words is transcribed?] See comment below for update.

Traceback (most recent call last):
  File "/Users/KCH/Desktop/CorpusTools/corpustools/gui/windows.py", line 255, in newTable
    self.calc()
  File "/Users/KCH/Desktop/CorpusTools/corpustools/gui/windows.py", line 241, in calc
    kwargs = self.generateKwargs()
  File "/Users/KCH/Desktop/CorpusTools/corpustools/gui/ndgui.py", line 409, in generateKwargs
    text = load_words_neighden(path)
  File "/Users/KCH/Desktop/CorpusTools/corpustools/neighdens/io.py", line 7, in load_words_neighden
    for line in f:
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc9 in position 0: ordinal not in range(128)

  1. Example corpus -- calculate ND -- transcription -- using transcribed words IN corpus -- from existing_trans_words.txt: [This should work.] See comment below for update.

Traceback (most recent call last):
  File "/Users/KCH/Desktop/CorpusTools/corpustools/gui/windows.py", line 255, in newTable
    self.calc()
  File "/Users/KCH/Desktop/CorpusTools/corpustools/gui/windows.py", line 241, in calc
    kwargs = self.generateKwargs()
  File "/Users/KCH/Desktop/CorpusTools/corpustools/gui/ndgui.py", line 409, in generateKwargs
    text = load_words_neighden(path)
  File "/Users/KCH/Desktop/CorpusTools/corpustools/neighdens/io.py", line 7, in load_words_neighden
    for line in f:
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc9 in position 0: ordinal not in range(128)

[General question: does PCT expect the list of words to have a header like "Spelling" or "Transcription"? Would that help? -- or add a selection within the GUI for whether the list has transcribed or spelled forms?]
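
One way the header idea could work is sketched below: the first line of the list names the tier, and a mismatch with the requested calculation produces a readable error instead of a traceback. The header convention, the function names, and the exact error wording are assumptions for illustration only; nothing here reflects what PCT currently does.

```python
VALID_TIERS = ('spelling', 'transcription')


def split_header(lines):
    """Return (tier, words); tier is taken from an optional header line, else None."""
    if lines and lines[0].strip().lower() in VALID_TIERS:
        tier = lines[0].strip().lower()
        return tier, [w.strip() for w in lines[1:] if w.strip()]
    return None, [w.strip() for w in lines if w.strip()]


def check_tier(list_tier, requested_tier):
    """Raise a readable error when the list cannot support the requested calculation."""
    requested = requested_tier.lower()
    if list_tier is not None and list_tier != requested:
        raise ValueError(
            "It's not possible to calculate {}-based neighbourhood density "
            "using a list of words that only has {} values.".format(
                requested, list_tier))
```

A check like this would only make sense for words that are not in the corpus; for words that are, the other tier could be looked up from the corpus entry instead, as suggested in the list above.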

kchall commented 7 years ago

Several of the cases above now give different errors from the ones I listed, after Scott updated the example.corpus file:

  1. Example corpus -- calculate ND -- transcription -- using transcribed words NOT IN corpus -- from trans_words.txt:

Traceback (most recent call last):
  File "/Users/KCH/Desktop/CorpusTools/corpustools/gui/ndgui.py", line 43, in run
    call_back = kwargs['call_back'])
  File "/Users/KCH/Desktop/CorpusTools/corpustools/neighdens/neighborhood_density.py", line 118, in neighborhood_density
    return fast_neighborhood_density(corpus_context, query, corpus_context.sequence_type)
  File "/Users/KCH/Desktop/CorpusTools/corpustools/neighdens/neighborhood_density.py", line 167, in fast_neighborhood_density
    for candidate in generate_neighbor_candidates(corpus_context, query, sequence_type):
  File "/Users/KCH/Desktop/CorpusTools/corpustools/neighdens/neighborhood_density.py", line 176, in generate_neighbor_candidates
    sequence = getattr(query, sequence_type)
AttributeError: 'Word' object has no attribute 'Transcription'

  1. Example corpus -- calculate ND -- spelling -- using transcribed words NOT IN corpus -- from trans_words.txt:

Traceback (most recent call last):
  File "/Users/KCH/Desktop/CorpusTools/corpustools/gui/ndgui.py", line 43, in run
    call_back = kwargs['call_back'])
  File "/Users/KCH/Desktop/CorpusTools/corpustools/neighdens/neighborhood_density.py", line 118, in neighborhood_density
    return fast_neighborhood_density(corpus_context, query, corpus_context.sequence_type)
  File "/Users/KCH/Desktop/CorpusTools/corpustools/neighdens/neighborhood_density.py", line 167, in fast_neighborhood_density
    for candidate in generate_neighbor_candidates(corpus_context, query, sequence_type):
  File "/Users/KCH/Desktop/CorpusTools/corpustools/neighdens/neighborhood_density.py", line 176, in generate_neighbor_candidates
    sequence = getattr(query, sequence_type)
AttributeError: 'Word' object has no attribute 'Spelling'

  1. Example corpus -- calculate ND -- transcription -- using transcribed words IN corpus -- from existing_trans_words.txt:

Traceback (most recent call last):
  File "/Users/KCH/Desktop/CorpusTools/corpustools/gui/ndgui.py", line 43, in run
    call_back = kwargs['call_back'])
  File "/Users/KCH/Desktop/CorpusTools/corpustools/neighdens/neighborhood_density.py", line 118, in neighborhood_density
    return fast_neighborhood_density(corpus_context, query, corpus_context.sequence_type)
  File "/Users/KCH/Desktop/CorpusTools/corpustools/neighdens/neighborhood_density.py", line 167, in fast_neighborhood_density
    for candidate in generate_neighbor_candidates(corpus_context, query, sequence_type):
  File "/Users/KCH/Desktop/CorpusTools/corpustools/neighdens/neighborhood_density.py", line 176, in generate_neighbor_candidates
    sequence = getattr(query, sequence_type)
AttributeError: 'Word' object has no attribute 'Transcription'

  1. Example corpus -- calculate ND -- spelling -- using transcribed words IN corpus -- from existing_trans_words.txt:

Traceback (most recent call last):
  File "/Users/KCH/Desktop/CorpusTools/corpustools/gui/ndgui.py", line 43, in run
    call_back = kwargs['call_back'])
  File "/Users/KCH/Desktop/CorpusTools/corpustools/neighdens/neighborhood_density.py", line 118, in neighborhood_density
    return fast_neighborhood_density(corpus_context, query, corpus_context.sequence_type)
  File "/Users/KCH/Desktop/CorpusTools/corpustools/neighdens/neighborhood_density.py", line 167, in fast_neighborhood_density
    for candidate in generate_neighbor_candidates(corpus_context, query, sequence_type):
  File "/Users/KCH/Desktop/CorpusTools/corpustools/neighdens/neighborhood_density.py", line 176, in generate_neighbor_candidates
    sequence = getattr(query, sequence_type)
AttributeError: 'Word' object has no attribute 'Spelling'

kchall commented 7 years ago

Spelling neighbourhood density is still not working (nata / mata).

kchall commented 7 years ago

We also still need to allow an option for whether the list is transcribed or spelled, separate from whether the calculation is based on transcription or spelling.

kchall commented 7 years ago

Folks in the Speech in Context lab are trying to use this option, but are running into problems. They are excellently thorough in their reporting:

"We are trying to look for the possible lexical neighbours of the nonwords used in the Savoury experiment, but I have run into a few issues with PCT. I wanted to check in with you to see if I have made any missteps, or if I broke the program. My process is detailed below:

A list of nonwords was prepared as a .txt file, with one word per line. (Attached below)

The latest PCT (v1.2.0) was downloaded from Github, and I ran the program through the security dialogue box.

I loaded the IPHOD corpus (by going File > Load corpus... > Download example corpora > IPHOD > OK > selecting it from the Available corpora list > Load selected corpus)

To find lexical neighbours, I tried the neighbourhood density analysis a few different ways, following the instructions from http://corpustools.readthedocs.io/en/latest/neighborhood_density.html (Analysis > Calculate neighborhood density...)

With "String similarity algorithm" set to "Edit Distance", I then set the "Query" for "Calculate for a list of words" and selected the nonwords .txt file. From the dropdown, I specified that the "File contains Spelling". The other options, such as "Tier" and "Max distance/min similarity", were left as is. I then selected "Calculate neighborhood density (start new results table)". This resulted in a "ValueError: Words must be specified with at least a spelling or a transcription." (List of Errors attached as a .txt file below.)

I also tried changing the "Tier" to "Spelling", and this resulted in the same error.

One of the functions in the error message is named "ensure_query_is_word", so I thought it might be an issue that it is a whole list of nonwords. I tried to add nonwords one by one in the Neighbourhood Analysis window (Query > Calculate for a word/nonword not in corpus > Create word/nonword), and also in the main Phonological CorpusTools window (Corpus > Add new word...).

This crashed the program regardless of method and however many fields I filled in.

Please let me know if any of the above is confusing or if you have any questions or comments!

We would also be interested in knowing the possible neighbours and not just a number. Currently that can only be calculated for a single word in the corpus. Could it be calculated and output for a list of words?"
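
As an illustration of what the requested output could look like, here is a minimal sketch that returns the neighbours themselves for every word in a list. It generates all one-edit variants of each query and keeps those that occur in the lexicon. It assumes one character per segment and a maximum edit distance of 1, and the names `one_edit_variants`, `neighbours_for_list`, and the toy lexicon are hypothetical; this is not PCT's code.

```python
def one_edit_variants(word, alphabet):
    """Generate every string one insertion, substitution, or deletion away from word."""
    variants = set()
    for i in range(len(word) + 1):
        for symbol in alphabet:
            variants.add(word[:i] + symbol + word[i:])          # insertion
            if i < len(word):
                variants.add(word[:i] + symbol + word[i + 1:])  # substitution
        if i < len(word):
            variants.add(word[:i] + word[i + 1:])               # deletion
    variants.discard(word)  # the word is not its own neighbour
    return variants


def neighbours_for_list(queries, lexicon):
    """Map each query to the sorted lexicon entries exactly one edit away."""
    lexicon = set(lexicon)
    # Only symbols that occur in the lexicon can appear in a real neighbour.
    alphabet = {symbol for entry in lexicon for symbol in entry}
    return {q: sorted(one_edit_variants(q, alphabet) & lexicon) for q in queries}


# Hypothetical toy lexicon and query list.
toy_lexicon = ['mata', 'nat', 'tusa']
result = neighbours_for_list(['nata', 'blik'], toy_lexicon)
```

The length of each list is the neighbourhood density for that word, so a single results table could report both the count and the neighbours.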

PCT Error Messages for Nonword Neighbourhood Density.txt

Savoury - List of nonwords.txt