Grouping fails for tokens of length 1

FredericBlum commented 8 months ago

If the dataset has some entries that consist of only one sound (as is common with affixes and sometimes verbs), the grouping.py script will fail. While we cannot, of course, group those sounds, we should probably add an exception so that the code does not fail in those cases.

LinguList commented 8 months ago

Can you please specify the failure?

FredericBlum commented 7 months ago

(predict) blum@lingn45 orthography % python predict.py
Traceback (most recent call last):
  File "/Users/blum/Projects/cognate_prediction/orthography/predict.py", line 83, in <module>
    D[i] = [wl[idx, "concept"], "x", tokens_, "".join(segment(tokens_, prf))]
                                                      ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/blum/Projects/cognate_prediction/orthography/predict.py", line 66, in segment
    segmented = profile_sequence(normalize("NFC", space.join(sequence)), segments).split(" ")
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/blum/Projects/cognate_prediction/orthography/predict.py", line 49, in profile_sequence
    return ' '.join(sorted(out, key=lambda x: len([y for y in x if y[0] !=
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/blum/Projects/cognate_prediction/orthography/predict.py", line 49, in <lambda>
    return ' '.join(sorted(out, key=lambda x: len([y for y in x if y[0] !=
                                                  ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/blum/Projects/cognate_prediction/orthography/predict.py", line 49, in <listcomp>
    return ' '.join(sorted(out, key=lambda x: len([y for y in x if y[0] !=
                                                                   ~^^^
IndexError: string index out of range

My solution is to replace

for idx, tokens_ in wl.iter_rows("tokens"):

with

for idx, tokens_ in wl.iter_rows("tokens"):
    if len(tokens_) > 1:

LinguList commented 7 months ago

But I think this is too much.

LinguList commented 7 months ago

if len(tokens_) > 0:

LinguList commented 7 months ago

Because y[0] is causing the error, there must be an empty string in your data, and I'd now argue that this is an error in your profile.

FredericBlum commented 7 months ago

The empty string is not in the data; it gets created while the sequence is being profiled. These are the parts of the code that probably cause the problem:

queue = [([''], 0, string)]
# gets iterated through:
if len(rest) > 1:
    ...
else:
    seqA = current_sequence[:-1]+[combined_element]
    seqB = current_sequence + [next_element]

    if not [x for x in seqA if (x not in segments and len(x) > 1)]:
        out += [seqA]
    if not [x for x in seqB if (x not in segments and len(x) > 1)]:
        out += [seqB]

The len(x) > 1 condition causes the code to fail: for the queue above, out ends up as follows:

1599 e  # input
[['e'], ['', 'e']]  # out

The problem is not empty strings in the data, but strings of length 1. Hence the > 1 condition.
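
For reference, the failure can probably be reproduced in isolation from the out value shown above. This is only a sketch: the traceback cuts the sort key off after `!=`, so the helper name `count_unmarked` and the marker `"<"` are assumptions made for illustration, not the actual code in predict.py.

```python
# Candidate segmentations as reported above for the input "e".
out = [["e"], ["", "e"]]


def count_unmarked(seq, marker="<"):
    # Hypothetical stand-in for the truncated lambda in the traceback:
    # y[0] raises IndexError as soon as y is the empty string.
    return len([y for y in seq if y[0] != marker])


try:
    sorted(out, key=count_unmarked)
    error_message = None
except IndexError as err:
    error_message = str(err)

print(error_message)
```

The first candidate, ['e'], is harmless. The second one starts with the empty string that the queue was seeded with, and indexing into it raises the error.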

LinguList commented 7 months ago

Okay, can you please provide a minimal example now, with input to the function, so that I can replicate the error?

LinguList commented 7 months ago

Because this points to a general bug that we must fix.

FredericBlum commented 7 months ago

Can you give me the permissions to set up a PR? Or should I commit directly to main?

LinguList commented 7 months ago

But the fix you have there is no fix; it only deals with the symptoms, so I'd prefer to understand where the function I wrote fails. And if you are right here, then it is the function that is the problem.

LinguList commented 7 months ago

I just gave you access, but I hope we agree to fix the function, not the wordlist iteration?

LinguList commented 7 months ago

A fix is this one:

def profile_sequence(string, segments, maxlen=None):

    if len(string) <= 1:
        return string

    max_len = maxlen or max([len(x) for x in segments])
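
A self-contained sketch of how this guard behaves is given below. The name `profile_sequence_guarded` is hypothetical, and the body after the guard is a plain greedy longest-match loop swapped in as a placeholder; it is not the queue-based search of the real function and may segment differently.

```python
def profile_sequence_guarded(string, segments, maxlen=None):
    """Hypothetical stand-in for profile_sequence with the early return."""
    # Guard from the fix above: zero or one character cannot be grouped.
    if len(string) <= 1:
        return string

    max_len = maxlen or max(len(x) for x in segments)

    # Placeholder body (assumption): greedy longest match, NOT the
    # queue-based search used in the actual function.
    out, i = [], 0
    while i < len(string):
        for size in range(min(max_len, len(string) - i), 0, -1):
            chunk = string[i:i + size]
            # Fall back to a single character if nothing longer matches.
            if chunk in segments or size == 1:
                out.append(chunk)
                i += size
                break
    return " ".join(out)


print(profile_sequence_guarded("e", {"t", "s", "ts", "a"}))
print(profile_sequence_guarded("tsa", {"t", "s", "ts", "a"}))
```

Because the guard sits before the max() call, a one-character input returns unchanged even when segments is empty.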

LinguList commented 7 months ago

But the function probably needs to be checked. I guess it is better to use my JavaScript and translate from there.

LinguList commented 7 months ago

The good thing is: until the paper goes out, we can update.

calc-project / grouping-sounds

Grouping fails for tokens of length 1 #2