Open FredericBlum opened 8 months ago
Can you please specify the failure?
(predict) blum@lingn45 orthography % python predict.py
Traceback (most recent call last):
File "/Users/blum/Projects/cognate_prediction/orthography/predict.py", line 83, in <module>
D[i] = [wl[idx, "concept"], "x", tokens_, "".join(segment(tokens_, prf))]
^^^^^^^^^^^^^^^^^^^^^
File "/Users/blum/Projects/cognate_prediction/orthography/predict.py", line 66, in segment
segmented = profile_sequence(normalize("NFC", space.join(sequence)), segments).split(" ")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/blum/Projects/cognate_prediction/orthography/predict.py", line 49, in profile_sequence
return ' '.join(sorted(out, key=lambda x: len([y for y in x if y[0] !=
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/blum/Projects/cognate_prediction/orthography/predict.py", line 49, in <lambda>
return ' '.join(sorted(out, key=lambda x: len([y for y in x if y[0] !=
^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/blum/Projects/cognate_prediction/orthography/predict.py", line 49, in <listcomp>
return ' '.join(sorted(out, key=lambda x: len([y for y in x if y[0] !=
~^^^
IndexError: string index out of range
My solution is to replace
for idx, tokens_ in wl.iter_rows("tokens"):
with
for idx, tokens_ in wl.iter_rows("tokens"):
    if len(tokens_) > 1:
But I think this is too strict. The following should be enough:
    if len(tokens_) > 0:
This is because y[0] is what raises the error, so there must be an empty string in your data, and I'd now argue that this is an error in your profile.
The empty string is not in the data, but gets created during the profile sequence. The parts of the code that probably cause the problem:
queue = [([''], 0, string)]
# gets iterated through:
if len(rest) > 1:
    ...
else:
    seqA = current_sequence[:-1] + [combined_element]
    seqB = current_sequence + [next_element]
    if not [x for x in seqA if (x not in segments and len(x) > 1)]:
        out += [seqA]
    if not [x for x in seqB if (x not in segments and len(x) > 1)]:
        out += [seqB]
The len(x) > 1 condition causes the code to fail, and out becomes the following output, as created for queue above:
1599 e # input
[['e'], ['', 'e']]  # out
The problems are not empty strings, but strings of length 1. Hence the >1 condition.
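To make the failure above easy to replicate without the full script, here is a minimal sketch that feeds the reported value of out into a sort key shaped like the one in the traceback. The comparison value in the traceback is cut off, so MARKER below is a hypothetical placeholder, and count_unmarked and count_unmarked_safe are illustrative names, not functions from predict.py.

```python
# Candidate segmentations reported in the thread for the single-sound input "e".
out = [["e"], ["", "e"]]

# Hypothetical placeholder: the real comparison value is truncated in the traceback.
MARKER = "<"


def count_unmarked(candidate):
    # Same shape as the failing key: y[0] raises IndexError when y is "".
    return len([y for y in candidate if y[0] != MARKER])


def count_unmarked_safe(candidate):
    # Skips empty strings before indexing, so the key itself cannot raise.
    return len([y for y in candidate if y and y[0] != MARKER])


try:
    sorted(out, key=count_unmarked)
    raised = False
except IndexError:
    raised = True

print("IndexError raised:", raised)
print(sorted(out, key=count_unmarked_safe))
```

Skipping empty strings in the key only hides the crash. The candidate ['', 'e'] still carries the empty element that the initial queue entry introduced, so the underlying question of why it ends up in out remains open.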
Okay, can you please provide a minimal example now, with input to the function, so that I can replicate the error?
Because this points to a general bug that we must fix.
Can you give me the permissions to set up a PR? Or should I commit directly to main?
But the fix you have there is no fix; it only deals with symptoms, so I'd prefer to understand where the function I wrote fails. If you are right here, it is the function that is the problem.
I just gave you access, but I hope we agree to fix the function, not the wordlist iteration?
A fix is this one:
def profile_sequence(string, segments, maxlen=None):
    if len(string) <= 1:
        return string
    max_len = maxlen or max([len(x) for x in segments])
But the function should probably be checked. I guess it is better to use my JavaScript and translate from there.
Good thing is: until the paper goes out, we can update.
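The early return proposed above can be tried out in isolation. The sketch below wraps a segmenter with the same length check; since the rest of profile_sequence is not shown in this thread, toy_segmenter is a hypothetical greedy longest-match stand-in, and guard_short_input is an illustrative name. Neither is the project's actual code.

```python
def guard_short_input(segmenter):
    """Wrap a segmenter so that inputs of length <= 1 are returned unchanged."""
    def wrapped(string, segments, maxlen=None):
        if len(string) <= 1:
            # Nothing to segment: return the input before any further work.
            return string
        return segmenter(string, segments, maxlen)
    return wrapped


def toy_segmenter(string, segments, maxlen=None):
    # Hypothetical stand-in for profile_sequence: greedy longest match,
    # falling back to single characters that are not in the profile.
    max_len = maxlen or max(len(x) for x in segments)
    out, i = [], 0
    while i < len(string):
        for size in range(min(max_len, len(string) - i), 0, -1):
            chunk = string[i:i + size]
            if chunk in segments or size == 1:
                out.append(chunk)
                i += size
                break
    return " ".join(out)


safe_segmenter = guard_short_input(toy_segmenter)

print(safe_segmenter("e", {"th", "a"}))    # single sound passes through
print(safe_segmenter("tha", {"th", "a"}))  # longer input is segmented
```

Keeping the check inside (or directly around) the segmenting function, rather than in the wordlist loop, means every caller gets the same behaviour for single-sound entries.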
If the dataset has some entries that consist of only one sound (as is common with affixes and sometimes verbs), grouping.py will fail. While we cannot, of course, group those sounds, we should probably build an exception so that the code does not fail in those cases.