amir-zeldes / HebPipe

An NLP pipeline for Hebrew

IndexError: list index out of range #39

Closed menachemsperka closed 1 year ago

menachemsperka commented 1 year ago

Any idea why this may happen? It's working for some files, but for others I'm getting the "IndexError: list index out of range" error (using CPU, no GPU). Issues #12 and #13 address this and refer to the Java requirement; it seems that has since changed and Java is no longer a requirement, so I'm raising a new issue.

amir-zeldes commented 1 year ago

It's hard to say without the full stack trace - can you paste the entire error output here?

menachemsperka commented 1 year ago

full error output:


(heb_pipe_env) F:\nlp_project\HebPipe\hebpipe\heb_pipe_env\HebPipe\hebpipe>python heb_pipe.py  heb_file.txt --cpu
! You selected no processing options
! Assuming you want all processing steps

Running tasks:
====================
o Automatic sentence splitting (neural)
o Whitespace tokenization
o Morphological segmentation
o POS and Morphological tagging
o Lemmatization
o Dependency parsing
o Entity recognition
o Coreference resolution

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json: 216kB [00:00, 8.64MB/s]
Some weights of BertModel were not initialized from the model checkpoint at onlplab/alephbert-base and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of BertModel were not initialized from the model checkpoint at onlplab/alephbert-base and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using bos_token, but it is not set yet.
Using eos_token, but it is not set yet.
Processing heb_file.txt
F:\nlp_project\HebPipe\hebpipe\heb_pipe_env\lib\site-packages\sklearn\base.py:347: InconsistentVersionWarning: Trying to unpickle estimator LabelEncoder from version 0.23.2 when using version 1.3.0. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
  warnings.warn(
Traceback (most recent call last):
  File "heb_pipe.py", line 851, in <module>
    run_hebpipe()
  File "heb_pipe.py", line 828, in run_hebpipe
    processed = nlp(input_text, do_whitespace=opts.whitespace, do_tok=dotok, do_tag=opts.posmorph, do_lemma=opts.lemma,
  File "heb_pipe.py", line 678, in nlp
    entified = add_space_after(input_data,entified)
  File "F:\nlp_project\HebPipe\hebpipe\heb_pipe_env\HebPipe\hebpipe\lib\whitespace_tokenize.py", line 281, in add_space_after
    output = d.run_depedit(output,sent_text=True)
  File "F:\nlp_project\HebPipe\hebpipe\heb_pipe_env\lib\site-packages\depedit-3.2.1.0-py3.8.egg\depedit\depedit.py", line 1183, in run_depedit
IndexError: list index out of range
Elapsed time: 0:06:06.875
========================================
amir-zeldes commented 1 year ago

That looks like an input file issue, which lines up with the fact that other files run through for you. It seems to happen very late in the process, after NER, while the system is figuring out the alignment of the output with the whitespace in the original input. Is there any way you can share an example input file producing the error?
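
For illustration only (not HebPipe's actual add_space_after), a minimal sketch of that kind of realignment: each output token is located back in the raw input, and a token would get SpaceAfter=No when no whitespace follows it in the original text.

    def mark_space_after(original_text, tokens):
        # Scan the raw input left to right, locating each token in turn
        pairs = []
        pos = 0
        for tok in tokens:
            pos = original_text.index(tok, pos) + len(tok)
            # True if the original text has whitespace right after this token
            space_after = pos < len(original_text) and original_text[pos].isspace()
            pairs.append((tok, space_after))
        return pairs

    print(mark_space_after("a-b c", ["a", "-", "b", "c"]))
    # [('a', False), ('-', False), ('b', True), ('c', False)]

If a token cannot be located in the raw text (e.g. because segmentation altered the string), the lookup fails, which is roughly the class of alignment mismatch that can surface as an error at this stage.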

menachemsperka commented 1 year ago

Input file with the error:

f3.txt

I tried troubleshooting by passing the file in small bits; f3_2.txt has one sentence from the text in f3.txt that is failing. f3_2.txt
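
One way to automate that kind of bisection (a sketch, assuming heb_pipe.py exits nonzero on the crash; chunk.txt is a scratch file):

    import subprocess

    def find_failing_line(lines):
        # Run the pipeline on a chunk of lines; recurse into halves until one line remains
        with open("chunk.txt", "w", encoding="utf-8") as f:
            f.writelines(lines)
        ok = subprocess.run(["python", "heb_pipe.py", "chunk.txt", "--cpu"]).returncode == 0
        if ok:
            return None
        if len(lines) == 1:
            return lines[0]
        mid = len(lines) // 2
        # Returns None if the failure only appears when multiple lines are combined
        return find_failing_line(lines[:mid]) or find_failing_line(lines[mid:])

    # usage: find_failing_line(open("f3.txt", encoding="utf-8").readlines())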

Replacing CRLF line endings with LF did not change anything.
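
(For reference, a minimal way to do that normalization in Python, with an illustrative path:)

    # Rewrite the file with CRLF line endings normalized to LF
    with open("f3_2.txt", "rb") as f:
        data = f.read()
    with open("f3_2.txt", "wb") as f:
        f.write(data.replace(b"\r\n", b"\n"))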


(nlp_env) F:\nlp_project\HebPipe\hebpipe>python heb_pipe.py f3_2.txt --cpu
! You selected no processing options
! Assuming you want all processing steps

Running tasks:
====================
o Automatic sentence splitting (neural)
o Whitespace tokenization
o Morphological segmentation
o POS and Morphological tagging
o Lemmatization
o Dependency parsing
o Entity recognition
o Coreference resolution

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json: 216kB [00:00, 11.0MB/s]
Some weights of BertModel were not initialized from the model checkpoint at onlplab/alephbert-base and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of BertModel were not initialized from the model checkpoint at onlplab/alephbert-base and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using bos_token, but it is not set yet.
Using eos_token, but it is not set yet.
Processing f3_2.txt
C:\Users\msperka\AppData\Local\anaconda3\envs\nlp_env\lib\site-packages\sklearn\base.py:324: UserWarning: Trying to unpickle estimator LabelEncoder from version 0.23.2 when using version 1.0.1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations
  warnings.warn(
Traceback (most recent call last):
  File "heb_pipe.py", line 851, in <module>
    run_hebpipe()
  File "heb_pipe.py", line 828, in run_hebpipe
    processed = nlp(input_text, do_whitespace=opts.whitespace, do_tok=dotok, do_tag=opts.posmorph, do_lemma=opts.lemma,
  File "heb_pipe.py", line 678, in nlp
    entified = add_space_after(input_data,entified)
  File "F:\nlp_project\HebPipe\hebpipe\lib\whitespace_tokenize.py", line 281, in add_space_after
    output = d.run_depedit(output,sent_text=True)
  File "C:\Users\msperka\AppData\Local\anaconda3\envs\nlp_env\lib\site-packages\depedit\depedit.py", line 1183, in run_depedit
    args += (cols[8], cols[9])
IndexError: list index out of range
Elapsed time: 0:00:18.031
========================================
menachemsperka commented 1 year ago

8_83811.txt

Trying to troubleshoot: as shown in the error message, it's failing in

 File "C:\Users\msperka\AppData\Local\anaconda3\envs\nlp_env\lib\site-packages\depedit\depedit.py", line 1183, in run_depedit
    args += (cols[8], cols[9])

I added code to export the failing "cols" to a txt file; see the code below and the attached file.

                if len(cols) > 8:
                    # Collect token from line; note that head2 is parsed as a string, often "_" for monoplanar trees
                    try:
                        args += (cols[8], cols[9])
                    except IndexError:
                        # Debug addition: dump the failing cols to a file (needs `import random` at the top of depedit.py)
                        with open(r"F:\nlp_project\HebPipe\hebpipe\%s_%s.txt" % (i, random.randint(1, 100000)), 'w', encoding='utf-8') as f:
                            f.write(f"{len(cols)}\n")
                            for item in cols:
                                f.write(f"{item}\n")
                else:  # Attempt to read as 8 column Malt input
                    args += (cols[6], cols[7])
                    self.input_mode = "8col"

cols has length 9, so cols[9] fails.

Would it make sense to change the condition to: if len(cols) > 9: ?
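
A quick illustration of the off-by-one (assuming cols comes from splitting a tab-separated line): a 9-column line passes the len(cols) > 8 check, but its last valid index is 8.

    cols = "1\tword\tlemma\tUPOS\tXPOS\tfeats\thead\tdeprel\tdeps".split("\t")
    print(len(cols))   # 9
    print(cols[8])     # 'deps' - last valid index
    print(cols[9])     # IndexError: list index out of range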

amir-zeldes commented 1 year ago

OK, I figured it out, thanks. It was a bug in an underlying library. Please upgrade to the latest version and this should be resolved.
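
(In practice that means upgrading the packages, e.g., assuming the standard PyPI names, with depedit being the underlying library from the traceback:)

    pip install --upgrade depedit hebpipe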