An NLP pipeline for Hebrew
Getting an error only when running on a list of files. When running on that same file alone, it runs OK.

(nlp_env) F:\nlp_project\HebPipe\hebpipe>python  "F:\nlp_project\responsa_texts\all files\all files\*.txt"  --dirout "F:\nlp_project\responsa_texts\hebpipe_output\all files"  --cpu
! You selected no processing options
! Assuming you want all processing steps

Running tasks:
o Automatic sentence splitting (neural)
o Whitespace tokenization
o Morphological segmentation
o POS and Morphological tagging
o Lemmatization
o Dependency parsing
o Entity recognition
o Coreference resolution

Downloading 216kB [00:00, ?B/s]
Processing שו ת אבני נזר חלק אה ע סימן א.txt
Processing שו ת אבני נזר חלק אה ע סימן ב.txt
Processing שו ת אבני נזר חלק אה ע סימן ג.txt
Processing שו ת אבני נזר חלק אה ע סימן ד.txt
Processing שו ת אבני נזר חלק אה ע סימן ה.txt
Processing שו ת אבני נזר חלק אה ע סימן ו.txt
Processing שו ת אבני נזר חלק אה ע סימן ז.txt
Processing שו ת אבני נזר חלק אה ע סימן ח.txt
Processing שו ת אבני נזר חלק אה ע סימן ט.txt
Traceback (most recent call last):
  File "", line 851, in <module>
  File "", line 828, in run_hebpipe
    processed = nlp(input_text, do_whitespace=opts.whitespace, do_tok=dotok, do_tag=opts.posmorph, do_lemma=opts.lemma,
  File "", line 613, in nlp
    tagged_conllu, tokenized, morphs, words = mtltagger.predict(tokenized,sent_tag=sent_tag,checkpointfile=model_dir + '')
  File "F:\nlp_project\HebPipe\hebpipe\lib\", line 1273, in predict
    split_indices, pos_tags, morphs, words = self.inference(no_pos_lemma,sent_tag=sent_tag,checkpointfile=checkpointfile)
  File "F:\nlp_project\HebPipe\hebpipe\lib\", line 1015, in inference
    for i in range(0, len(preds)):
TypeError: object of type 'int' has no len()
Elapsed time: 0:57:44.609

input file: שו ת אבני נזר חלק אה ע סימן ט.txt

Also attached is the output from running on this file alone.


amir-zeldes commented 1 year ago

Hm, I can't actually reproduce this, even using your file and another random file: it runs on the single file and it runs in batch mode. There is a small chance that the bugfix for the other issue somehow resolved it, but I don't think so. If you are still running into this problem, let me know and we can try to figure it out. BTW, your files are not Unicode (UTF-8), which could lead to unexpected prediction errors, since the models all expect UTF-8.
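
In case it helps with the encoding point, here is a rough sketch of how the input files could be checked and re-encoded as UTF-8 before running the pipeline. This is not part of HebPipe. The helper names `is_utf8` and `convert_to_utf8` are made up for illustration, and the default source encoding `cp1255` (Windows Hebrew) is only a guess, since the thread does not say what encoding the files actually use.

```python
from pathlib import Path


def is_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def convert_to_utf8(src, dst, source_encoding="cp1255"):
    """Re-encode src as UTF-8 and write the result to dst.

    source_encoding is an assumption (Windows Hebrew); change it if the
    files were saved in a different legacy encoding.
    """
    text = Path(src).read_bytes().decode(source_encoding)
    # Write bytes directly so the original line endings are preserved.
    Path(dst).write_bytes(text.encode("utf-8"))
```

Writing the converted copies to a separate folder and spot-checking a few of them in an editor is safer than overwriting the originals, because a wrong guess for `source_encoding` can produce garbled text without raising any error.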