Alex-Fabbri / Multi-News

Large-scale multi-document summarization dataset and code

Error reading .pt file after preprocessing #37

Closed phamkhactu closed 2 years ago

phamkhactu commented 2 years ago

Thanks for the great repo, @Alex-Fabbri. I followed the readme.txt in Hi-Map:

  1. I ran run_prep_newser, which produced a list of .pt files along with newser_sents.vocab.pt.
  2. I ran run_inference_newser.sh, but I got this error:
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte

    I found that the error is raised by the function that reads data from the newser_sent_500/newser_sents.vocab.pt file:

import codecs

# Opens the file in text mode and decodes every line as UTF-8.
def make_text_iterator_from_file(path):
    with codecs.open(path, "r", "utf-8") as corpus_file:
        for line in corpus_file:
            yield line

file: code/Hi_MAP/onmt/inputters/text_dataset.py
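
For reference, the same kind of error can be reproduced outside the repo. The sketch below is only an illustration, assuming the .pt file is a binary serialized object rather than a plain-text corpus. It uses the standard pickle module as a stand-in for the real preprocessing output, and the helper names read_as_text, read_as_pickle and demo are hypothetical. A pickle written with protocol 2 or higher begins with the byte 0x80, which is not a valid UTF-8 start byte, so a text reader fails on it.

```python
import codecs
import os
import pickle
import tempfile


def read_as_text(path):
    """Read the file the way the text iterator does: decode it as UTF-8."""
    with codecs.open(path, "r", "utf-8") as corpus_file:
        return list(corpus_file)


def read_as_pickle(path):
    """Read the same file as binary serialized data."""
    with open(path, "rb") as binary_file:
        return pickle.load(binary_file)


def demo():
    # Hypothetical stand-in for a serialized vocab file (not the real one).
    fd, path = tempfile.mkstemp(suffix=".pt")
    os.close(fd)
    try:
        with open(path, "wb") as binary_file:
            pickle.dump({"src": ["<unk>", "<pad>"]}, binary_file, protocol=2)
        try:
            read_as_text(path)
            text_error = None
        except UnicodeDecodeError as exc:
            text_error = exc
        data = read_as_pickle(path)
    finally:
        os.remove(path)
    return text_error, data


text_error, data = demo()
print(type(text_error).__name__, text_error)
print(data)
```

If this assumption holds, the inference script may be passing the vocab file to an option that expects a plain-text file, but I have not confirmed this.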

I am using the raw data from "Raw data -- zipped". My versions: torch 1.8.0, torchtext 0.9.0, CUDA 11.1. Many thanks for your help!