explosion / sense2vec

🦆 Contextually-keyed word vectors
https://explosion.ai/blog/sense2vec-reloaded
MIT License

error on 01_parse.py #128

Closed: danielmoore19 closed this issue 3 years ago

danielmoore19 commented 3 years ago

UnboundLocalError: local variable 'output_file' referenced before assignment

Traceback (most recent call last):
  File "scripts/01_parse.py", line 61, in <module>
    plac.call(main)
  File "/usr/local/lib/python3.6/dist-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/usr/local/lib/python3.6/dist-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "scripts/01_parse.py", line 53, in main
    with output_file.open("wb") as f:

When I looked at the code in 01_parse.py, it appears that output_file is not always created:

        for doc in tqdm.tqdm(docs, desc="Docs", unit=""):
            if count < max_docs:
                doc_bin.add(doc)
                count += 1
            else:
                batch_num += 1
                count = 0
                msg.good(f"Processed {len(doc_bin)} docs")
                doc_bin_bytes = doc_bin.to_bytes()
                output_file = output_path / f"{input_path.stem}-{batch_num}.spacy"
                with output_file.open("wb") as f:
                    f.write(doc_bin_bytes)
                msg.good(f"Saved parsed docs to file", output_file.resolve())
                doc_bin = DocBin(attrs=["POS", "TAG", "DEP", "ENT_TYPE", "ENT_IOB"])
        with output_file.open("wb") as f:

So if the doc count is lower than the max_docs setting, output_file is never created. Obviously it's simple to reduce the max_docs setting and force the else branch, but it would seem output_file should always be created, correct?
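The failure mode above can be reproduced in isolation. The sketch below uses hypothetical names (save_batches, batch) rather than the actual script, but it mirrors the structure: output_file is only bound inside the else branch, so if the loop never enters that branch, the final `with` raises UnboundLocalError.

```python
def save_batches(docs, max_docs):
    """Minimal sketch of the buggy pattern; hypothetical names, not the sense2vec code."""
    count = 0
    batch = []
    for doc in docs:
        if count < max_docs:
            batch.append(doc)
            count += 1
        else:
            # output_file is ONLY bound on this branch
            output_file = f"batch-{count}.spacy"
            batch = []
            count = 0
    # If the else branch never ran, output_file was never assigned:
    with open(output_file, "w") as f:  # UnboundLocalError here
        f.write(str(batch))

try:
    save_batches(["a", "b"], max_docs=10)  # fewer docs than max_docs
except UnboundLocalError as e:
    print("reproduced:", e)
```

With fewer docs than max_docs, the else branch never runs and the final `with` fails exactly as in the traceback above.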

ericfeunekes commented 3 years ago

I noticed this too. I think the issue is actually that the final `with output_file.open(...)` comes too early: it opens `output_file` before the new batch path is assigned. Currently it's written as

with input_path.open("r", encoding="utf8") as texts:
        docs = nlp.pipe(texts, n_process=n_process)
        for doc in tqdm.tqdm(docs, desc="Docs", unit=""):
            if count < max_docs:
                doc_bin.add(doc)
                count += 1
            else:
                batch_num += 1
                count = 0
                msg.good(f"Processed {len(doc_bin)} docs")
                doc_bin_bytes = doc_bin.to_bytes()
                output_file = output_path / f"{input_path.stem}-{batch_num}.spacy"
                with output_file.open("wb") as f:
                    f.write(doc_bin_bytes)
                msg.good(f"Saved parsed docs to file", output_file.resolve())
                doc_bin = DocBin(attrs=["POS", "TAG", "DEP", "ENT_TYPE", "ENT_IOB"])
        with output_file.open("wb") as f:
            batch_num += 1
            output_file = output_path / f"{input_path.stem}-{batch_num}.spacy"
            doc_bin_bytes = doc_bin.to_bytes()
            f.write(doc_bin_bytes)
            msg.good(
                f"Complete. Saved final parsed docs to file", output_file.resolve()
            )

Where it should be

                with output_file.open("wb") as f:
                    f.write(doc_bin_bytes)
                msg.good(f"Saved parsed docs to file", output_file.resolve())
                doc_bin = DocBin(attrs=["POS", "TAG", "DEP", "ENT_TYPE", "ENT_IOB"])
        batch_num += 1
        output_file = output_path / f"{input_path.stem}-{batch_num}.spacy"
        doc_bin_bytes = doc_bin.to_bytes()
        with output_file.open("wb") as f:
            f.write(doc_bin_bytes)
        msg.good(f"Complete. Saved final parsed docs to file", output_file.resolve())

The current code actually won't save the last output_file regardless of the doc count, because the last output file is never opened: that final `with output_file` either opens the second-to-last output_file or raises the UnboundLocalError. I'm submitting a pull request to fix it.
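The essence of the fix is that the final flush happens unconditionally after the loop, with the path assigned before the file is opened. A minimal sketch of that corrected batching pattern, using hypothetical names (batch_docs, saved) and lists standing in for files on disk:

```python
def batch_docs(docs, max_docs):
    """Sketch of the corrected pattern: flush full batches in the loop,
    then always flush the final (possibly partial) batch afterwards."""
    saved = []            # stands in for files written to disk
    batch, count = [], 0
    for doc in docs:
        if count < max_docs:
            batch.append(doc)
            count += 1
        else:
            saved.append(batch)      # flush a full batch
            batch, count = [doc], 1  # start the next batch with this doc
    saved.append(batch)              # final flush: always runs, as in the fix
    return saved

print(batch_docs(list(range(5)), max_docs=10))  # [[0, 1, 2, 3, 4]]
```

Even when the doc count never reaches max_docs, the trailing flush still writes the one (partial) batch, so no docs are silently dropped.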

danielmoore19 commented 3 years ago

This solves the issue. Follow ericfeunekes's post.