explosion / sense2vec

🦆 Contextually-keyed word vectors
https://explosion.ai/blog/sense2vec-reloaded
MIT License

Help Re-writing 04_fasttext_train_vectors.py for Windows 10 Compatibility #105

Closed dshefman1 closed 4 years ago

dshefman1 commented 4 years ago

The two os.system(cmd) portions of the 04_fasttext_train_vectors.py script, on lines 61-67 and lines 75-81, do not work for Windows users. So I rewrote lines 61-67 using the fastText word representations documentation. However, lines 75-81 are proving more difficult to rewrite because I don't know the structure of the vocab_file output file created by lines 76-78, included below. The third-to-last line of my code below uses the save_model function to save the model to a binary file for later loading, as shown here. However, this is not the input file format expected in 05_export.py. Could you please provide a sample of what the vocab_file output file looks like? Or, better yet, do you have any suggestions for replacing lines 75-78 in a way that doesn't involve os.system, CLI code, or the fasttext_bin?

Lines 76-78 of 04_fasttext_train_vectors.py:

vocab_file = output_path / "vocab.txt"
cmd = f"{fasttext_bin} dump {output_file.with_suffix('.bin')} dict > {vocab_file}"
print(cmd)
vocab_cmd = os.system(cmd)
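The command above relies on the shell that os.system invokes to handle the `>` redirection. One way to avoid depending on shell behavior, sketched here under the assumption that a fastText binary is still available, is to run the command with subprocess.run and send its stdout straight into the output file. The helper name `dump_to_file` is hypothetical and not part of sense2vec.

```python
import subprocess
from pathlib import Path


def dump_to_file(args, out_file):
    """Run a command without a shell and write its stdout to out_file.

    Returns the exit code of the process. `args` is a list of arguments,
    for example [fasttext_bin, "dump", model_path, "dict"] (the invocation
    is taken from the command string above; it is not verified here).
    """
    with Path(out_file).open("wb") as f:
        # No shell is involved, so no ">" redirection is needed
        result = subprocess.run([str(a) for a in args], stdout=f)
    return result.returncode


# Hypothetical usage, mirroring the original command:
# dump_to_file([fasttext_bin, "dump", output_file.with_suffix(".bin"), "dict"], vocab_file)
```

Passing the arguments as a list also sidesteps quoting problems with paths that contain spaces.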

Here is the code I used in place of 04_fasttext_train_vectors.py to make it Windows compatible:

from pathlib import Path
from wasabi import msg
import fasttext

in_dir = "./corpus_parsed3"
out_dir = "./fasttext_model3"
n_threads = 26
min_count = 50
vector_size = 300
verbose = 2

input_path = Path(in_dir)
output_path = Path(out_dir)
if not input_path.exists() or not input_path.is_dir():
    msg.fail("Not a valid input directory", in_dir, exits=1)
if not output_path.exists():
    output_path.mkdir(parents=True)
    msg.good(f"Created output directory {out_dir}")
output_file = output_path / f"vectors_w2v_{vector_size}dim.bin"

# fastText expects only one input file and only reads from disk and not
# stdin, so we need to create a temporary file that concatenates the inputs
tmp_path = input_path / "s2v_input.tmp"
input_files = [p for p in input_path.iterdir() if p.suffix == ".s2v"]
if not input_files:
    msg.fail("Input directory contains no .s2v files", in_dir, exits=1)
with tmp_path.open("a", encoding="utf8") as tmp_file:
    for input_file in input_files:
        with input_file.open("r", encoding="utf-8") as f:
            tmp_file.write(f.read())
msg.info("Created temporary merged input file", tmp_path)

# Reuse the Path objects defined above instead of rebuilding the paths by hand
sense2vec_model = fasttext.train_unsupervised(
    str(tmp_path), thread=n_threads, epoch=5, dim=vector_size,
    minn=0, maxn=0, minCount=min_count, verbose=verbose,
)
sense2vec_model.save_model(str(output_file))

tmp_path.unlink()
msg.good("Deleted temporary input file", tmp_path)

dshefman1 commented 4 years ago

I was able to solve the problem by inferring the format of both vectors.txt and vocab.txt from the _get_shape and read_vocab functions in the 05_export.py script. Also, in the 05_export.py script I had to change line 15 to first_line = next(file_).replace('\ufeff','').split() because of the UTF-8 BOM signature that Windows includes at the beginning of UTF-8 text docs.
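An alternative to stripping '\ufeff' by hand, sketched under the assumption that the file is UTF-8 with an optional BOM: Python's built-in "utf-8-sig" codec drops a leading BOM when reading and otherwise behaves like plain UTF-8. The `read_header` helper is hypothetical; it only mirrors the first-line parsing described above and is not the code in 05_export.py.

```python
from pathlib import Path


def read_header(path):
    """Return the whitespace-split first line of a text file, ignoring a
    leading UTF-8 BOM if one is present."""
    with Path(path).open("r", encoding="utf-8-sig") as file_:
        return next(file_).split()
```

Because the codec only acts on a BOM at the very start of the file, the same call works for files written on other platforms without one.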

Here is a complete Windows 10 compatible 04_fasttext_train_vectors.py script:

from pathlib import Path
from wasabi import msg
import fasttext

in_dir = "./corpus_parsed3"
out_dir = "./fasttext_model3"
n_threads = 26
min_count = 50
vector_size = 300
verbose = 2

input_path = Path(in_dir)
output_path = Path(out_dir)
if not input_path.exists() or not input_path.is_dir():
    msg.fail("Not a valid input directory", in_dir, exits=1)
if not output_path.exists():
    output_path.mkdir(parents=True)
    msg.good(f"Created output directory {out_dir}")
output_file = output_path / f"vectors_w2v_{vector_size}dim.bin"

# fastText expects only one input file and only reads from disk and not
# stdin, so we need to create a temporary file that concatenates the inputs
tmp_path = input_path / "s2v_input.tmp"
input_files = [p for p in input_path.iterdir() if p.suffix == ".s2v"]
if not input_files:
    msg.fail("Input directory contains no .s2v files", in_dir, exits=1)
with tmp_path.open("a", encoding="utf8") as tmp_file:
    for input_file in input_files:
        with input_file.open("r", encoding="utf-8") as f:
            tmp_file.write(f.read())
msg.info("Created temporary merged input file", tmp_path)

# Reuse the Path objects defined above instead of rebuilding the paths by hand
sense2vec_model = fasttext.train_unsupervised(
    str(tmp_path), thread=n_threads, epoch=5, dim=vector_size,
    minn=0, maxn=0, minCount=min_count, verbose=verbose,
)
# Saving the binary model is optional; 05_export.py does not read it
# sense2vec_model.save_model(str(output_file))

tmp_path.unlink()
msg.good("Deleted temporary input file", tmp_path)

# Write vocab.txt as "<word> <frequency> word" lines, the format inferred
# from read_vocab in 05_export.py
words, freqs = sense2vec_model.get_words(include_freq=True)
vocab_file = output_path / "vocab.txt"
with vocab_file.open("w", encoding="utf-8") as f:
    for word, freq in zip(words, freqs):
        f.write(f"{word} {freq} word\n")

# https://stackoverflow.com/questions/58337469/how-to-save-fasttext-model-in-vec-format
# Write vectors.txt: a "<number of words> <dimension>" header, then one
# "<word> <v1> ... <vn>" line per word
vectors_file = output_path / "vectors.txt"
with vectors_file.open("w", encoding="utf-8") as file_out:
    file_out.write(f"{len(words)} {sense2vec_model.get_dimension()}\n")
    for w in words:
        v = sense2vec_model.get_word_vector(w)
        vstr = " ".join(str(vi) for vi in v)
        try:
            file_out.write(f"{w} {vstr}\n")
        except UnicodeEncodeError:
            # Skip words that cannot be encoded as UTF-8; note that the
            # header count above is not adjusted for skipped words
            continue

svlandeg commented 4 years ago

@dshefman1 : thanks for this! Do you think it's possible to have one version of the script that works across all platforms? It would be great to incorporate your changes into the source here, so others don't run into the same problems. Would you feel like contributing a PR?

Z-e-e commented 4 years ago

@dshefman1 : could you please elaborate on the in_dir, when referring to the dir where the .s2v file is saved, I get a "SystemExit: 1" and "✘ Not a valid input directory".

Using Jupyter.

dshefman1 commented 4 years ago

@svlandeg

thanks for this! Do you think it's possible to have one version of the script that works across all platforms?

I do think it is possible.

It would be great to incorporate your changes into the source here, so others don't run into the same problems. Would you feel like contributing a PR?

I'd be happy to contribute a PR.

dshefman1 commented 4 years ago

could you please elaborate on the in_dir, when referring to the dir where the .s2v file is saved, I get a "SystemExit: 1" and "✘ Not a valid input directory".

@Z-e-e As per the docstring in the original script, it expects a "directory of preprocessed .s2v input files, will concatenate them (using a temporary file on disk) and will use fastText to train a word2vec model."

You may be making the same mistake I made the first time I ran the script, which is to provide a string reference to the filepath of the .s2v file. Instead, you are expected to provide a string reference to the directory, which in my case is "./corpus_parsed3", but it's probably named something else on your machine.
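The directory-versus-file mix-up described above can be caught with a more specific error message. The following is a minimal sketch, not code from the sense2vec scripts: `find_s2v_files` and `merge_files` are hypothetical helpers, and the merge step streams the files in chunks as an assumed alternative to reading each file fully into memory.

```python
import shutil
from pathlib import Path


def find_s2v_files(in_dir):
    """Return the sorted .s2v files in in_dir, with an explicit error when
    a file path is passed instead of a directory."""
    input_path = Path(in_dir)
    if input_path.is_file():
        raise ValueError(
            f"{in_dir} is a file; pass its parent directory instead: "
            f"{input_path.parent}"
        )
    if not input_path.is_dir():
        raise ValueError(f"Not a valid input directory: {in_dir}")
    files = sorted(p for p in input_path.iterdir() if p.suffix == ".s2v")
    if not files:
        raise ValueError(f"Input directory contains no .s2v files: {in_dir}")
    return files


def merge_files(files, merged_path):
    """Concatenate files into merged_path chunk by chunk, overwriting any
    leftover file from an earlier run."""
    with Path(merged_path).open("wb") as out:
        for path in files:
            with Path(path).open("rb") as f:
                shutil.copyfileobj(f, out)
```

In a Jupyter session, raising ValueError instead of exiting also avoids the bare "SystemExit: 1" and leaves the hint in the traceback.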

svlandeg commented 4 years ago

I'd be happy to contribute a PR.

Awesome. Reopening this to track the progress.

dshefman1 commented 4 years ago

@svlandeg I'm struggling to finalize my code to resolve Issue 105. The problem I am having is that I still don't quite understand the outputs from the CLI commands in 04_fasttext_train_vectors.py. For example, line 63 or Line 76 appears to create a .bin file of the FastText model. However, the 05_export.py script does not take this file as an input. So, I'm not sure if the purpose of creating the model.bin file is just for the purpose of creating the "vocab.txt" file on line 76. If so, then I plan to make saving the model to disk as an option, but not necessary. However, if the model.bin file is for another purpose related to the 05_export.py script please let me know.

Also, 05_export.py expects a "vectors.txt" file as an input. However, the 04_fasttext_train_vectors.py script does not explicitly create a "vectors.txt" file in the same way it explicitly creates vocab.txt on lines 75-77. I'm assuming that the "vectors.txt" file is created on lines 61-64. Am I correct? If not, could you help me understand at what point the "vectors.txt" file is created? This would help me ensure that I don't cause a new problem while fixing the Windows compatibility problem.

svlandeg commented 4 years ago

I'll have a detailed look tomorrow! [EDIT: update, sorry, something more urgent came up, but will definitely have a look in the coming week ;-)]

svlandeg commented 4 years ago

@Z-e-e : cf. PR https://github.com/explosion/sense2vec/pull/106

[EDIT: this was a reply to a question asking which changes exactly were made by @dshefman1. Though that question has now been deleted, it's still good to link the relevant PR to this Issue :-)]

svlandeg commented 4 years ago

Hi @dshefman1, I finally had some time to look into this in more detail.

The main reason the scripts are "incompatible" with Windows is that they expect a fastText binary file, which you should be able to build from the fasttext GitHub repo. This should be doable with the instructions given for cmake.

However, the other option is also to just download the binary files from the unofficial release for Windows: https://github.com/xiamx/fastText/releases. That works for me on Windows just fine. I also didn't run into any trouble with the BOM etc.

I think this may be the best option in the end, as the changes you started making to the script were quite big, and we need to make sure that it also keeps working on other platforms. What do you think?

dshefman1 commented 4 years ago

Hi @svlandeg It sounds like you are saying that the current version of 04_fasttext_train_vectors.py already meets your criteria "to have one version of the script that works across all platforms." If so, then it sounds like a fine solution to me.

svlandeg commented 4 years ago

I mean, ideally, we wouldn't need to depend on a platform-dependent binary file. But it looks like working around it gets quite involved, and I'm also not sure what all the different intermediate files are for. So I'm just wondering whether it's worth putting more time into this if we can use that unofficial Windows release instead?

dshefman1 commented 4 years ago

@svlandeg I was able to answer my own intermediate-files questions from before. So, the intermediate files are not an issue for the code I wrote. The code I wrote provides all of the appropriate inputs for the 05_export.py, but it does need to be tested for non-Windows users. However, if you think it is not worth putting more time into this then I can get on board with that. I have plenty of high priority work to keep me busy these days.
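For testing on other platforms, a small consistency check of the two output files may help. This is a sketch under the format assumptions stated earlier in this thread (a "<count> <dim>" header in vectors.txt and "<word> <frequency> word" lines in vocab.txt); `check_export_inputs` is a hypothetical helper and does not reproduce 05_export.py.

```python
from pathlib import Path


def check_export_inputs(vectors_file, vocab_file):
    """Return (n_vectors, dim, n_vocab), or raise ValueError on a mismatch."""
    with Path(vectors_file).open("r", encoding="utf-8-sig") as f:
        header = next(f).split()
        if len(header) != 2:
            raise ValueError("vectors file has no '<count> <dim>' header")
        n_declared, dim = int(header[0]), int(header[1])
        n_vectors = 0
        for line in f:
            # Split from the right so the key keeps any spaces it contains
            parts = line.rstrip("\n").rsplit(" ", dim)
            if len(parts) != dim + 1:
                raise ValueError(f"Row {n_vectors + 1} does not have {dim} values")
            n_vectors += 1
    if n_vectors != n_declared:
        raise ValueError(f"Header declares {n_declared} rows, found {n_vectors}")

    n_vocab = 0
    with Path(vocab_file).open("r", encoding="utf-8-sig") as f:
        for line in f:
            item = line.rstrip()
            if item.endswith(" word"):
                item = item[:-5]
            key, freq = item.rsplit(" ", 1)
            int(freq)  # raises ValueError if the frequency is not an integer
            n_vocab += 1
    return n_vectors, dim, n_vocab
```

Running the same check on Windows, Linux, and macOS outputs would show quickly whether the files differ in shape across platforms.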

svlandeg commented 4 years ago

@svlandeg I was able to answer my own intermediate-files questions from before. So, the intermediate files are not an issue for the code I wrote.

Oh, OK, I thought you were still having open issues! So basically the PR works for you as-is on Windows?

dshefman1 commented 4 years ago

@svlandeg Yes, it does. Sorry. That is my mistake for not reporting back that the PR works as-is on Windows. Also, since the PR uses the pip installed fastText library, instead of a binary build of fastText, I would think that it would work with almost any operating system, but I don't have a non-Windows machine to test it on.
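If a single script were to support both approaches discussed in this thread, it could first check which one is available. A minimal sketch, assuming the Python package is importable as `fasttext` and the binary is on PATH under the name `fasttext` (both names are assumptions about the local setup); `pick_backend` is a hypothetical helper.

```python
import importlib.util
import shutil


def pick_backend(module_name="fasttext", binary_name="fasttext"):
    """Return "python" if the module is importable, "binary" if an
    executable with the given name is on PATH, and None otherwise."""
    if importlib.util.find_spec(module_name) is not None:
        return "python"
    if shutil.which(binary_name) is not None:
        return "binary"
    return None
```

Preferring the pip-installed package keeps the default path free of platform-specific binaries, while the binary remains a fallback.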