If you end up splitting the output files in the 01_parse.py script, you can easily run the preprocessing script over each of them using GNU parallel:
find docbins/ -name '*.spacy' | parallel --jobs 10 python sense2vec/scripts/02_preprocess.py {} s2v_format/ en_core_web_sm
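Before kicking off the parallel jobs, it can be worth a quick sanity check that each split file deserializes cleanly. This is my own sketch, not part of the sense2vec scripts; it assumes the split files sit in docbins/ as in the command above:

from pathlib import Path

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # only a vocab is needed to rebuild the Doc objects
for path in sorted(Path("docbins").glob("*.spacy")):
    doc_bin = DocBin().from_bytes(path.read_bytes())
    docs = list(doc_bin.get_docs(nlp.vocab))
    print(path.name, len(docs), "docs")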
I had the same problem; see the error message below. After doing some more preprocessing, however, I no longer get the "bytes object is too large" ValueError. Preprocessing steps: (1) removed duplicates, (2) stripped trailing whitespace from each sentence, (3) removed sentences longer than 2520 characters, (4) removed sentences shorter than 11 characters. These four steps cut my dataset by 74%, from 7,487,357 sentences to 1,978,295, so I'm not sure which of them actually fixed the problem.
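For reference, those four steps amount to roughly the following. This is a sketch assuming a plain-text corpus with one sentence per line (corpus2.txt, as in the command below); the output filename is made up:

# Dedupe, strip trailing whitespace, and drop sentences longer than 2520
# or shorter than 11 characters, as described above.
seen = set()
with open("corpus2.txt", encoding="utf8") as infile, \
        open("corpus2_filtered.txt", "w", encoding="utf8") as outfile:
    for line in infile:
        sentence = line.rstrip()
        if not (11 <= len(sentence) <= 2520):
            continue
        if sentence in seen:
            continue
        seen.add(sentence)
        outfile.write(sentence + "\n")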
~/sense2vec$ python scripts/01_parse.py ../corpus2.txt ../corpus_parsed2 en_core_web_lg --n 14
✔ Created output directory ../corpus_parsed2
ℹ Using spaCy model en_core_web_lg
Preprocessing text...
Docs: 7487357 [57:44, 2161.41/s]
✔ Processed 7487357 docs
Traceback (most recent call last):
  File "01_parse.py", line 45, in <module>
    plac.call(main)
  File "~\AppData\Local\Continuum\anaconda3\lib\site-packages\plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "~\AppData\Local\Continuum\anaconda3\lib\site-packages\plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "01_parse.py", line 37, in main
    doc_bin_bytes = doc_bin.to_bytes()
  File "~\AppData\Local\Continuum\anaconda3\lib\site-packages\spacy\tokens\_serialize.py", line 151, in to_bytes
    return zlib.compress(srsly.msgpack_dumps(msg))
  File "~\AppData\Local\Continuum\anaconda3\lib\site-packages\srsly\_msgpack_api.py", line 16, in msgpack_dumps
    return msgpack.dumps(data, use_bin_type=True)
  File "~\AppData\Local\Continuum\anaconda3\lib\site-packages\srsly\msgpack\__init__.py", line 40, in packb
    return Packer(**kwargs).pack(o)
  File "_packer.pyx", line 285, in srsly.msgpack._packer.Packer.pack
  File "_packer.pyx", line 291, in srsly.msgpack._packer.Packer.pack
  File "_packer.pyx", line 288, in srsly.msgpack._packer.Packer.pack
  File "_packer.pyx", line 235, in srsly.msgpack._packer.Packer._pack
  File "_packer.pyx", line 206, in srsly.msgpack._packer.Packer._pack
ValueError: bytes object is too large
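For what it's worth, the limit being hit here appears to be msgpack's rather than RAM: DocBin.to_bytes() serializes the whole corpus into one msgpack message, and msgpack's bin format stores lengths in 32 bits, so any single bytes value over 2**32 - 1 bytes (~4 GB) raises exactly this error. A minimal reproduction (illustrative only, it allocates about 4 GB):

import srsly

# A single bytes value of 2**32 bytes exceeds msgpack's 32-bit length field,
# which is what the token data for a very large DocBin ends up doing.
srsly.msgpack_dumps({"tokens": b"\x00" * 2**32})  # ValueError: bytes object is too large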
How big is each of your documents? Is each one a sentence, or is it more like a news article? Mine are around 500 words / 3,000-4,000 characters, so if yours are sentence-length that could keep you below the size limit. (That would also explain why you're getting ~2,000 docs/second while I'm getting ~100/second on 14 cores.)
In general, though, it's not ideal to have to trim the corpus just to avoid the serialization error. I'm about to train vectors on a much larger corpus of text, so I'll see how the splitting solution in #103 works.
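For anyone who hits this before #103 lands, the idea is roughly to flush the DocBin to a new .spacy file every N docs instead of serializing the whole corpus at once. This is my own sketch, not the actual #103 patch: the batch size, attrs list, and file naming are assumptions, and the paths mirror the command above.

import spacy
from spacy.tokens import DocBin

nlp = spacy.load("en_core_web_lg")
ATTRS = ["POS", "TAG", "DEP", "ENT_IOB", "ENT_TYPE"]  # match whatever 01_parse.py stores
batch_size = 100_000  # docs per .spacy file; keeps each file well under the ~4 GB limit

def flush(doc_bin, n_file):
    # Each part is small enough for DocBin.to_bytes() to serialize safely.
    with open(f"corpus_parsed2/part_{n_file}.spacy", "wb") as out:
        out.write(doc_bin.to_bytes())

doc_bin = DocBin(attrs=ATTRS)
n_file = 0
with open("corpus2.txt", encoding="utf8") as f:
    texts = (line.strip() for line in f)
    for i, doc in enumerate(nlp.pipe(texts, batch_size=1000)):
        doc_bin.add(doc)
        if (i + 1) % batch_size == 0:
            flush(doc_bin, n_file)
            n_file += 1
            doc_bin = DocBin(attrs=ATTRS)
if len(doc_bin) > 0:  # write whatever is left over
    flush(doc_bin, n_file)

The resulting part_*.spacy files can then each be fed to 02_preprocess.py, e.g. with the GNU parallel command earlier in the thread.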
How big is each of your documents? Is each one a sentence, or is it more like a news article?
Each of my documents is a sentence of about 120 characters, on average, so I agree with your reasoning.
I've been getting a "bytes object is too large" error when processing a large-ish number of documents using the 01_parse.py script. Creating several smaller doc_bin objects resolves the issue. Full error: