I keep running out of memory when trying to prepare the data. I have 16 GB of RAM and 1 GB of swap. I previously managed to prepare data with 2 million messages and a 40k vocab; now I have 40 million messages, and I've been gradually reducing the vocab size from 40k down to 20k, but I still run into the problem.
I remember Sentdex describing using a lot of Reddit data, and even suggesting increasing the vocab size beyond 20k if you have more than 4 GB of data, so I'm not sure whether this is normal, or what would give the best results: reducing the amount of data, or reducing the vocab even further?
It's also been a bit buggy for me in general, so it might be some issue on my end; I had to compile my own build of TensorFlow because of an unsupported instruction set, and mess with Python version management.
Thanks.
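To narrow down whether this is normal, it may help to log memory at each preparation stage to see which step (temporary vocab, BPE learning, BPE application) actually exhausts RAM. A minimal, stdlib-only sketch, assuming Linux (where ru_maxrss is reported in kilobytes); the stage names are illustrative, not the script's real step names:

import resource

def log_peak_memory(stage):
    # On Linux, ru_maxrss is the peak resident set size in kilobytes.
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print("[{}] peak RSS: {:.2f} GiB".format(stage, peak_kb / 2**20))

log_peak_memory("before BPE learning")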
Learning BPE
Building temporary vocab (from)
6698 tokens [00:00, 19.33 tokens/s]
Learning BPE for vocab of 20000 tokens
4%|▎ | 739/20000 [00:17<07:27, 43.09 tokens/s]
Applying BPE
File: train.bpe.from
0%| | 0/29196705 [00:00<?, ? lines/s]
Exception in thread Thread-3931:
Traceback (most recent call last):
  File "/usr/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 412, in _handle_workers
    pool._maintain_pool()
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 248, in _maintain_pool
    self._repopulate_pool()
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 241, in _repopulate_pool
    w.start()
  File "/usr/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.7/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.7/multiprocessing/popen_fork.py", line 70, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
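The traceback shows the crash happening in os.fork() inside multiprocessing.Pool's worker-maintenance thread: each fork has to commit a copy-on-write image of the parent's whole address space, so once the parent process has grown large, 16 GB of RAM plus 1 GB of swap leaves too little headroom and the fork itself fails, independent of the BPE logic. A minimal sketch of capping the pool, assuming the preparation script can be edited; apply_bpe_to_line and the pool parameters are illustrative stand-ins, not the project's actual code:

import multiprocessing

def apply_bpe_to_line(line):
    # Stand-in for the real per-line BPE application.
    return line

if __name__ == "__main__":
    lines = ("example line %d\n" % i for i in range(100000))  # stand-in corpus
    # Fewer workers means fewer concurrent fork()s to satisfy, and
    # maxtasksperchild recycles workers so their heaps stay small.
    with multiprocessing.Pool(processes=2, maxtasksperchild=1000) as pool:
        for out in pool.imap(apply_bpe_to_line, lines, chunksize=256):
            pass  # write `out` to the output file instead of discarding it

Adding a few more gigabytes of swap would also give the kernel enough commit headroom for the forks, even without touching the pool size.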