I keep running out of memory when trying to prepare the data. I have 16 GB of RAM and 1 GB of swap. I previously managed to prepare data with 2 million messages and a 40k vocab; now I have 40 million messages, and I've been gradually reducing the vocab size from 40k down to 20k, but I still run into the problem.
I remember Sentdex describing using a lot of Reddit data, and even suggesting increasing the vocab size beyond 20k if you have more than 4 GB of data, so I'm not sure whether this is normal, or what would give the best results: reducing the amount of data, or reducing the vocab even further?
It's also been a bit buggy for me in general, so it might be some issue on my end; I had to compile my own build of TensorFlow because of an unsupported instruction set, and mess with Python version management.
Thanks.
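To narrow down whether this is normal, it may help to log memory at each preparation stage to see which step (temporary vocab, BPE learning, BPE application) actually exhausts RAM. A minimal, stdlib-only sketch, assuming Linux (where ru_maxrss is reported in kilobytes); the stage names are illustrative, not the script's real step names:

import resource

def log_peak_memory(stage):
    # On Linux, ru_maxrss is the peak resident set size in kilobytes.
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print("[{}] peak RSS: {:.2f} GiB".format(stage, peak_kb / 2**20))

log_peak_memory("before BPE learning")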
Learning BPE
Building temporary vocab (from)
6698 tokens [00:00, 19.33 tokens/s]
Learning BPE for vocab of 20000 tokens
4%|▎ | 739/20000 [00:17<07:27, 43.09 tokens/s]
Applying BPE
File: train.bpe.from
0%| | 0/29196705 [00:00<?, ? lines/s]
Exception in thread Thread-3931:
Traceback (most recent call last):
  File "/usr/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 412, in _handle_workers
    pool._maintain_pool()
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 248, in _maintain_pool
    self._repopulate_pool()
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 241, in _repopulate_pool
    w.start()
  File "/usr/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.7/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.7/multiprocessing/popen_fork.py", line 70, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
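The traceback shows the crash happening in os.fork() inside multiprocessing.Pool's worker-maintenance thread: each fork has to commit a copy-on-write image of the parent's whole address space, so once the parent process has grown large, 16 GB of RAM plus 1 GB of swap leaves too little headroom and the fork itself fails, independent of the BPE logic. A minimal sketch of capping the pool, assuming the preparation script can be edited; apply_bpe_to_line and the pool parameters are illustrative stand-ins, not the project's actual code:

import multiprocessing

def apply_bpe_to_line(line):
    # Stand-in for the real per-line BPE application.
    return line

if __name__ == "__main__":
    lines = ("example line %d\n" % i for i in range(100000))  # stand-in corpus
    # Fewer workers means fewer concurrent fork()s to satisfy, and
    # maxtasksperchild recycles workers so their heaps stay small.
    with multiprocessing.Pool(processes=2, maxtasksperchild=1000) as pool:
        for out in pool.imap(apply_bpe_to_line, lines, chunksize=256):
            pass  # write `out` to the output file instead of discarding it

Adding a few more gigabytes of swap would also give the kernel enough commit headroom for the forks, even without touching the pool size.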