NougatCA / FineTuner

GNU General Public License v3.0

Large Dataset Multiprocessing Issue #14

Open FahadEbrahim opened 1 year ago

FahadEbrahim commented 1 year ago

When I try to use a new, large dataset (15M pairs) to test code clone detection on different models, I get an error related to multiprocessing encoding. Any ideas or suggestions? I suspect the dataset is so large that the CPU freezes while processing it. I tried reducing the batch size and max_length, but the problem persists. I'm running on Linux.

Error Message:

Killed
[usr]$ Process ForkPoolWorker-3:
Traceback (most recent call last):
  File "/usr/lib64/python3.9/multiprocessing/pool.py", line 131, in worker
    put((job, i, result))
  File "/usr/lib64/python3.9/multiprocessing/queues.py", line 377, in put
    self._writer.send_bytes(obj)
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 204, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 409, in _send_bytes
    self._send(buf)
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 372, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib64/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib64/python3.9/multiprocessing/pool.py", line 136, in worker
    put((job, i, (False, wrapped)))
  File "/usr/lib64/python3.9/multiprocessing/queues.py", line 377, in put
    self._writer.send_bytes(obj)
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 204, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 408, in _send_bytes
    self._send(header)
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 372, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Process ForkPoolWorker-2:
Traceback (most recent call last):
  File "/usr/lib64/python3.9/multiprocessing/pool.py", line 131, in worker
    put((job, i, result))
  File "/usr/lib64/python3.9/multiprocessing/queues.py", line 377, in put
    self._writer.send_bytes(obj)
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 204, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 408, in _send_bytes
    self._send(header)
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 372, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib64/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib64/python3.9/multiprocessing/pool.py", line 136, in worker
    put((job, i, (False, wrapped)))
  File "/usr/lib64/python3.9/multiprocessing/queues.py", line 377, in put
    self._writer.send_bytes(obj)
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 204, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 408, in _send_bytes
    self._send(header)
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 372, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Process ForkPoolWorker-1:
Traceback (most recent call last):
  File "/usr/lib64/python3.9/multiprocessing/pool.py", line 131, in worker
    put((job, i, result))
  File "/usr/lib64/python3.9/multiprocessing/queues.py", line 377, in put
    self._writer.send_bytes(obj)
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 204, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 408, in _send_bytes
    self._send(header)
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 372, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib64/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib64/python3.9/multiprocessing/pool.py", line 136, in worker
    put((job, i, (False, wrapped)))
  File "/usr/lib64/python3.9/multiprocessing/queues.py", line 377, in put
    self._writer.send_bytes(obj)
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 204, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 408, in _send_bytes
    self._send(header)
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 372, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Process ForkPoolWorker-4:
Traceback (most recent call last):
  File "/usr/lib64/python3.9/multiprocessing/pool.py", line 131, in worker
    put((job, i, result))
  File "/usr/lib64/python3.9/multiprocessing/queues.py", line 377, in put
    self._writer.send_bytes(obj)
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 204, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 408, in _send_bytes
    self._send(header)
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 372, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib64/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib64/python3.9/multiprocessing/pool.py", line 136, in worker
    put((job, i, (False, wrapped)))
  File "/usr/lib64/python3.9/multiprocessing/queues.py", line 377, in put
    self._writer.send_bytes(obj)
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 204, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 408, in _send_bytes
    self._send(header)
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 372, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

NougatCA commented 1 year ago

Hi @FahadEbrahim

From the error message and my experience, this looks like a problem with parallel (multiprocess) encoding of large datasets. Try setting the argument single_thread to True when calling the multiprocess_encoding function at line 407 of src/data.py, so that the dataset is processed in a single thread.
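
For readers unfamiliar with this kind of switch, here is a minimal sketch of how a single_thread flag might select between a worker pool and a plain loop. It is not FineTuner's actual implementation: encode_examples, encode_fn, processes and chunksize are hypothetical names used only for illustration.

```python
from multiprocessing import Pool
from typing import Any, Callable, List, Optional, Sequence


def encode_examples(
    examples: Sequence[Any],
    encode_fn: Callable[[Any], Any],
    single_thread: bool = False,
    processes: Optional[int] = None,
    chunksize: int = 1000,
) -> List[Any]:
    """Apply encode_fn to every example, optionally without a worker pool.

    Hypothetical helper: the name and parameters are assumptions and do not
    mirror the signature of multiprocess_encoding in src/data.py.
    """
    if single_thread:
        # Plain loop in the calling process: no pipes, no pickling of results.
        return [encode_fn(example) for example in examples]

    # Worker pool: each result is pickled and sent back through a pipe.
    # encode_fn must be picklable (for example, a module-level function).
    with Pool(processes=processes) as pool:
        return list(pool.imap(encode_fn, examples, chunksize=chunksize))
```

In this sketch both branches still collect every encoded example in one list in the calling process, so switching to the single-threaded branch removes the inter-process pipes but does not by itself lower peak memory.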

FahadEbrahim commented 1 year ago

@NougatCA Thank you for the suggestion.

Unfortunately, the same thing happens and the primary process gets killed. I'll try to chunk the dataset and see how it goes.
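
One way the chunking idea could look is sketched below, assuming the pairs can be read lazily and each encoded pair is JSON-serialisable. The helpers iter_chunks and encode_in_chunks, the shard file naming, and the default chunk size are all hypothetical and are not part of FineTuner.

```python
import itertools
import json
import os
from typing import Any, Callable, Iterable, Iterator, List


def iter_chunks(items: Iterable[Any], chunk_size: int) -> Iterator[List[Any]]:
    """Yield consecutive lists of at most chunk_size items from a lazy input."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    iterator = iter(items)
    while True:
        chunk = list(itertools.islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def encode_in_chunks(
    pairs: Iterable[Any],
    encode_fn: Callable[[Any], Any],
    out_dir: str,
    chunk_size: int = 100_000,
) -> List[str]:
    """Encode one chunk at a time and write each chunk to its own JSON Lines shard.

    Only one chunk of raw pairs is held in memory at any moment; encoded
    results go straight to disk instead of accumulating in a list.
    """
    os.makedirs(out_dir, exist_ok=True)
    shard_paths = []
    for index, chunk in enumerate(iter_chunks(pairs, chunk_size)):
        path = os.path.join(out_dir, f"shard_{index:05d}.jsonl")
        with open(path, "w", encoding="utf-8") as handle:
            for pair in chunk:
                handle.write(json.dumps(encode_fn(pair)) + "\n")
        shard_paths.append(path)
    return shard_paths
```

The shards can then be loaded one at a time during training or evaluation, so the full 15M-pair dataset never has to sit in memory at once.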