danielzuegner / code-transformer

Implementation of the paper "Language-agnostic representation learning of source code from structure and context".
https://www.in.tum.de/daml/code-transformer/
MIT License

Error when preprocessing java-medium #19

Closed: dmivilensky closed this issue 2 years ago

dmivilensky commented 2 years ago

Dear Authors,

Thank you very much for your work! I used the preprocessing scripts for the java code2seq-style datasets so that I could later test the method name prediction model on the java-medium dataset, but I ran into a strange error. The stage 1 scripts finished without problems, but after I ran the stage 2 script (with a command like `python -m scripts.run-preprocessing code_transformer/experiments/preprocessing/preprocess-2.yaml java-medium train`, in particular), the process failed after a couple of hours (with the settings `batch_size: 1000`, `num_processes: 8`) with the following error:

```
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
```

The detailed traceback follows:

```
Traceback (most recent calls WITHOUT Sacred internals):
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 344, in _sendback_result
    exception=exception))
  File "/home/ubuntu/.local/lib/python3.7/site-packages/joblib/externals/loky/backend/queues.py", line 240, in put
    self._writer.send_bytes(obj)
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
"""

The above exception was the direct cause of the following exception:

Traceback (most recent calls WITHOUT Sacred internals):
  File "code_transformer/experiments/preprocessing/preprocess-2.py", line 340, in main
    Preprocess2Container().run()
  File "code_transformer/experiments/preprocessing/preprocess-2.py", line 312, in run
    for batch in dataset_slice)
  File "/home/ubuntu/.local/lib/python3.7/site-packages/joblib/parallel.py", line 1017, in __call__
    self.retrieve()
  File "/home/ubuntu/.local/lib/python3.7/site-packages/joblib/parallel.py", line 909, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/home/ubuntu/.local/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 562, in wrap_future_result
    return future.result(timeout=timeout)
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 435, in result
    return self.__get_result()
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
```
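Just to narrow things down on my side, the limit mentioned in the error message can be reproduced in isolation. This snippet is unrelated to the repository's code and only shows where the numbers come from:

```python
import struct

# multiprocessing.connection prefixes each message with a signed 32-bit
# length header packed as "!i" (see the traceback above), so any payload
# of 2**31 bytes or more cannot be encoded.
try:
    struct.pack("!i", 2**31)  # one past the maximum of 2147483647
except struct.error as e:
    print(e)  # 'i' format requires -2147483648 <= number <= 2147483647
```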

Unfortunately, I wasn't able to pinpoint the procedure that causes this error. So my questions are: why does this error appear on java-medium (but not on the smaller datasets), and how can I resolve it (at the very least, which particular line may produce it, so that I can perhaps catch an exception around some problematic data)?

tobias-kirschstein commented 2 years ago

Hi Dmitry,

thank you for your interest in the Code Transformer and for reporting this issue. As far as I can tell, this is caused by the multiprocessing employed in the script: at some point, the data exchanged with a subprocess becomes too large to pickle and send in one piece (see Stackoverflow). Essentially, the multiprocessing call in https://github.com/danielzuegner/code-transformer/blob/539742288747b3fe541575d0ee266e3c3587bfe8/code_transformer/experiments/preprocessing/preprocess-2.py#L298 forwards a batch object to self.preprocess(..) that is too large to pickle. We never observed this error in our experiments, so I suspect that java-medium contains some very large methods that cause the ASTs generated in stage 1 to become huge.
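If you want to verify this on your side, you could check how large a pickled batch actually gets before it is handed to a worker. This is only a rough sketch (the `pickled_size` helper is made up for illustration and is not part of the repository), but anything near or above 2 GiB would explain the error:

```python
import pickle

# Rough check: how many bytes does an object occupy once pickled?
# multiprocessing's 32-bit length header caps a single transfer at 2**31 - 1 bytes.
def pickled_size(obj):
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))

limit = 2**31 - 1
batch = list(range(1_000))  # stand-in for one preprocessing batch
size = pickled_size(batch)
print(f"pickled size: {size:,} bytes, limit: {limit:,} bytes, fits: {size <= limit}")
```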

I see several possible solutions here:

- Reduce `batch_size` in the stage 2 preprocessing config (e.g., from 1000 to a much smaller value) so that each batch stays well below the 2 GiB pickling limit (see the sketch below).
- Switch to Python 3.8 or newer; there, `multiprocessing` can transfer objects larger than 2 GiB, which should avoid this particular `struct` error.
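To illustrate the first option, here is a minimal, self-contained sketch of dispatching work in small chunks with joblib. It is not the repository's actual preprocessing code, just the general pattern of keeping each pickled payload small:

```python
from joblib import Parallel, delayed

def chunked(items, chunk_size):
    """Yield successive chunks of at most chunk_size items."""
    for i in range(0, len(items), chunk_size):
        yield items[i:i + chunk_size]

def preprocess_chunk(chunk):
    # stand-in for the real per-batch preprocessing work
    return [len(sample) for sample in chunk]

if __name__ == "__main__":
    samples = ["sample code " * 50 for _ in range(10_000)]  # dummy data
    # Small chunks keep both the arguments and the returned results
    # well below the 2 GiB limit of a single inter-process transfer.
    results = Parallel(n_jobs=8)(
        delayed(preprocess_chunk)(chunk) for chunk in chunked(samples, 10)
    )
    print(len(results), "chunks processed")
```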

Please let me know if you could resolve this issue.

SpirinEgor commented 2 years ago

Hi!

Thank you for your advice, it helped me successfully process the java-medium dataset. Interestingly, I had separately tried smaller batch sizes and different Python versions before, but that did not help at all.

The combination that worked for me was a batch size of 10 and Python 3.8. I also used only 15 processes, as in the original configuration. In my first run I had used a higher number, but decided to fall back to a verified value for the final attempt.

Thanks!

tobias-kirschstein commented 2 years ago

Hi Egor,

> The combination that worked for me was a batch size of 10 and Python 3.8. I also used only 15 processes, as in the original configuration. In my first run I had used a higher number, but decided to fall back to a verified value for the final attempt.

thanks for reporting back what worked for you! I'm sure this will help others who run into a similar problem.