danielzuegner / code-transformer

Implementation of the paper "Language-agnostic representation learning of source code from structure and context".
https://www.in.tum.de/daml/code-transformer/
MIT License

Error when preprocessing java-medium #19

Closed: dmivilensky closed this issue 2 years ago

dmivilensky commented 2 years ago

Dear Authors,

Thank you very much for your work! I used the preprocessing scripts for the java code2seq-style datasets so that I could later test the method name prediction model on the java-medium dataset, but I ran into a strange error. The stage 1 scripts finished without problems, but after I ran the stage 2 script (with a command like `python -m scripts.run-preprocessing code_transformer/experiments/preprocessing/preprocess-2.yaml java-medium train`, in particular), the process failed after a couple of hours (with the settings `batch_size: 1000`, `num_processes: 8`) with the following error:

```
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
```

The detailed traceback follows:

```
Traceback (most recent calls WITHOUT Sacred internals):
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 344, in _sendback_result
    exception=exception))
  File "/home/ubuntu/.local/lib/python3.7/site-packages/joblib/externals/loky/backend/queues.py", line 240, in put
    self._writer.send_bytes(obj)
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
"""

The above exception was the direct cause of the following exception:

Traceback (most recent calls WITHOUT Sacred internals):
  File "code_transformer/experiments/preprocessing/preprocess-2.py", line 340, in main
    Preprocess2Container().run()
  File "code_transformer/experiments/preprocessing/preprocess-2.py", line 312, in run
    for batch in dataset_slice)
  File "/home/ubuntu/.local/lib/python3.7/site-packages/joblib/parallel.py", line 1017, in __call__
    self.retrieve()
  File "/home/ubuntu/.local/lib/python3.7/site-packages/joblib/parallel.py", line 909, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/home/ubuntu/.local/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 562, in wrap_future_result
    return future.result(timeout=timeout)
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 435, in result
    return self.__get_result()
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
```
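Just to narrow things down on my side, the limit mentioned in the error message can be reproduced in isolation. This snippet is unrelated to the repository's code and only shows where the numbers come from:

```python
import struct

# multiprocessing.connection prefixes each message with a signed 32-bit
# length header packed as "!i" (see the traceback above), so any payload
# of 2**31 bytes or more cannot be encoded.
try:
    struct.pack("!i", 2**31)  # one past the maximum of 2147483647
except struct.error as e:
    print(e)  # 'i' format requires -2147483648 <= number <= 2147483647
```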

Unfortunately, I wasn't able to pinpoint the procedure that causes this error. So my questions are: why does this error appear on java-medium (but not on the smaller datasets), and how can I resolve it (at the very least, which particular line may produce it, so that I can perhaps catch an exception around some problematic data)?

tobias-kirschstein commented 2 years ago

Hi Dmitry,

thank you for your interest in the Code Transformer and for reporting this issue. As far as I can tell, this is caused by the multiprocessing employed in the script: at some point, the data exchanged with a subprocess becomes too large to pickle and send in one piece (see Stackoverflow). Essentially, the multiprocessing call in https://github.com/danielzuegner/code-transformer/blob/539742288747b3fe541575d0ee266e3c3587bfe8/code_transformer/experiments/preprocessing/preprocess-2.py#L298 forwards a batch object to self.preprocess(..) that is too large to pickle. We never observed this error in our experiments, so I suspect that java-medium contains some very large methods that cause the ASTs generated in stage 1 to become huge.
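If you want to verify this on your side, you could check how large a pickled batch actually gets before it is handed to a worker. This is only a rough sketch (the `pickled_size` helper is made up for illustration and is not part of the repository), but anything near or above 2 GiB would explain the error:

```python
import pickle

# Rough check: how many bytes does an object occupy once pickled?
# multiprocessing's 32-bit length header caps a single transfer at 2**31 - 1 bytes.
def pickled_size(obj):
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))

limit = 2**31 - 1
batch = list(range(1_000))  # stand-in for one preprocessing batch
size = pickled_size(batch)
print(f"pickled size: {size:,} bytes, limit: {limit:,} bytes, fits: {size <= limit}")
```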

I see several possible solutions here:

- Reduce `batch_size` in the stage 2 preprocessing config (e.g., from 1000 to a much smaller value) so that each batch stays well below the 2 GiB pickling limit (see the sketch below).
- Switch to Python 3.8 or newer; there, `multiprocessing` can transfer objects larger than 2 GiB, which should avoid this particular `struct` error.
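To illustrate the first option, here is a minimal, self-contained sketch of dispatching work in small chunks with joblib. It is not the repository's actual preprocessing code, just the general pattern of keeping each pickled payload small:

```python
from joblib import Parallel, delayed

def chunked(items, chunk_size):
    """Yield successive chunks of at most chunk_size items."""
    for i in range(0, len(items), chunk_size):
        yield items[i:i + chunk_size]

def preprocess_chunk(chunk):
    # stand-in for the real per-batch preprocessing work
    return [len(sample) for sample in chunk]

if __name__ == "__main__":
    samples = ["sample code " * 50 for _ in range(10_000)]  # dummy data
    # Small chunks keep both the arguments and the returned results
    # well below the 2 GiB limit of a single inter-process transfer.
    results = Parallel(n_jobs=8)(
        delayed(preprocess_chunk)(chunk) for chunk in chunked(samples, 10)
    )
    print(len(results), "chunks processed")
```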

Please let me know if you could resolve this issue.

SpirinEgor commented 2 years ago

Hi!

Thank you for your advice, it helped me successfully process the java-medium dataset. Interestingly, I had separately tried smaller batch sizes and different Python versions before, but that did not help at all.

The combination that worked for me was a batch size of 10 and Python 3.8. I also used only 15 processes, as in the original configuration. In my first run I had used a higher number, but decided to fall back to a verified value for the final attempt.

Thanks!

tobias-kirschstein commented 2 years ago

Hi Egor,

> The combination that worked for me was a batch size of 10 and Python 3.8. I also used only 15 processes, as in the original configuration. In my first run I had used a higher number, but decided to fall back to a verified value for the final attempt.

thanks for reporting back what worked for you! I'm sure this will help others who run into a similar problem.