facebookresearch / CodeGen

Reference implementation of code generation projects from Facebook AI Research. General toolkit to apply machine learning to code, from dataset creation to model training and evaluation. Comes with pretrained models.
MIT License

UncompletedJobError: No output/error stream produced #73

Open sushantkumar007007 opened 2 years ago

sushantkumar007007 commented 2 years ago

I am running CodeGen on the test dataset (https://github.com/facebookresearch/CodeGen/tree/main/data/test_dataset) in obfuscation mode with the following command:

`run codegen_sources/preprocessing/preprocess.py data/python_test --mode obfuscation --local True --local_parallelism 4 --langs python --train_splits 1 --tokenization_timeout 400 --bpe_timeout 220 --train_bpe_timeout 400 --bpe_mode fast --fastbpe_use_vocab True --fastbpe_vocab_path data/bpe/cpp-java-python/vocab --fastbpe_code_path data/bpe/cpp-java-python/codes --keep_comments False --ncodes 4000 --percent_test_valid 2`

I am getting the following error:

`INFO - 05/04/22 15:56:33 - 0:00:00 - Dataset pipeline for /home/sushantk/anaconda3/codeGen/data/python_test

INFO - 05/04/22 15:56:33 - 0:00:00 - ========== Extract and Tokenize ===========
INFO - 05/04/22 15:56:33 - 0:00:00 - Using 4 processors.
INFO - 05/04/22 15:56:33 - 0:00:00 - python: tokenizing and extracting parallel functions in 1 json files ...
INFO - 05/04/22 15:56:33 - 0:00:00 - Number of lines to process: 50
WARNING - 05/04/22 15:56:33 - 0:00:01 - Error obfuscating content Missing parentheses in call to 'print'. Did you mean print('\nThe best BASE85 based alphabet for your setup is: %s' \)? (<unknown>, line 1673) 

WARNING - 05/04/22 15:56:33 - 0:00:01 - Error obfuscating content local variable 'mangledName' referenced before assignment 

WARNING - 05/04/22 15:56:33 - 0:00:01 - Error obfuscating content local variable 'mangledName' referenced before assignment 

WARNING - 05/04/22 15:56:33 - 0:00:01 - Error obfuscating content Missing parentheses in call to 'print'. Did you mean print("Press control+C to stop and show the summary")? (<unknown>, line 43) 

WARNING - 05/04/22 15:56:33 - 0:00:01 - Error obfuscating content local variable 'mangledName' referenced before assignment 

WARNING - 05/04/22 15:56:33 - 0:00:01 - Error obfuscating content local variable 'mangledName' referenced before assignment 

WARNING - 05/04/22 15:56:34 - 0:00:01 - Error obfuscating content Missing parentheses in call to 'print'. Did you mean print("permantly remove file ", file)? (<unknown>, line 374) 

WARNING - 05/04/22 15:56:34 - 0:00:01 - Error obfuscating content local variable 'mangledName' referenced before assignment 

WARNING - 05/04/22 15:56:34 - 0:00:01 - Error obfuscating content invalid syntax (<unknown>, line 426) 

WARNING - 05/04/22 15:56:34 - 0:00:01 - Error obfuscating content Missing parentheses in call to 'print'. Did you mean print("\nBEGIN - expecting GEOS_ERROR)? (<unknown>, line 135) 

WARNING - 05/04/22 15:56:34 - 0:00:01 - Error obfuscating content invalid syntax (<unknown>, line 92) 

WARNING - 05/04/22 15:56:34 - 0:00:01 - Error obfuscating content invalid syntax (<unknown>, line 62) 

100%|██████████| 50/50 [00:00<00:00, 3385.62it/s]
INFO - 05/04/22 15:56:34 - 0:00:01 - Time elapsed: 0.95
WARNING - 05/04/22 15:56:34 - 0:00:01 - Tokenization of /home/sushantk/anaconda3/codeGen/data/python_test/python.001 (1).json.gz:12 errors out of 50 lines(24.00%)
WARNING - 05/04/22 15:56:34 - 0:00:01 - Tokenization of /home/sushantk/anaconda3/codeGen/data/python_test/python.001 (1).json.gz:3 filtered examples in 50 lines(6.00%)

INFO - 05/04/22 15:56:34 - 0:00:01 - ========== Deduplicate and Split ===========
INFO - 05/04/22 15:56:34 - 0:00:02 - all files python.*[0-9].obfuscated.tok regrouped in /home/sushantk/anaconda3/codeGen/data/python_test/python.all.obfuscated.tok .
INFO - 05/04/22 15:56:34 - 0:00:02 - all files python.*[0-9].dictionary.tok regrouped in /home/sushantk/anaconda3/codeGen/data/python_test/python.all.dictionary.tok .
INFO - 05/04/22 15:56:34 - 0:00:02 - shuffling 2 files parallely: python.all.obfuscated.tok, python.all.dictionary.tok
INFO - 05/04/22 15:56:34 - 0:00:02 - python: Deduplication on 'obfuscated' and propagated on other suffixes.
INFO - 05/04/22 15:56:34 - 0:00:02 - python: Duplicated lines for obfuscated: 0 / 35
INFO - 05/04/22 15:56:34 - 0:00:02 - python: valid.obfuscated -> 0 lines
INFO - 05/04/22 15:56:35 - 0:00:02 - python: test.obfuscated -> 0 lines
INFO - 05/04/22 15:56:35 - 0:00:02 - python: train.obfuscated.0 -> 35 lines
INFO - 05/04/22 15:56:35 - 0:00:02 - python: Duplicated lines for dictionary: 0 / 35
INFO - 05/04/22 15:56:35 - 0:00:02 - python: valid.dictionary -> 0 lines
INFO - 05/04/22 15:56:35 - 0:00:02 - python: test.dictionary -> 0 lines
INFO - 05/04/22 15:56:35 - 0:00:02 - python: train.dictionary.0 -> 35 lines
INFO - 05/04/22 15:56:35 - 0:00:02 - Sucessfully regroup, deduplicate and split tokenized data into a train/valid/test sets.

INFO - 05/04/22 15:56:35 - 0:00:02 - ========== Learn BPE ===========
INFO - 05/04/22 15:56:35 - 0:00:02 - No need to train bpe codes, already trained. Codes: data/bpe/cpp-java-python/codes

INFO - 05/04/22 15:56:35 - 0:00:02 - ========== Apply BPE ===========
INFO - 05/04/22 15:56:35 - 0:00:02 - Applying BPE on /home/sushantk/anaconda3/codeGen/data/python_test/python.train.dictionary.0.tok ...
INFO - 05/04/22 15:56:35 - 0:00:02 - Applying BPE on /home/sushantk/anaconda3/codeGen/data/python_test/python.train.obfuscated.0.tok ...
WARNING - 05/04/22 15:56:35 - 0:00:02 - /home/sushantk/anaconda3/codeGen/data/python_test/python.valid.dictionary.tok is not a valid file, cannot to apply BPE on it.
WARNING - 05/04/22 15:56:35 - 0:00:02 - /home/sushantk/anaconda3/codeGen/data/python_test/python.valid.obfuscated.tok is not a valid file, cannot to apply BPE on it.
WARNING - 05/04/22 15:56:35 - 0:00:02 - /home/sushantk/anaconda3/codeGen/data/python_test/python.test.dictionary.tok is not a valid file, cannot to apply BPE on it.
WARNING - 05/04/22 15:56:35 - 0:00:02 - /home/sushantk/anaconda3/codeGen/data/python_test/python.test.obfuscated.tok is not a valid file, cannot to apply BPE on it.
---------------------------------------------------------------------------
UncompletedJobError                       Traceback (most recent call last)
~/anaconda3/codeGen/codegen_sources/preprocessing/preprocess.py in <module>()
    212     args.input_path = os.path.abspath(args.input_path)
    213     multiprocessing.set_start_method("fork")
--> 214     preprocess(args)

~/anaconda3/codeGen/codegen_sources/preprocessing/preprocess.py in preprocess(args)
    103 
    104     dataset.apply_bpe(
--> 105         executor=cluster_apply_bpe, local_parallelism=args.local_parallelism
    106     )
    107     dataset.get_vocab(executor=cluster_train_bpe)

~/anaconda3/codeGen/codegen_sources/preprocessing/dataset_modes/obfuscation_mode.py in apply_bpe(self, executor, local_parallelism)
    127         _bpe_ext = self.bpe.ext
    128         self.bpe.ext += TMP_EXT
--> 129         super().apply_bpe(executor)
    130         self.bpe.ext = _bpe_ext
    131         # restore BPE on obfuscation special tokens

~/anaconda3/codeGen/codegen_sources/preprocessing/dataset_modes/dataset_mode.py in apply_bpe(self, executor, local_parallelism)
    615                 jobs.append(job)
    616         for job in jobs:
--> 617             job.result()
    618         logger.info("BPE done.")
    619         # logger.info("Regrouping BPE")

~/anaconda3/envs/codeGen_env/lib/python3.6/site-packages/submitit/core/core.py in result(self)
    264 
    265     def result(self) -> R:
--> 266         r = self.results()
    267         assert not self._sub_jobs, "You should use `results()` if your job has subtasks."
    268         return r[0]

~/anaconda3/envs/codeGen_env/lib/python3.6/site-packages/submitit/core/core.py in results(self)
    287             return [tp.cast(R, sub_job.result()) for sub_job in self._sub_jobs]
    288 
--> 289         outcome, result = self._get_outcome_and_result()
    290         if outcome == "error":
    291             job_exception = self.exception()

~/anaconda3/envs/codeGen_env/lib/python3.6/site-packages/submitit/core/core.py in _get_outcome_and_result(self)
    382             else:
    383                 message.append(f"No output/error stream produced ! Check: {self.paths.stdout}")
--> 384             raise utils.UncompletedJobError("\n".join(message))
    385         try:
    386             output: tp.Tuple[str, tp.Any] = utils.pickle_load(self.paths.result_pickle)

UncompletedJobError: Job 18686 (task: 0) with path /home/sushantk/anaconda3/codeGen/data/python_test/log/18686_0_result.pkl
has not produced any output (state: FINISHED)
No output/error stream produced ! Check: /home/sushantk/anaconda3/codeGen/data/python_test/log/18686_0_log.out`

When I open "python.test.dictionary.tok", "python.test.obfuscated.tok", "python.valid.dictionary.tok" and "python.valid.obfuscated.tok", they are all blank; nothing was written to them.

Can you tell me why this is happening?

baptisteroziere commented 2 years ago

Hi, it may be because all 35 examples in the Python file you kept are sent to the training set, which leaves the valid and test splits empty. Maybe try running it on the 3 Python files in the test dataset (it should still be quite fast), or increase --percent_test_valid to something like 10 or 20.
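
To see why such a small input can leave the held-out splits empty, here is a rough sketch of the arithmetic. It assumes each of valid and test receives the floor of `n_lines * percent_test_valid / 100` lines; that rule and the `split_sizes` helper are illustrative assumptions, not the actual CodeGen splitting code. Under that assumption, 2 percent of 35 lines rounds down to zero, which is consistent with the `valid.obfuscated -> 0 lines` and `test.obfuscated -> 0 lines` entries in the log above.

```python
def split_sizes(n_lines: int, percent_test_valid: int) -> dict:
    """Hypothetical helper: estimate split sizes for a tiny dataset.

    Assumption (not taken from the CodeGen source): valid and test each
    get floor(n_lines * percent_test_valid / 100) lines, and the
    remaining lines go to train.
    """
    held_out = n_lines * percent_test_valid // 100  # integer floor
    return {
        "valid": held_out,
        "test": held_out,
        "train": n_lines - 2 * held_out,
    }


if __name__ == "__main__":
    # 35 is the number of deduplicated lines reported in the log.
    for percent in (2, 10, 20):
        print(percent, split_sizes(35, percent))
```

With only a few dozen lines, a larger percentage (or more input files) is needed before either held-out split gets at least one example.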