Open sushantkumar007007 opened 2 years ago
Hi,
It may be because all 35 examples in the Python file you kept were sent to the training set.
Maybe try running it on the 3 Python files in the test dataset (it should still be quite fast), or increase --percent_test_valid
to something like 10 or 20.
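To see why 2 is probably too small here, a rough back-of-the-envelope sketch may help. It assumes, hypothetically, that --percent_test_valid is applied to each held-out split separately and that every example is assigned independently; the actual splitting logic in preprocess.py may differ. The helper names are made up for illustration.

```python
def expected_held_out(n_examples: int, percent_test_valid: float) -> float:
    """Expected number of examples in one held-out split (test or valid)."""
    return n_examples * percent_test_valid / 100.0


def prob_split_empty(n_examples: int, percent_test_valid: float) -> float:
    """Chance that one split receives no example at all.

    Assumption: each example lands in the split independently with
    probability percent_test_valid / 100. This is not taken from CodeGen.
    """
    return (1.0 - percent_test_valid / 100.0) ** n_examples


if __name__ == "__main__":
    # 35 examples, as in the file discussed above.
    for pct in (2, 10, 20):
        expected = expected_held_out(35, pct)
        p_empty = prob_split_empty(35, pct)
        print(f"percent={pct}: expected per split={expected:.1f}, "
              f"P(split empty)={p_empty:.3f}")
```

With 35 examples and a value of 2, the expected size of each held-out split is below one example, so empty test and valid files are not surprising under these assumptions.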
I am running CodeGen preprocessing on the test dataset (https://github.com/facebookresearch/CodeGen/tree/main/data/test_dataset) in obfuscation mode:
run codegen_sources/preprocessing/preprocess.py data/python_test --mode obfuscation --local True --local_parallelism 4 --langs python --train_splits 1 --tokenization_timeout 400 --bpe_timeout 220 --train_bpe_timeout 400 --bpe_mode fast --fastbpe_use_vocab True --fastbpe_vocab_path data/bpe/cpp-java-python/vocab --fastbpe_code_path data/bpe/cpp-java-python/codes --keep_comments False --ncodes 4000 --percent_test_valid 2
I am getting the following problem:
After running it, the files "python.test.dictionary.tok", "python.test.obfuscated.tok", "python.valid.dictionary.tok" and "python.valid.obfuscated.tok" are blank; nothing is written to them.
Can you tell me why this is happening?
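For anyone trying to reproduce this, a quick way to list which output files came out empty is sketched below. It is a generic check, not part of CodeGen; the function name find_empty_files is hypothetical, and the directory in the usage part is only assumed to be where the .tok files end up.

```python
from pathlib import Path
from typing import List


def find_empty_files(directory: str, pattern: str = "*.tok") -> List[str]:
    """Return the sorted names of zero-byte files matching pattern.

    The search is recursive, so files in subfolders are included.
    """
    return sorted(
        p.name
        for p in Path(directory).rglob(pattern)
        if p.is_file() and p.stat().st_size == 0
    )


if __name__ == "__main__":
    # Assumed location: the data folder passed to preprocess.py above.
    for name in find_empty_files("data/python_test"):
        print("empty:", name)
```

If only the test and valid files show up in the list while the train files have content, that points to the split rather than to tokenization or obfuscation failing.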