Open 4everTheOne opened 3 years ago
Hi Afonso
Thank you! We really appreciate your interest in our work.
From your stack trace it seems like `tokenizers` is having trouble locating the path to the vocab/merges files. PyTorch Lightning had an issue with storing a tokenizer in the `hparams`. I'm not sure whether that has been fixed since, but I had to write a serializer/deserializer for the `hparams`, and even with that I had some issues. I think you should take a look at your `hparams` to check whether the paths are correct. You might need to set them manually once you load the checkpoint.
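As a minimal sketch of what I mean (the attribute names below are placeholders, so check what your checkpoint actually stores), you can inspect and patch the paths right after loading:

```python
import torch
from argparse import Namespace

from models.Transformer.TransformerLightning import TransformerLightning

checkpoint = torch.load("./trained-models/var-name/epoch=6.ckpt")
hparams = Namespace(**checkpoint["hparams"])

# See which tokenizer/vocab/merges paths were baked into the checkpoint
print(vars(hparams))

# Placeholder attribute names: point them at the files on *your* machine
hparams.target_bpe_path = "/path/to/target.bpe-merges.txt"
hparams.subtoken_bpe_path = "/path/to/subtoken.bpe-merges.txt"

model = TransformerLightning(hparams)
model.load_state_dict(checkpoint["state_dict"])
```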
Let me know if that works.
Thank you for the help, it fixed the issue!
Now I'm facing another issue, and if you don't mind I would like to ask for help again.
The problem seems related to the merge files:

Exception: Error while initializing BPE: Merges text file invalid at line 2

When I look inside `target.bpe-merges.txt` and `subtoken.bpe-merges.txt` there are some empty lines that don't appear to be right.
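A quick way to see them (just a minimal check, assuming plain text files with one merge rule per line):

```python
# Print the line numbers of any blank lines in the BPE merges files
for path in ["target.bpe-merges.txt", "subtoken.bpe-merges.txt"]:
    with open(path, encoding="utf-8") as f:
        blanks = [i for i, line in enumerate(f, start=1) if not line.strip()]
    print(f"{path}: blank lines at {blanks}")
```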
Am I missing some step during the preparation of the data that causes this?
Once again, thanks for your time.
To create the files I followed these steps:

1. Ran `extract.py` from Code2Seq to generate `java-small.train.txt`, `java-small.test.txt` and `java-small.val.txt`.
2. Ran `create_variable_name_dataset.py` to generate `java-small.train.var.txt`, `java-small.test.var.txt` and `java-small.val.var.txt`.
3. Generated `node_counts.txt`, `target_counts.txt` and `subtoken_counts.txt` from `java-small.train.var.txt`.
I'm glad that helped!
It seems like there is definitely an issue with target-counts.txt, since I would expect the targets to be the labels (the variable/method names to be predicted), and the blank lines in the tokenizer files also look wrong. I don't know off the top of my head what the problem could be. Both construct_counts.py and train-bpe.py use relatively simple logic, so the problem is probably something a little more subtle. Let me investigate and I will get back to you.
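In the meantime, stripping the blank lines out of the merges files might at least let the tokenizer load again. This is only a stopgap sketch and assumes the blank lines are the sole problem:

```python
# Stopgap: rewrite the merges files without the blank lines (keep a backup first)
for path in ["target.bpe-merges.txt", "subtoken.bpe-merges.txt"]:
    with open(path, encoding="utf-8") as f:
        kept = [line for line in f if line.strip()]
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(kept)
```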
First of all, congrats on your work!
I was trying to use the trained tool to predict variable names for another dataset, but when I try to make predictions an error occurs.
Following the data processing documentation with the java-small dataset provided by Code2Seq, I executed the following operations:
After this I have the following files:
What am I doing wrong?
Thanks for your time.
Attachments
Packages versions
```
absl-py==0.11.0
aiohttp==3.7.3
aiohttp-cors==0.7.0
aioredis==1.3.1
appdirs==1.4.4
argon2-cffi==20.1.0
async-generator==1.10
async-timeout==3.0.1
attrs==20.3.0
Automat==20.2.0
backcall==0.2.0
bleach==3.3.0
blessings==1.7
cachetools==4.2.1
certifi==2020.12.5
cffi==1.14.4
chardet==4.0.0
click==7.1.2
colorama==0.4.4
colorful==0.5.4
constantly==15.1.0
cursor==1.3.4
decorator==4.4.2
defusedxml==0.6.0
distlib==0.3.1
einops==0.3.0
entrypoints==0.3
filelock==3.0.12
fsspec==0.8.5
future==0.18.2
gensim==3.6.0
google-api-core==1.25.1
google-auth==1.24.0
google-auth-oauthlib==0.4.2
googleapis-common-protos==1.52.0
gpustat==0.6.0
grpcio==1.35.0
hiredis==1.1.0
humanfriendly==9.1
humanize==3.2.0
hyperlink==21.0.0
idna==2.10
incremental==17.5.0
iniconfig==1.1.1
interval==1.0.0
ipykernel==5.4.3
ipython==7.20.0
ipython-genutils==0.2.0
ipywidgets==7.6.3
jedi==0.18.0
Jinja2==2.11.3
joblib==1.0.0
jsonschema==3.2.0
jupyter==1.0.0
jupyter-client==6.1.11
jupyter-console==6.2.0
jupyter-core==4.7.1
jupyterlab-pygments==0.1.2
jupyterlab-widgets==1.0.0
Markdown==3.3.3
MarkupSafe==1.1.1
millify==0.1.1
mistune==0.8.4
msgpack==1.0.2
multidict==5.1.0
nbclient==0.5.1
nbconvert==6.0.7
nbformat==5.1.2
nest-asyncio==1.5.1
nltk==3.5
notebook==6.2.0
numpy==1.19.5
nvidia-ml-py3==7.352.0
oauthlib==3.1.0
opencensus==0.7.12
opencensus-context==0.1.2
packaging==20.9
pandocfilters==1.4.3
parso==0.8.1
pbr==5.5.1
pexpect==4.8.0
pickleshare==0.7.5
plac==1.3.1
pluggy==0.13.1
prometheus-client==0.9.0
prompt-toolkit==3.0.14
protobuf==3.14.0
psutil==5.8.0
ptyprocess==0.7.0
py==1.10.0
py-spy==0.3.4
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.20
Pygments==2.7.4
PyHamcrest==2.0.2
PyNaCl==1.4.0
pyparsing==2.4.7
pyrsistent==0.17.3
pytest==6.2.2
python-dateutil==2.8.1
pytorch-lightning==1.1.6
pytz==2021.1
PyYAML==5.3.1
pyzmq==22.0.2
qtconsole==5.0.2
QtPy==1.9.0
ray==1.1.0
redis==3.5.3
regex==2020.11.13
requests==2.25.1
requests-oauthlib==1.3.0
rsa==4.7
scikit-learn==0.24.1
scipy==1.6.0
Send2Trash==1.5.0
six==1.15.0
sklearn==0.0
slackweb==1.0.5
smart-open==4.1.2
spiral==1.1.0
stevedore==3.3.0
tensorboard==2.4.1
tensorboard-plugin-wit==1.8.0
termcolor==1.1.0
terminado==0.9.2
testpath==0.4.4
threadpoolctl==2.1.0
tokenizers==0.10.0
toml==0.10.2
torch==1.7.1
tornado==6.1
tqdm==4.56.0
traitlets==5.0.5
Twisted==20.3.0
typing-extensions==3.7.4.3
urllib3==1.26.3
virtualenv==20.4.2
virtualenv-clone==0.5.4
virtualenvwrapper==4.8.4
wcwidth==0.2.5
webencodings==0.5.1
Werkzeug==1.0.1
widgetsnbextension==3.5.1
yarl==1.6.3
zope.interface==5.2.0
```

Stack trace
**Note: I compressed the path to make it easier to read**

```
Traceback (most recent call last):
  File "./data/code2seq-pytorch/venv/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "Eval_edit2.py
```
```python
from pytorch_lightning import Trainer
from models.Transformer.TransformerLightning import TransformerLightning
import torch
from utilities.config import SpecialCharacters
from argparse import Namespace
from utilities.print_utils import pp_prediction

test_line_cache = "/home/afonso/Universidade/data/DeepTC/java-small/test-line-cache.pkl"
test_path = "/home/afonso/Universidade/data/DeepTC/java-small/java-small.test.var.c2s"
ckp = "./trained-models/var-name/epoch=6.ckpt"

# model = TransformerLightning.load_from_checkpoint(ckp)
checkpoint = torch.load(ckp)
hparams = Namespace(**checkpoint["hparams"])
hparams.test_line_cache = test_line_cache
hparams.test_path = test_path
# print(hparams.keys())

model = TransformerLightning(hparams).cuda()
model.load_state_dict(checkpoint["state_dict"])

index_to_target = {v: k for k, v in hparams.target_to_index.items()}
pad_idx = hparams.target_to_index[SpecialCharacters.PAD_TOKEN]

print("Starting predictions")
for batch in model.test_dataloader():
    print(f"making single prediction")
    (
        labels,
        (start, end, path, masks, start_lengths, end_lengths, ast_path_lengths),
    ) = batch

    labels = labels.cuda()
    start = start.cuda()
    end = end.cuda()
    path = path.cuda()
    masks = masks.cuda()
    start_lengths = start_lengths.cuda()
    end_lengths = end_lengths.cuda()
    ast_path_lengths = ast_path_lengths.cuda()

    batch = (
        labels,
        (start, end, path, masks, start_lengths, end_lengths, ast_path_lengths),
    )

    predictions, raw = model.predict(batch)
    print(f"predictions complete")

    predictions = predictions.tolist()
    labels = labels.tolist()

    metrics = model.get_metrics(labels, predictions, pad_idx)

    prediction = [index_to_target[k] for k in predictions[0]]
    label = [index_to_target[k] for k in labels[0]]
    # print(predictions[0])
    # print(label[0])

    pp_prediction(prediction, label)
    print(metrics)
```