Closed: Kadakol closed this issue 3 years ago.
Hi Chris,
A question regarding this as well.
Just FYI - I was able to trigger the reachability analysis using the command
bazel run //programl/task/dataflow:train_ggnn -- -analysis reachability --path=/path/to/datasets
and it is running as of now.
Progress has been stuck at this position for the last 24 hours.
Train to 1,000,000 graphs: 100018 graphs [11:13, 148.42 graphs/s, f1=0.955, loss=0.0162, prec=0.965, rec=0.956]
Val at 1,000,000 graphs: 99%|__________| 9905/10000 [00:11<00:00, 828.37 graphs/s, f1=0.977, loss=0.0095, prec=0.999, rec=0.958]
I1111 14:36:29 ggnn.py:214] Wrote /path/to/ProGraML/datasets/logs/programl/reachability/20:11:11T10:17:08/epochs/015.EpochList.pbtxt
Does that mean training is complete? Should I Ctrl + c to terminate it? Please let me know. Thank you!
Hi Akshay,
Thanks for reaching out. There are two separate issues here:
We did a big refactor of the codebase before making this repo public and haven't yet got around to fully re-implementing the scripts for devmap/classifyapp. There is an open pull request that adds the missing code here #107, but unfortunately I haven't got around to fully testing it yet so it may be a bit before it is ready to merge. Feel free to give it a go though! :)
That looks like a bug to me. I would suspect that the iterator that produces training graphs has crashed, causing the training loop to hang, though I would need more details to reproduce it locally.
As a workaround, you can Ctrl + c to terminate training at any point, then re-run the command with --restore_from=/path/to/ProGraML/datasets/logs/programl/reachability/20:11:11T10:17:08
to pick up from the most-recent checkpoint.
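For example, assuming the same analysis and dataset path as your original run, the full resume command would look something like:
bazel run //programl/task/dataflow:train_ggnn -- -analysis reachability --path=/path/to/datasets --restore_from=/path/to/ProGraML/datasets/logs/programl/reachability/20:11:11T10:17:08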
Cheers, Chris
Hi Chris,
Thank you for your response!
Reproducing the devmap experiments
We did a big refactor of the codebase before making this repo public and haven't yet got around to fully re-implementing the scripts for devmap/classifyapp. There is an open pull request that adds the missing code here #107, but unfortunately I haven't got around to fully testing it yet so it may be a bit before it is ready to merge. Feel free to give it a go though! :)
Understood. Let me have a look at the branch feature/classifyapp_81 then!
Training stuck
That looks like a bug to me. I would suspect that the iterator that produces training graphs has crashed, causing the training loop to hang, though I would need more details to reproduce it locally.
As a workaround, you can Ctrl + c to terminate training at any point, then re-run the command with
--restore_from=/path/to/ProGraML/datasets/logs/programl/reachability/20:11:11T10:17:08
to pick up from the most-recent checkpoint.
I tried this out, but it doesn't seem to be resuming. This looks like the same issue reported in #140. I'll track the discussion on #140 to understand how to get this working.
Hi Chris,
I've started off with feature/classifyapp_81 but there seem to be some issues.
I had to pip install a bunch of packages as and when the errors popped up:
RuntimeError: Detected that PyTorch and torch_sparse were compiled with different CUDA versions. PyTorch has CUDA version 10.2 and torch_sparse has CUDA version 10.0. Please reinstall the torch_sparse that matches your PyTorch install.
So I had to fix that using pip install torch-geometric torch-sparse==latest+cu101 torch-scatter==latest+cu101 -f https://pytorch-geometric.com/whl/torch-1.4.0.html
RuntimeError: Expected PyTorch version 1.4 but found version 1.6.
I was able to fix it using conda install pytorch==1.4.0 torchvision==0.5.0 cudatoolkit=10.1 -c pytorch
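To sanity-check the environment after changing these packages, a quick one-liner like the following (an illustrative check, not something from the repo) prints the PyTorch and CUDA versions and should fail straight away if torch_sparse still disagrees, since that mismatch error is raised when torch_sparse loads:
python -c "import torch; print(torch.__version__, torch.version.cuda); import torch_sparse"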
This then gave rise to a new set of issues. Running python run.py -h gave the following error:
Traceback (most recent call last):
  File "run.py", line 60, in <module>
    import configs, modeling
  File "/path/to/ProGraML/programl/task/graph_level_classification/configs.py", line 17, in <module>
    from .dataset import AblationVocab
ImportError: attempted relative import with no known parent package
To work around this, I created an __init__.py file in the graph_level_classification directory and removed the leading . from all the imports starting with . in the files in this directory. That then gave:
Traceback (most recent call last):
  File "run.py", line 61, in <module>
    from configs import (
  File "/path/to/ProGraML/programl/task/graph_level_classification/configs.py", line 18, in <module>
    from dataset import AblationVocab
  File "/path/to/ProGraML/programl/task/graph_level_classification/dataset.py", line 19, in <module>
    from programl.proto.program_graph_pb2 import ProgramGraph
ModuleNotFoundError: No module named 'programl.proto.program_graph_pb2'
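For illustration, the kind of import change this involved (taken from the two tracebacks above, in configs.py) was:
- from .dataset import AblationVocab
+ from dataset import AblationVocab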
I wasn't sure how to solve this, so I decided to take inspiration from the way the dataflow experiments are set up, i.e., use bazel (I'm a bazel noob, btw). I created a BUILD file, filled it up, and commented out the 3 assert statements in ProGraML/programl/task/graph_level_classification/dataset.py. Finally, the command bazel run //programl/task/graph_level_classification:run -- -h gave me the correct output.
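For reference, the core of such a BUILD file is a py_binary target roughly along these lines (the srcs, main, and deps here are illustrative assumptions, not the exact contents of the file):
py_binary(
    name = "run",
    srcs = glob(["*.py"]),
    main = "run.py",
    # dependencies on the generated programl proto bindings and on the pip
    # packages that run.py imports would also need to be declared here
)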
Now I will proceed with trying to run the devmap experiments.
All of these seem to be environment-related issues. Is my environment not set up properly? I assumed that, since I was able to run the reachability experiment, that shouldn't be the case. I understand this branch is still a work in progress (or it's possible that I've missed it), but maybe this particular test could benefit from some more documentation. I hope you don't mind my feedback.
I had to make a few more changes. It looks like the directory structure has changed. The main changes are as follows:
In dataset.py, I had to change the following:
-PROGRAML_VOCABULARY = REPO_ROOT / "deeplearning/ml4pl/poj104/programl_vocabulary.csv"
-CDFG_VOCABULARY = REPO_ROOT / "deeplearning/ml4pl/poj104/cdfg_vocabulary.csv"
+PROGRAML_VOCABULARY = REPO_ROOT / "datasets/vocab/programl.csv"
+CDFG_VOCABULARY = REPO_ROOT / "datasets/vocab/cdfg.csv"
In run.py, I changed the following:
- "devmap_amd": (DevmapDataset, "deeplearning/ml4pl/poj104/devmap_data"),
- "devmap_nvidia": (DevmapDataset, "deeplearning/ml4pl/poj104/devmap_data"),
+ "devmap_amd": (DevmapDataset, "datasets/dataflow"),
+ "devmap_nvidia": (DevmapDataset, "datasets/dataflow"),
And now I've got the following error:
Number of trainable params in GGNNModel: 4,770,238 params, weights size: ~19.0MB.
Traceback (most recent call last):
File "/home/akshay/.cache/bazel/_bazel_akshay/68fc6203059f9070babbb4937a38c157/execroot/programl/bazel-out/k8-fastbuild/bin/programl/task/graph_level_classification/run.runfiles/programl/programl/task/graph_level_classification/run.py", line 1063, in <module>
current_kfold_split=split,
File "/home/akshay/.cache/bazel/_bazel_akshay/68fc6203059f9070babbb4937a38c157/execroot/programl/bazel-out/k8-fastbuild/bin/programl/task/graph_level_classification/run.runfiles/programl/programl/task/graph_level_classification/run.py", line 205, in __init__
self.load_data(dataset, args["--kfold"], current_kfold_split)
File "/home/akshay/.cache/bazel/_bazel_akshay/68fc6203059f9070babbb4937a38c157/execroot/programl/bazel-out/k8-fastbuild/bin/programl/task/graph_level_classification/run.runfiles/programl/programl/task/graph_le
vel_classification/run.py", line 331, in load_data
current_kfold_split
File "/path/to/ProGraML/programl/task/graph_level_classification/dataset.py", line 1009, in return_cross_validation_splits
train_data = self.__indexing__(train_index)
AttributeError: 'DevmapDataset' object has no attribute '__indexing__'
The command that I'm using is bazel run //programl/task/graph_level_classification:run -- --model ggnn_devmap --dataset devmap_nvidia --kfold --config_json="{'cdfg': False}". I tried to look at the documentation but could not find the attribute '__indexing__' for the class InMemoryDataset. So I'm pretty much stuck here.
Hi @Kadakol,
You're right. There was a bug in my data loader which could cause training to hang forever, the same as #140. I have implemented a fix and merged it into development.
Thank you very much for your investigation of #107! That is all very useful feedback. Sorry for sending you down a rabbit hole - I will be taking a look at that PR and will let you know once it is passing the tests and ready to merge. I'm taking some much-needed time off this week but I will keep you posted 👍
Merging this issue into #81, please subscribe there for future updates.
Cheers, Chris
Hi Chris,
Thanks a lot for fixing #140. I'll take a look at it!
No worries. Glad I could be of some minor help! Hope you have a good break :)
I'll keep an eye on #81. Thank you!
Hi @ChrisCummins ,
Thank you so much for your work with ProGraML!
I was trying to replicate the DevMap experiments. So far, I was able to use the code you have provided to create the dataset (using the command bazel run //programl/task/devmap/dataset:create). However, I am not sure how to trigger the experiment itself. I tried bazel run //programl/task/dataflow:train_ggnn -- -analysis devmap --path=/path/to/datasets, along the same lines as the reachability analysis command, but it looks like this is not the way to do it. Could you please point me in the right direction to reproduce the DevMap results as shown in the paper?
Thank you!