ChrisCummins / ProGraML

A Graph-based Program Representation for Data Flow Analysis and Compiler Optimizations

DevMap experiments #144

Closed: Kadakol closed this issue 3 years ago

Kadakol commented 4 years ago

Hi @ChrisCummins ,

Thank you so much for your work with ProGraML!

I was trying to replicate the DevMap experiments. So far, I was able to use the code you have provided to create the dataset (using the command bazel run //programl/task/devmap/dataset:create).

However, I am not sure how to trigger the experiment itself. I tried bazel run //programl/task/dataflow:train_ggnn -- -analysis devmap --path=/path/to/datasets, along the same lines as the reachability analysis command, but it looks like this is not the way to do it.

Could you please point me in the right direction to reproduce the DevMap results as shown in the paper?

Thank you!

Just FYI - I was able to trigger the reachability analysis using the command bazel run //programl/task/dataflow:train_ggnn -- -analysis reachability --path=/path/to/datasets and it is running as of now.
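
For reference, the two commands that worked for me, in one place (the dataset path is a placeholder for your local directory):

bazel run //programl/task/devmap/dataset:create
bazel run //programl/task/dataflow:train_ggnn -- -analysis reachability --path=/path/to/datasets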

Kadakol commented 4 years ago

Hi Chris,

A question regarding this as well.

> Just FYI - I was able to trigger the reachability analysis using the command bazel run //programl/task/dataflow:train_ggnn -- -analysis reachability --path=/path/to/datasets and it is running as of now.

Progress has been stuck at this position for the last 24 hours.

Train to 1,000,000 graphs: 100018 graphs [11:13, 148.42 graphs/s, f1=0.955, loss=0.0162, prec=0.965, rec=0.956]
Val at 1,000,000 graphs: 99%|__________| 9905/10000 [00:11<00:00, 828.37 graphs/s, f1=0.977, loss=0.0095, prec=0.999, rec=0.958]
I1111 14:36:29 ggnn.py:214] Wrote /path/to/ProGraML/datasets/logs/programl/reachability/20:11:11T10:17:08/epochs/015.EpochList.pbtxt

Does that mean training is complete? Should I Ctrl+C to terminate it? Please let me know. Thank you!

ChrisCummins commented 4 years ago

Hi Akshay,

Thanks for reaching out. There are two separate issues here:

Reproducing the devmap experiments

We did a big refactor of the codebase before making this repo public and haven't yet got around to fully re-implementing the scripts for devmap/classifyapp. There is an open pull request (#107) that adds the missing code, but unfortunately I haven't got around to fully testing it yet, so it may be a while before it is ready to merge. Feel free to give it a go though! :)

Training stuck

That looks like a bug to me. I would suspect that the iterator that produces training graphs has crashed, causing the training loop to hang, though I would need more details to reproduce it locally.
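
To illustrate the failure mode I have in mind, here is a minimal standalone sketch (assumed behaviour, not the actual ProGraML loader): a background thread feeds graphs into a bounded queue, and a plain q.get() in the training loop blocks forever once that thread dies from an uncaught exception.

import queue
import threading

q = queue.Queue(maxsize=8)

def producer():
    for i in range(100):
        if i == 50:
            raise RuntimeError("graph iterator crashed")  # thread dies silently
        q.put(i)
    q.put(None)  # end-of-data sentinel; never reached after the crash

threading.Thread(target=producer, daemon=True).start()

consumed = 0
while True:
    try:
        # Without a timeout this get() blocks forever once the producer is
        # dead -- exactly the "stuck training" symptom described above.
        item = q.get(timeout=5)
    except queue.Empty:
        print(f"producer died after {consumed} items")
        break
    if item is None:
        break
    consumed += 1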

As a workaround, you can Ctrl + c to terminate training at any point, then re-run the command with --restore_from=/path/to/ProGraML/datasets/logs/programl/reachability/20:11:11T10:17:08 to pick up from the most-recent checkpoint.
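
Concretely, for your run that would be something along the lines of (same flags as your original invocation):

bazel run //programl/task/dataflow:train_ggnn -- -analysis reachability --path=/path/to/datasets --restore_from=/path/to/ProGraML/datasets/logs/programl/reachability/20:11:11T10:17:08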

Cheers, Chris

Kadakol commented 3 years ago

Hi Chris,

Thank you for your response!

> Reproducing the devmap experiments
>
> We did a big refactor of the codebase before making this repo public and haven't yet got around to fully re-implementing the scripts for devmap/classifyapp. There is an open pull request (#107) that adds the missing code, but unfortunately I haven't got around to fully testing it yet, so it may be a while before it is ready to merge. Feel free to give it a go though! :)

Understood. Let me have a look at the branch feature/classifyapp_81 then!

> Training stuck
>
> That looks like a bug to me. I would suspect that the iterator that produces training graphs has crashed, causing the training loop to hang, though I would need more details to reproduce it locally.
>
> As a workaround, you can Ctrl + c to terminate training at any point, then re-run the command with --restore_from=/path/to/ProGraML/datasets/logs/programl/reachability/20:11:11T10:17:08 to pick up from the most-recent checkpoint.

I tried this out, but it doesn't seem to be resuming. This looks like the same issue reported in #140. I'll track the discussion on #140 to understand how to get this working.

Kadakol commented 3 years ago

Hi Chris,

I've started off with feature/classifyapp_81 but there seem to be some issues.

  1. I had to pip install a bunch of packages as and when the errors popped up:
    1. docopt
    2. torch_geometric
    3. torch_sparse
  2. This caused another error - RuntimeError: Detected that PyTorch and torch_sparse were compiled with different CUDA versions. PyTorch has CUDA version 10.2 and torch_sparse has CUDA version 10.0. Please reinstall the torch_sparse that matches your PyTorch install. So I had to fix that using pip install torch-geometric torch-sparse==latest+cu101 torch-scatter==latest+cu101 -f https://pytorch-geometric.com/whl/torch-1.4.0.html
  3. This led to another issue - RuntimeError: Expected PyTorch version 1.4 but found version 1.6. I was able to fix it using conda install pytorch==1.4.0 torchvision==0.5.0 cudatoolkit=10.1 -c pytorch
  4. This much was sufficient to solve the import errors (a quick version sanity check is sketched after this list).
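
As a quick sanity check that the versions finally lined up (assuming torch now imports cleanly):

import torch

print(torch.__version__)   # expect 1.4.0 after the conda install above
print(torch.version.cuda)  # expect 10.1 to match the cu101 wheels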

These fixes then gave rise to a new set of issues:

  1. Running python run.py -h gave the following error:

     Traceback (most recent call last):
       File "run.py", line 60, in <module>
         import configs, modeling
       File "/path/to/ProGraML/programl/task/graph_level_classification/configs.py", line 17, in <module>
         from .dataset import AblationVocab
     ImportError: attempted relative import with no known parent package
  2. In order to fix this, I created an __init__.py file in the graph_level_classification directory and removed the leading . from all the relative imports in the files in this directory (a before/after sketch follows this list).
  3. The relative import error was fixed, but a new error has now occurred:

     Traceback (most recent call last):
       File "run.py", line 61, in <module>
         from configs import (
       File "/path/to/ProGraML/programl/task/graph_level_classification/configs.py", line 18, in <module>
         from dataset import AblationVocab
       File "/path/to/ProGraML/programl/task/graph_level_classification/dataset.py", line 19, in <module>
         from programl.proto.program_graph_pb2 import ProgramGraph
     ModuleNotFoundError: No module named 'programl.proto.program_graph_pb2'
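
For reference, the fix in item 2 amounts to this kind of edit in each affected file (illustrative; configs.py is shown, and the exact set of files may differ):

-from .dataset import AblationVocab
+from dataset import AblationVocab

The leading dot makes Python resolve the import relative to an enclosing package, which doesn't exist when run.py is executed directly as a script; dropping it turns the line into an absolute import resolved against sys.path.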

I wasn't sure how to solve this, so I decided to take inspiration from the way the dataflow experiments are set up, i.e., use bazel (I'm a bazel noob, btw). I created a BUILD file (a rough sketch of it follows), filled it in, and commented out the 3 assert statements in ProGraML/programl/task/graph_level_classification/dataset.py. Finally, the command bazel run //programl/task/graph_level_classification:run -- -h gave me the correct output.
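
In case it helps anyone else, this is roughly the shape of the BUILD rule I mean (all labels here are guesses, especially the proto dependency; the real rule for #107 may look quite different):

py_binary(
    name = "run",
    srcs = [
        "configs.py",
        "dataset.py",
        "modeling.py",
        "run.py",
    ],
    deps = [
        # Hypothetical label: whichever target generates
        # programl/proto/program_graph_pb2.py must be listed here.
        "//programl/proto:programl_py",
    ],
)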

Now I will proceed with trying to run the devmap experiments.

All of these seem to be environment-related issues. Is my environment not set up properly? I assumed that since I was able to trigger the reachability test, that should not be the case. I understand this branch is still a work in progress (or it's possible that I've missed something), but maybe this particular test could benefit from some more documentation. I hope you don't mind my feedback.

Kadakol commented 3 years ago

I had to make a few more changes. It looks like the directory structure has changed. The main changes are as follows:

In dataset.py, I had to change the following:

-PROGRAML_VOCABULARY = REPO_ROOT / "deeplearning/ml4pl/poj104/programl_vocabulary.csv"
-CDFG_VOCABULARY = REPO_ROOT / "deeplearning/ml4pl/poj104/cdfg_vocabulary.csv"
+PROGRAML_VOCABULARY = REPO_ROOT / "datasets/vocab/programl.csv"
+CDFG_VOCABULARY = REPO_ROOT / "datasets/vocab/cdfg.csv"

In run.py, I changed the following:

-    "devmap_amd": (DevmapDataset, "deeplearning/ml4pl/poj104/devmap_data"),                                                                          
-    "devmap_nvidia": (DevmapDataset, "deeplearning/ml4pl/poj104/devmap_data"),                                                                       
+    "devmap_amd": (DevmapDataset, "datasets/dataflow"), 
+    "devmap_nvidia": (DevmapDataset, "datasets/dataflow"), 

And now I've got the following error:

Number of trainable params in GGNNModel: 4,770,238 params, weights size: ~19.0MB.
Traceback (most recent call last):
  File "/home/akshay/.cache/bazel/_bazel_akshay/68fc6203059f9070babbb4937a38c157/execroot/programl/bazel-out/k8-fastbuild/bin/programl/task/graph_level_classification/run.runfiles/programl/programl/task/graph_level_classification/run.py", line 1063, in <module>
    current_kfold_split=split,
  File "/home/akshay/.cache/bazel/_bazel_akshay/68fc6203059f9070babbb4937a38c157/execroot/programl/bazel-out/k8-fastbuild/bin/programl/task/graph_level_classification/run.runfiles/programl/programl/task/graph_level_classification/run.py", line 205, in __init__
    self.load_data(dataset, args["--kfold"], current_kfold_split)
  File "/home/akshay/.cache/bazel/_bazel_akshay/68fc6203059f9070babbb4937a38c157/execroot/programl/bazel-out/k8-fastbuild/bin/programl/task/graph_level_classification/run.runfiles/programl/programl/task/graph_le
vel_classification/run.py", line 331, in load_data
    current_kfold_split
  File "/path/to/ProGraML/programl/task/graph_level_classification/dataset.py", line 1009, in return_cross_validation_splits
    train_data = self.__indexing__(train_index)
AttributeError: 'DevmapDataset' object has no attribute '__indexing__'

The command that I'm using is bazel run //programl/task/graph_level_classification:run -- --model ggnn_devmap --dataset devmap_nvidia --kfold --config_json="{'cdfg': False}". I tried to look at the documentation but could not find an '__indexing__' attribute on the InMemoryDataset class, so I'm pretty much stuck here.
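
If it helps with debugging, the failing line seems to want to select the training graphs at train_index. Here is a self-contained toy example of that subsetting using the stock torch.utils.data.Subset (a sketch of the apparent intent, not a verified fix for DevmapDataset):

import torch
from torch.utils.data import Subset, TensorDataset

dataset = TensorDataset(torch.arange(10))    # stand-in for DevmapDataset
train_index = torch.tensor([0, 2, 4, 6, 8])  # stand-in for one k-fold split
train_data = Subset(dataset, train_index.tolist())
print(len(train_data))  # -> 5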

ChrisCummins commented 3 years ago

Hi @Kadakol,

You're right. There was a bug in my data loader which could cause training to hang forever, same as #140. I have implemented a fix and merged it into development.

Thank you very much for your investigation of #107! That is all very useful feedback. Sorry for sending you down a rabbit hole - I will be taking a look at that PR and will let you know once it is passing the tests and ready to merge. I'm taking some much-needed time off this week but I will keep you posted 👍

Merging this issue into #81; please subscribe there for future updates.

Cheers, Chris

Kadakol commented 3 years ago

Hi Chris,

Thanks a lot for fixing #140. I'll take a look at it!

No worries. Glad I could be of some minor help! Hope you have a good break :)

I'll keep an eye on #81. Thank you!