ChrisCummins / phd

👨‍💻 My PhD.
http://chriscummins.cc/
185 stars 34 forks

Add a tutorial for porting DL program generation to other languages #30

Open 50417 opened 6 years ago

50417 commented 6 years ago

In the paper, you mentioned that your RNN can easily be ported to any other programming language. We are trying to validate that claim. Could you provide a tutorial, if it is possible to use just the RNN model in the code base?

ChrisCummins commented 6 years ago

Hi Sohil, sure! I was actually working on a doc file to do exactly that. Unfortunately, though, I am taking the next three months off my PhD and won't be working on it until I'm back. In the meantime you're welcome to poke around in the code to do it yourself. The process is really quite straightforward:

  1. Create a corpus for a programming language using the code in //datasets/github/scrape_repos.
  2. Create a CLgen model to train on the new corpus (see this file for an example corpus).
  3. Train and sample the model, which is as simple as: $ blaze run //deeplearning/clgen -- --config=/path/to/the/config/file.

Let me know how you get on!

Cheers, Chris

50417 commented 6 years ago

Thank you for the quick reply. I will try to experiment with the code tomorrow. I will let you know if I have any questions or concerns.

ChrisCummins commented 6 years ago

No worries :) I'll actually keep this issue open as a reminder to myself, and in case anyone else wants something similar.

Cheers, Chris

50417 commented 6 years ago

Hello Chris, I have been trying to run CLgen on macOS. After debugging for some time, I am still unable to get it to train on the test corpus. I get the following error:

`clgen.py 176 ERROR invalid literal for int() with base 10: '' (ValueError)= stacktrace:

1 /private/var/tmp/_bazel_sohilshrestha/f8ee67cead6a3e5516303f5e0dd3d4e7/sandbox/darwin-sandbox/20/execroot/phd/bazel-out/darwin-py3-opt/bin/deeplearning/clgen/clgen_test.runfiles/phd/deeplearning/clgen/corpuses/corpuses.py:112 init()

2 /private/var/tmp/_bazel_sohilshrestha/f8ee67cead6a3e5516303f5e0dd3d4e7/sandbox/darwin-sandbox/20/execroot/phd/bazel-out/darwin-py3-opt/bin/deeplearning/clgen/clgen_test.runfiles/phd/deeplearning/clgen/models/models.py:66 init()

3 /private/var/tmp/_bazel_sohilshrestha/f8ee67cead6a3e5516303f5e0dd3d4e7/sandbox/darwin-sandbox/20/execroot/phd/bazel-out/darwin-py3-opt/bin/deeplearning/clgen/clgen_test.runfiles/phd/deeplearning/clgen/clgen.py:100 init()

4 /private/var/tmp/_bazel_sohilshrestha/f8ee67cead6a3e5516303f5e0dd3d4e7/sandbox/darwin-sandbox/20/execroot/phd/bazel-out/darwin-py3-opt/bin/deeplearning/clgen/clgen_test.runfiles/phd/deeplearning/clgen/clgen.py:244 DoFlagsAction()

5 /private/var/tmp/_bazel_sohilshrestha/f8ee67cead6a3e5516303f5e0dd3d4e7/sandbox/darwin-sandbox/20/execroot/phd/bazel-out/darwin-py3-opt/bin/deeplearning/clgen/clgen_test.runfiles/phd/deeplearning/clgen/clgen.py:205 RunContext()`

I ran the recommended tests for CLgen; 10 out of 21 passed. To run it on macOS, I created a virtualenv and ran the code there, using Python 3.6.5. I also hit a Bazel issue like the one described in https://github.com/tensorflow/tensorflow/issues/10436. The error encountered was:

ERROR: /private/var/tmp/_bazel_sohilshrestha/f8ee67cead6a3e5516303f5e0dd3d4e7/external/base/image/BUILD:6:1: Couldn't build file external/base/image/002.tar.gz.nogz.sha256: SHA256 external/base/image/002.tar.gz.nogz.sha256 failed (Exit 1): sha256 failed: error executing command
  (cd /private/var/tmp/_bazel_sohilshrestha/f8ee67cead6a3e5516303f5e0dd3d4e7/execroot/phd && \
  exec env - \
  bazel-out/host/bin/external/bazel_tools/tools/build_defs/hash/sha256 bazel-out/darwin-opt/bin/external/base/image/002.tar.gz.nogz bazel-out/darwin-opt/bin/external/base/image/002.tar.gz.nogz.sha256)
Use --sandbox_debug to see verbose messages from the sandbox
Traceback (most recent call last):
  File "bazel-out/host/bin/external/bazel_tools/tools/build_defs/hash/sha256", line 203, in <module>
    Main()
  File "bazel-out/host/bin/external/bazel_tools/tools/build_defs/hash/sha256", line 176, in Main
    raise AssertionError('Could not find python binary: ' + PYTHON_BINARY)
AssertionError: Could not find python binary: python3.6

There are a few other errors as well: `path = PosixPath('/var/folders/68/_gxs799d0930bmq_170px4fw0000gn/T/clgen_abc_corpus_ghp3hxfo')

def GetDirectoryMTime(path: pathlib.Path) -> int:
  """Get the timestamp of the most recently modified file/dir in directory.

  Recursively checks subdirectory contents. This requires that the directory
  exists and is not empty.

  Params:
    abspath: The absolute path to the directory.

  Returns:
    The seconds since epoch of the last modification.
  """
  # Pure python implementation.
  # return int(max(
  #     max(os.path.getmtime(os.path.join(root, file)) for file in files) for
  #     root, _, files in os.walk(path)))
  # Faster implementation using UNIX tools. Requires GNU xargs, which supports
  # the '-d' argument, which is needed to support file names with spaces. On
  # macOS, this means having the homebrew findutils package installed, and
  # the following directory in your PATH:
  #    /usr/local/opt/findutils/libexec/gnubin
  output = subprocess.check_output(
      f"find '{path}' -type f | xargs -d'\n' stat -c '%Y:%n' | sort -t: -n | "
      "tail -1 | cut -d: -f1", universal_newlines=True, shell=True)
  return int(output)
E     ValueError: invalid literal for int() with base 10: ''`

ChrisCummins commented 6 years ago

Hi there, sorry I’m writing this on my iPad so can’t test the fix - but I think I see what the problem is. If you find the file which contains the function ‘def GetDirectoryMTime(’, you’ll see the comment ‘Pure python implementation’, followed by a commented-out return int(max(.... If you uncomment that return statement, it should fix the error.

ChrisCummins commented 6 years ago

The problem is that I’ve hardcoded a reference to the GNU xargs command, and macOS ships with a BSD implementation. I’ll fix up the docs / code to work around this. Thanks for reporting the issue!
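
For anyone hitting the same error, here is a minimal sketch of a portable version of that helper, based on the commented-out pure-Python lines quoted in the traceback above. The function name `get_directory_mtime` and the explicit empty-directory check are assumptions made for illustration, not the code that is in the repository. It avoids `find`, `xargs`, and `stat` entirely, so it should not depend on whether GNU or BSD tools are installed.

```python
import os
import pathlib


def get_directory_mtime(path: pathlib.Path) -> int:
    """Return the mtime (seconds since epoch) of the newest file under path.

    Walks the directory recursively. Raises ValueError if no files are found,
    instead of failing later on an empty string as the shell pipeline does.
    """
    # Collect the modification time of every regular file in the tree.
    mtimes = [
        os.path.getmtime(os.path.join(root, name))
        for root, _, files in os.walk(path)
        for name in files
    ]
    if not mtimes:
        raise ValueError(f"No files found under {path}")
    return int(max(mtimes))
```

This is likely slower than the shell pipeline on very large corpora, which is presumably why the original used UNIX tools, but it handles file names with spaces without any special flags.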

JiajieZhang-Georgia commented 5 years ago

Hi Chris and @50417, I tried to run the code for creating a corpus for a language. When I run the first command, bazel run //datasets/github/scrape_repos/scraper --clone_list $PWD/clone_list.pbtxt, it gives me the error 'ERROR: Unrecognized option: --clone_list'. Do you know how to solve it?

ChrisCummins commented 5 years ago

Hi @JiajieZhang-Georgia, whoops, I'm sorry, I missed a -- in the README. The command is:

bazel run //datasets/github/scrape_repos:scraper -- --clone_list $PWD/clone_list.pbtxt
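
As background on why the bare -- matters: with bazel run, everything before it is parsed as Bazel's own options, and everything after it is passed through to the program being run. The sketch below only illustrates that convention; `split_at_double_dash` is a hypothetical helper written for this example, not Bazel's actual argument parser.

```python
from typing import List, Tuple


def split_at_double_dash(argv: List[str]) -> Tuple[List[str], List[str]]:
    """Split argv into (tool_args, program_args) at the first bare '--'."""
    if "--" in argv:
        idx = argv.index("--")
        return argv[:idx], argv[idx + 1:]
    # Without a separator, every argument is treated as a tool argument.
    return argv, []


tool_args, program_args = split_at_double_dash(
    ["run", "//datasets/github/scrape_repos:scraper", "--",
     "--clone_list", "clone_list.pbtxt"])
print(tool_args)     # arguments the build tool would interpret
print(program_args)  # arguments forwarded to the scraper
```

Without the separator, --clone_list lands in the first list, which matches the 'Unrecognized option' error reported above.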

50417 commented 5 years ago

Hi @ChrisCummins,

Are there any updates on this?

Is it possible to port CLgen to other OS environments, such as Windows or other Linux distributions?

ChrisCummins commented 5 years ago

Hey @50417, thanks for your patience! :-) I can see you've made good progress on adapting it to Simulink. If you're looking for specific help with your project I may be able to help out - I would also be interested in getting your work upstream. If you're interested in collaborating, shoot me an email at chrisc.101@gmail.com

Cheers, Chris

50417 commented 5 years ago

Hello everyone, I have created a bare-minimum CLgen using basic Python scripts (without the need for Bazel) here. Let me know if there are any issues, and whether this issue can be closed.

ChrisCummins commented 5 years ago

Interesting! What, in your experience, is the biggest issue for using this project that your fork overcomes?

50417 commented 5 years ago

The biggest issue was that I had to rebuild all of your projects in the phd repository. Bazel also had a bit of a learning curve, it does not officially support Python, and the fact that it is still in beta was a problem when there were bugs.