Hello! Thanks for your work on InferCode, it's awesome!
My name is Maksim Zubkov, and I am doing my bachelor thesis at JetBrains Research on self-supervised learning techniques for source code. I want to compare the pre-training scheme proposed in your paper with the one I am investigating in my own research.
I tried to initialize CodeClassificationData to train the model on my data, but I could not find a script that creates the .pkl files it expects. It now seems that I was finally able to run the preprocessing. These are the steps I followed:
1. As suggested in the README, I ran the following command to create .ids.csv files in examples/subtrees:
   docker run --rm -v $(pwd):/data -w /data --entrypoint /usr/local/bin/subtree -it yijun/fast examples/raw_code examples/subtrees node_types.csv
2. I explored the yijun/fast docker image and found the binary /usr/local/bin/pkl. Running docker with /usr/local/bin/pkl as the entry point produced several .pkl files.
3. I made minor changes to your repo, namely adding some __init__.py files.
4. I dealt with the missing fast_pb2.py file by simply copying it from the graph-ast repo.
Finally, I managed to create the trees object and run put_trees_into_bucket. Could you please answer a few questions?
1. Is this the correct procedure for preparing data for your model, or did I miss some important point? If it is correct, I could open a pull request that adds this information to the README.
2. I didn't get the difference between /usr/local/bin/pkl and /usr/local/bin/pklpos. Could you please explain it?
3. If I can help with open-sourcing the InferCode code base in any way, I would be glad to do so.
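In case it is useful for the README, here is a minimal Python sketch of how the two docker invocations above could be wrapped so that only the entry point changes between them. The helper name fast_command is my own and not part of InferCode. The subtree arguments are the ones from the README command, while the pkl arguments are left empty because I have not listed them above.

```python
import os
import shlex

IMAGE = "yijun/fast"


def fast_command(entrypoint, args, workdir=None):
    """Build the `docker run` argv for one binary inside the yijun/fast image.

    `fast_command` is a hypothetical helper for illustration only.
    """
    workdir = workdir or os.getcwd()
    return [
        "docker", "run", "--rm",
        "-v", f"{workdir}:/data",   # mount the repo root at /data, like $(pwd):/data
        "-w", "/data",
        "--entrypoint", entrypoint,
        "-it", IMAGE,               # -it needs a terminal; drop it for non-interactive runs
        *args,
    ]


# Subtree step: same arguments as the README command.
subtree_cmd = fast_command(
    "/usr/local/bin/subtree",
    ["examples/raw_code", "examples/subtrees", "node_types.csv"],
)

# pkl step: the arguments are not recorded in this post, so none are assumed here.
pkl_args = []
pkl_cmd = fast_command("/usr/local/bin/pkl", pkl_args)

# Print the commands instead of executing them, so they can be checked first.
print(shlex.join(subtree_cmd))
print(shlex.join(pkl_cmd))
```

Each printed line can be pasted into a shell, or the list can be passed to subprocess.run(cmd, check=True) once the arguments have been confirmed.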