Preprocessing of datasets

Hello!

Thanks for your work on InferCode, it's awesome! My name is Maksim Zubkov, and I am writing my bachelor thesis at JetBrains Research on self-supervised learning techniques for source code. I want to compare the pre-training scheme proposed in your paper with the one I am investigating in my research.

I tried to initialize CodeClassificationData to train the model on my data, but I could not find a script that creates the files with a .pkl extension. It now seems that I have finally managed to run the preprocessing. These are the steps I followed:

1. As suggested in the README, I executed: docker run --rm -v $(pwd):/data -w /data --entrypoint /usr/local/bin/subtree -it yijun/fast examples/raw_code examples/subtrees node_types.csv to create the .ids.csv files in examples/subtrees.
2. Then I explored the yijun/fast docker image and found the binary /usr/local/bin/pkl. I ran the container with /usr/local/bin/pkl as the entry point, which produced several .pkl files.
3. Then I made minor changes to your repo, namely adding some __init__.py files.
4. The next step was to deal with the fast_pb2.py file, which I simply copied from the graph-ast repo.
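
For reference, the two docker invocations above can be sketched as a small Python helper that only builds the command lines. The subtree arguments are the ones from the README command; the helper name docker_command is hypothetical, and the arguments I passed to /usr/local/bin/pkl are not reproduced here, so that list is left empty as a placeholder rather than guessed.

```python
import os
import shlex

IMAGE = "yijun/fast"  # docker image named in the README


def docker_command(entrypoint, args, workdir=None):
    """Build (but do not run) a docker command that mounts workdir at /data.

    Hypothetical helper for illustration; it mirrors the flags of the
    README command and only swaps the entry point and its arguments.
    """
    workdir = workdir or os.getcwd()
    return [
        "docker", "run", "--rm",
        "-v", f"{workdir}:/data",
        "-w", "/data",
        "--entrypoint", entrypoint,
        "-it", IMAGE,
        *args,
    ]


# Step 1: the subtree command, with the arguments given in the README.
subtree_cmd = docker_command(
    "/usr/local/bin/subtree",
    ["examples/raw_code", "examples/subtrees", "node_types.csv"],
)

# Step 2: same image, different entry point. The arguments are a
# placeholder: fill in whatever the pkl binary actually expects.
pkl_args = []
pkl_cmd = docker_command("/usr/local/bin/pkl", pkl_args)

print(shlex.join(subtree_cmd))
print(shlex.join(pkl_cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) to execute it; the -it flag requires an interactive terminal, so it may need to be dropped when running from a script.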

Finally, I succeeded in creating the trees object and running put_trees_into_bucket. Could you please answer a few questions:

1. Is this the correct procedure for preparing data for your model? If so, I can open a pull request that adds this information to the README. Or did I miss an important step?
2. I didn't understand the difference between /usr/local/bin/pkl and /usr/local/bin/pklpos. Could you please explain it?
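
In the meantime, one way to compare the outputs of the two binaries empirically is to load the generated files and look at what they contain. This is only a sketch under assumptions: it assumes the .pkl files are ordinary Python pickles (suggested by the extension, not confirmed), that any classes they reference, such as those in fast_pb2.py, are importable, and the function names are hypothetical.

```python
import pickle
from pathlib import Path


def summarise_pickle(path):
    """Return (type name, length or None) for the object stored in path.

    Only use this on files you generated yourself: unpickling
    untrusted data can execute arbitrary code.
    """
    with open(path, "rb") as handle:
        obj = pickle.load(handle)
    size = len(obj) if hasattr(obj, "__len__") else None
    return type(obj).__name__, size


def summarise_directory(directory):
    """Summarise every .pkl file directly inside directory, keyed by file name."""
    return {
        path.name: summarise_pickle(path)
        for path in sorted(Path(directory).glob("*.pkl"))
    }
```

Running summarise_directory on the output folder of each binary and comparing the two dictionaries should show whether the files differ in top-level type or size.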

If I can help with open-sourcing the InferCode code base in any way, I would be glad to do so.

bdqnghi / infercode
