Transform for Code Profiling

pankajskku commented 1 month ago

This transform extracts the base syntactic concepts from multi-language source code and represents them in a unified, language-agnostic form that can be used for multi-language data profiling. Programming languages expose similar syntactic building blocks to express programming intent, such as package/library imports, functions, classes, loops, conditionals, and comments, but each language expresses these concepts through its own grammar, defined by distinct keywords and syntactic forms.
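
As a rough illustration of the idea, the sketch below maps language-specific import syntax onto a single language-agnostic record. It is not the transform's implementation: the discussion in this PR indicates the transform relies on tree-sitter bindings, whereas this example swaps in plain regular expressions to stay self-contained. The `Concept` type, `IMPORT_PATTERNS`, and `extract_imports` are hypothetical names chosen for the example.

```python
import re
from dataclasses import dataclass

# Hypothetical per-language patterns for one concept ("import").
# A parser-based extractor would be far more robust than these regexes.
IMPORT_PATTERNS = {
    "python": re.compile(r"^\s*(?:import|from)\s+([\w.]+)", re.MULTILINE),
    "java": re.compile(r"^\s*import\s+(?:static\s+)?([\w.]+)\s*;", re.MULTILINE),
    "go": re.compile(r'^\s*import\s+"([^"]+)"', re.MULTILINE),
}


@dataclass(frozen=True)
class Concept:
    """Language-agnostic record for one extracted syntactic concept."""
    kind: str      # e.g. "import"
    value: str     # e.g. the imported package name
    language: str  # source language the concept came from


def extract_imports(source: str, language: str) -> list[Concept]:
    """Return the import concepts found in `source` for a known language."""
    pattern = IMPORT_PATTERNS[language]
    return [Concept("import", m.group(1), language) for m in pattern.finditer(source)]


py = extract_imports("import os\nfrom pathlib import Path\n", "python")
java = extract_imports("import java.util.List;\n", "java")
# Both languages now share one shape: Concept(kind, value, language).
print([c.value for c in py + java])
```

Downstream profiling code can then work on `Concept` records alone, without knowing which grammar produced them.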

Why are these changes needed?

Data profiling, in the context of machine learning, is the process of examining and analyzing data to create useful statistics. These statistics serve both as an aid to understanding the properties of the data and as input to a variety of downstream data processing tasks, such as data valuation (assessing the value of data relative to the business objectives at hand) and data curation (filtering and prioritizing training data based on derived thresholds). In the Large Language Model (LLM) setting, training data is typically unstructured, comprising natural language text, images, and code. In this work, we focus specifically on code LLMs, where the quality of the code training data substantially affects model accuracy on LLM-based coding tasks such as code generation and summarization. The ability to characterize code data in terms of programming language concepts therefore helps both in deriving insights about code training/evaluation data and in the downstream curation of code training data. We address the problem of profiling multilingual code datasets by extracting an extensible, user-defined set of syntactic concepts over arbitrary programming languages.
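
To make the profile-then-curate flow concrete, here is a minimal sketch. It assumes each code sample has already been reduced to a list of concept labels; the record layout, the `profile` and `curate` helpers, and the threshold value are illustrative assumptions, not part of the transform.

```python
from collections import Counter

# Hypothetical input: one record per code sample, with the concepts
# an extractor found in it (labels are illustrative).
samples = [
    {"id": "a.py", "concepts": ["import", "function", "loop", "comment"]},
    {"id": "b.java", "concepts": ["import", "class", "function"]},
    {"id": "c.go", "concepts": ["import"]},
]


def profile(records):
    """Count in how many samples each concept appears at least once."""
    counts = Counter()
    for record in records:
        counts.update(set(record["concepts"]))
    return counts


def curate(records, min_distinct_concepts):
    """Keep samples whose number of distinct concepts meets the threshold."""
    return [r for r in records if len(set(r["concepts"])) >= min_distinct_concepts]


stats = profile(samples)
kept = curate(samples, min_distinct_concepts=3)
print(stats["import"], stats["function"], [r["id"] for r in kept])
```

Here the statistics (`stats`) support inspection of the dataset, while the threshold in `curate` stands in for a derived filtering rule.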

Related issue number (if any).

pankajskku commented 1 month ago

You are checking in 100 MB in the input directory. This almost doubles the size of the cloned repository. Can we get these from somewhere else rather than storing them in the repo and, presumably, in the published wheel containing the transform?
syntactic_concept_extractor$ du -sm *
1   Makefile
1   README.md
91  input
5   python
1   ray

@daw3rd I have removed the static .so files from the PR and made sure that the tree-sitter-bindings folder is cloned from a git repository when the venv is created.

daw3rd commented 1 month ago

I still see them in the PR under input/tree-sitter-bindings. And if they are removed, how will this work when the transform is pip-installed from PyPI? Can they be downloaded at runtime?

pankajskku commented 1 month ago

@daw3rd I'm sorry, I have cleaned them up now. I decided to change the approach: the transform itself now includes utility code that clones the .so files during transform runtime. This way, it can work at runtime. Please let me know what you think.
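
A minimal sketch of what such a fetch-on-first-use utility could look like is shown here. It is only an assumption about the shape of the approach, not the code in this PR: `ensure_bindings`, `clone_bindings`, the cache location, and the repository URL parameter are all hypothetical, and the only external call used is the standard `git clone --depth 1`.

```python
import subprocess
from pathlib import Path


def clone_bindings(repo_url: str, target: Path) -> None:
    """Default fetcher: shallow-clone the bindings repository with git."""
    subprocess.run(
        ["git", "clone", "--depth", "1", repo_url, str(target)],
        check=True,  # raise CalledProcessError if the clone fails
    )


def ensure_bindings(repo_url: str, cache_dir: Path, fetch=clone_bindings) -> Path:
    """Return the local bindings directory, fetching it only on first use."""
    target = cache_dir / "tree-sitter-bindings"
    if not target.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        fetch(repo_url, target)
    return target
```

Keeping the fetcher injectable means the caching logic can be tested without network access, and the clone cost is paid only on the first run on a given machine.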

pankajskku commented 1 month ago

@daw3rd Please let me know your opinion on the updated PR.

IBM / data-prep-kit

Transform for Code Profiling #646
