IBM / data-prep-kit

Open source project for data preparation for LLM application builders
https://ibm.github.io/data-prep-kit/
Apache License 2.0

[Feature] Add functionality to organise files by repository structure #283

Open Bytes-Explorer opened 4 months ago

Bytes-Explorer commented 4 months ago


Component

Transforms/Other

Feature

Add a new module to organise code files using information from the structure of a repo.


shivdeep-singh-ibm commented 4 months ago

To develop the repo-level ordering transform for data-prep-kit, we need the following approach.

So the above algorithm has two stages:

In this specific use case, our store is only written to in Stage 1 and only read from in Stage 2.

So we have three approaches to implementing the store.

For processing large data, as in our case, we can go with the second approach.

Param-S commented 4 months ago

@shivdeep-singh-ibm I agree that option i is constrained by the Ray object store memory, and option iii introduces a new service requirement along with the associated service setup. IMHO option ii is the best solution for this multi-stage processing, though we may see multiple reads and writes to external storage.
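The write-only Stage 1 and read-only Stage 2 access pattern discussed above can be sketched as follows. This is only an illustration, not the project's implementation: it assumes option ii means a store backed by external storage (here, a local directory), and the class `FileBackedStore`, its methods, and the idea of recording input file names per repository are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


class FileBackedStore:
    """Hypothetical store: one JSON-lines file per repository name."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, repo_name):
        # Flatten "org/repo" into a single file name.
        # Assumption: repository names never contain "__".
        return self.root / (repo_name.replace("/", "__") + ".jsonl")

    def put(self, repo_name, value):
        # Stage 1: append-only writes, one JSON record per line.
        # Concurrent writers are not handled in this sketch.
        with self._path(repo_name).open("a", encoding="utf-8") as f:
            f.write(json.dumps(value) + "\n")

    def keys(self):
        # Stage 2: list every repository seen during Stage 1.
        return sorted(p.stem.replace("__", "/") for p in self.root.glob("*.jsonl"))

    def get(self, repo_name):
        # Stage 2: read back everything recorded for one repository.
        with self._path(repo_name).open(encoding="utf-8") as f:
            return [json.loads(line) for line in f]


with tempfile.TemporaryDirectory() as tmp:
    store = FileBackedStore(tmp)
    # Stage 1: record which (made-up) input files hold rows of each repo.
    store.put("org/repo-a", "part-0.parquet")
    store.put("org/repo-b", "part-0.parquet")
    store.put("org/repo-a", "part-1.parquet")
    # Stage 2: read-only pass over the store.
    stage2_view = {repo: store.get(repo) for repo in store.keys()}
```

Because nothing is read until all writes are done, the store needs no in-memory index, which is what keeps this approach independent of the Ray object store size. The cost is the extra reads and writes to external storage mentioned above.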

blublinsky commented 4 months ago

I am sorry, I am missing something here. What exactly are we trying to produce here?

shivdeep-singh-ibm commented 4 months ago

@blublinsky There is a requirement for a code transform that runs a groupby on the data by the repo_name column and then runs a sorting algorithm (semantic_sort or sort_by_filename) on the grouped data. It writes the output as one parquet file per repo.
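The groupby-then-sort step described above can be sketched in plain Python. This is a minimal illustration under stated assumptions: rows are plain dicts, the `filename` column and the function `order_by_repo` are hypothetical, `sort_by_filename` here is a simple lexicographic stand-in for the option of the same name, and the final write of each group to its own parquet file is left out.

```python
from collections import defaultdict


def sort_by_filename(rows):
    # Stand-in sorting algorithm: lexicographic order on an assumed
    # "filename" column. A semantic sort could be passed in instead.
    return sorted(rows, key=lambda row: row["filename"])


def order_by_repo(rows, sorting_algorithm=sort_by_filename):
    # Group rows by the repo_name column, then sort each group.
    groups = defaultdict(list)
    for row in rows:
        groups[row["repo_name"]].append(row)
    # In the transform described above, each group would then be
    # written out as one parquet file per repository.
    return {repo: sorting_algorithm(group) for repo, group in groups.items()}


# Made-up example rows.
rows = [
    {"repo_name": "org/b", "filename": "src/main.py"},
    {"repo_name": "org/a", "filename": "z.py"},
    {"repo_name": "org/a", "filename": "a.py"},
]
ordered = order_by_repo(rows)
filenames_a = [row["filename"] for row in ordered["org/a"]]
```

Passing the sorting algorithm as a parameter keeps the grouping logic independent of whether files are ordered by name or by some semantic criterion.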

blublinsky commented 4 months ago

So basically it creates a single arrow table per repository, right? And how does it know the repository name? Is it a separate column? Final question: should it be part of code2parquet?

shivdeep-singh-ibm commented 3 months ago


Yes, it needs a repository name, repo_name, which is expected to be in the data. As of now this feature is not in code2parquet, but I think the column should somehow come from it.

shivdeep-singh-ibm commented 3 months ago

The other issues linked to this development are: