[Feature] Add functionality to organise files by repository structure

Bytes-Explorer commented 4 months ago

Search before asking

[X] I searched the issues and found no similar issues.

Component

Transforms/Other

Feature

Add a new module to organise code files using information from the structure of a repo.

Are you willing to submit a PR?

[ ] Yes I am willing to submit a PR!

shivdeep-singh-ibm commented 4 months ago

To develop the repo-level ordering transform for data-prep-kit, we need the following approach.

Iterate over all files, run a groupby, and build a list of all repos.
For each repo, write the repo name and the list of files in which it is found to a store.
The store here is a data structure similar to a key/value store: [str, List[str]].
After all files are processed and the store is populated, iterate over the keys of the store. For each repo key, read the files listed for that key, filter the rows by that repo, and save only those rows to the output.

So the above algorithm has two stages:

Stage 1: Populate the store with the list of files for each repo.
Stage 2: Read the list of files from the store and filter by repo.

In this use case, the store is only written to in Stage 1 and only read from in Stage 2.
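
The two stages above can be sketched in memory as follows. This is only an illustration under assumptions: the `FILES` dict stands in for the input parquet files, the `path` column and the helper names `build_store` and `rows_for_repo` are hypothetical, and a plain dict plays the role of the store.

```python
from collections import defaultdict

# Hypothetical stand-in for the input parquet files: file name -> rows.
FILES = {
    "part-0.parquet": [
        {"repo_name": "org/a", "path": "a/x.py"},
        {"repo_name": "org/b", "path": "b/y.py"},
    ],
    "part-1.parquet": [
        {"repo_name": "org/a", "path": "a/z.py"},
    ],
}


def build_store(files):
    """Stage 1: for every repo, record the files that contain its rows."""
    store = defaultdict(list)
    for file_name, rows in files.items():
        # One entry per (repo, file) pair, however many rows match.
        for repo in sorted({row["repo_name"] for row in rows}):
            store[repo].append(file_name)
    return dict(store)


def rows_for_repo(files, store, repo):
    """Stage 2: read only the files listed for this repo and keep its rows."""
    return [
        row
        for file_name in store[repo]
        for row in files[file_name]
        if row["repo_name"] == repo
    ]


store = build_store(FILES)
per_repo = {repo: rows_for_repo(FILES, store, repo) for repo in store}
```

Because Stage 2 only opens the files listed under a key, a repo that appears in a few files does not require a scan of the whole dataset.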

So we have three approaches to implementing the store:

1. Using the Ray object store (multi-actor): uses the memory of the nodes and is constrained by the memory available on the cluster and by the network.
2. Using a filesystem/S3 as the backend of the store (folders as keys and files as the list of values): constrained by the network.
3. Using an external store, such as etcd.

For processing large data, as in our case, we can go with the second approach.
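
A minimal sketch of the second approach on a local filesystem, assuming one folder per repo key and one empty marker file per value. The `FolderStore` class and its method names are hypothetical, not part of data-prep-kit, and an S3 backend would need its own client code.

```python
import tempfile
from pathlib import Path
from urllib.parse import quote, unquote


class FolderStore:
    """Hypothetical key/value store: one folder per key, one marker file per value."""

    def __init__(self, root):
        self.root = Path(root)

    def put(self, repo, file_name):
        # Percent-encode so a name like "org/repo" becomes a single path component.
        folder = self.root / quote(repo, safe="")
        folder.mkdir(parents=True, exist_ok=True)
        (folder / quote(file_name, safe="")).touch()

    def keys(self):
        return sorted(unquote(p.name) for p in self.root.iterdir() if p.is_dir())

    def get(self, repo):
        folder = self.root / quote(repo, safe="")
        return sorted(unquote(p.name) for p in folder.iterdir())


with tempfile.TemporaryDirectory() as tmp:
    store = FolderStore(tmp)
    # Stage 1: writes only.
    store.put("org/a", "part-0.parquet")
    store.put("org/a", "part-1.parquet")
    store.put("org/b", "part-0.parquet")
    store.put("org/a", "part-0.parquet")  # writing the same pair again changes nothing
    # Stage 2: reads only.
    repos = store.keys()
    files_for_a = store.get("org/a")
```

Since each value is its own marker file, a write never has to read and rewrite a shared list, which fits the write-only Stage 1 and read-only Stage 2 pattern described above.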

Param-S commented 4 months ago

@shivdeep-singh-ibm I agree that option i is constrained by the Ray object store memory, and option iii introduces a new service requirement, with the service setup and so on that comes with it. IMHO option ii is the best solution for this multi-stage processing, though we may see multiple reads and writes to external storage.

blublinsky commented 4 months ago

I am sorry, I am missing something here. What exactly are we trying to produce here?

shivdeep-singh-ibm commented 4 months ago

@blublinsky There is a code transform requirement which runs a groupby on the data by the repo_name column and then runs a sorting_algorithm (semantic_sort or sort_by_filename) on the grouped data. It writes the output data as one parquet file per repo.
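
As a rough illustration of that requirement, the sketch below groups rows by `repo_name` and applies a pluggable sorting function to each group. The `path` column, the sample rows, and the `group_and_order` helper are assumptions; only a simple `sort_by_filename` is shown, and each dict entry stands in for one output parquet file.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample rows; real input would come from parquet files.
rows = [
    {"repo_name": "org/b", "path": "src/main.py"},
    {"repo_name": "org/a", "path": "lib/util.py"},
    {"repo_name": "org/a", "path": "README.md"},
]


def sort_by_filename(repo_rows):
    """Order the rows of one repo by file path (plain string comparison)."""
    return sorted(repo_rows, key=itemgetter("path"))


def group_and_order(rows, sorting_algorithm=sort_by_filename):
    """Group rows by repo_name, then order each group with sorting_algorithm."""
    # itertools.groupby only merges adjacent items, so sort by the key first.
    by_repo = sorted(rows, key=itemgetter("repo_name"))
    return {
        repo: sorting_algorithm(list(group))
        for repo, group in groupby(by_repo, key=itemgetter("repo_name"))
    }


# One entry per repo, standing in for one output parquet file per repo.
outputs = group_and_order(rows)
```

Any other ordering, such as a semantic sort, could be passed in as `sorting_algorithm` without changing the grouping step.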

blublinsky commented 4 months ago

So basically it creates a single arrow table per repository, right? And how does it know the repository name? Is it a separate column? Final question: should it be part of code2parquet?

shivdeep-singh-ibm commented 3 months ago

> So basically it creates a single arrow table per repository, right? And how does it know the repository name? is it a separate column? Final question. Should it be part of code2parquet

Yes, it needs a repository name, repo_name, which is expected to be in the data. As of now this feature is not in code2parquet, but I think it should somehow come from it.

shivdeep-singh-ibm commented 3 months ago

The other issues linked to this development are:

IBM / data-prep-kit