Open Bytes-Explorer opened 4 months ago
To develop the Repo-Level Ordering transform for data-prep-kit, we require the following approach.
`store` in the above line represents a data structure similar to a key/value store: `[str, List[str]]`. We need to iterate over the keys of the `store`, and for each repo/key we need to read the files corresponding to that repo key, filter by `key.repo`, and save only those rows to the output.
So the above algorithm has two stages:
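A minimal sketch of the two stages, under assumptions that are not in the thread: input files are modelled as in-memory lists of row dicts (hypothetical `FILES`), the repo column is `repo_name`, and the helper names `build_store` and `collect_repo_rows` are illustrative only. Stage 1 only writes to the `store`; Stage 2 only reads from it.

```python
from collections import defaultdict
from typing import Dict, List

Row = Dict[str, str]

# Hypothetical stand-in for the input files: file name -> rows in that file.
FILES: Dict[str, List[Row]] = {
    "part-0.parquet": [
        {"repo_name": "org/a", "path": "a/main.py"},
        {"repo_name": "org/b", "path": "b/lib.py"},
    ],
    "part-1.parquet": [
        {"repo_name": "org/a", "path": "a/util.py"},
    ],
}


def build_store(files: Dict[str, List[Row]]) -> Dict[str, List[str]]:
    """Stage 1 (writes only): record, per repo, which files contain its rows."""
    store: Dict[str, List[str]] = defaultdict(list)
    for file_name, rows in files.items():
        # Each file is listed once per repo, even if it holds many rows of it.
        for repo in sorted({row["repo_name"] for row in rows}):
            store[repo].append(file_name)
    return dict(store)


def collect_repo_rows(
    store: Dict[str, List[str]], files: Dict[str, List[Row]]
) -> Dict[str, List[Row]]:
    """Stage 2 (reads only): for each repo key, read its files and keep its rows."""
    output: Dict[str, List[Row]] = {}
    for repo, file_names in store.items():
        output[repo] = [
            row
            for name in file_names
            for row in files[name]
            if row["repo_name"] == repo
        ]
    return output


store = build_store(FILES)
per_repo = collect_repo_rows(store, FILES)
```

In this sketch a file that contains rows from several repos is read once per repo in Stage 2, which is the cost of keeping only file names (not rows) in the `store`.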
Our `store` in this specific use case has only writes in Stage 1, and only reads in Stage 2.
So we can have 3 approaches to implement the `store`.
For processing large data, as in our case, we can go for the 2nd approach.
@shivdeep-singh-ibm I agree that option i is constrained by the Ray object store memory, and option iii introduces a new service requirement with its associated service setup, etc. IMHO option ii is the best solution for this multi-stage processing, though we may see multiple reads & writes to external storage.
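To illustrate what a `store` kept on external storage might look like, here is a rough sketch that assumes option ii means a folder on a shared filesystem. The class `FolderStore` and its methods are hypothetical and not part of data-prep-kit. Each key becomes one file, and each `put` appends one line, which matches the write-only Stage 1 and read-only Stage 2 access pattern.

```python
import tempfile
from pathlib import Path
from typing import Iterator, List
from urllib.parse import quote, unquote


class FolderStore:
    """Hypothetical append-only store: key -> list of strings, one file per key."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, key: str) -> Path:
        # Repo names contain "/", so percent-encode the key into a flat file name.
        return self.root / quote(key, safe="")

    def put(self, key: str, value: str) -> None:
        """Stage 1: append one value (e.g. a file name) under the key."""
        with self._path(key).open("a", encoding="utf-8") as fh:
            fh.write(value + "\n")

    def keys(self) -> Iterator[str]:
        """Stage 2: iterate over all keys that were written."""
        for entry in sorted(self.root.iterdir()):
            yield unquote(entry.name)

    def get(self, key: str) -> List[str]:
        """Stage 2: read back every value stored under the key."""
        return self._path(key).read_text(encoding="utf-8").splitlines()


folder_store = FolderStore(Path(tempfile.mkdtemp()))
folder_store.put("org/a", "part-0.parquet")
folder_store.put("org/b", "part-0.parquet")
folder_store.put("org/a", "part-1.parquet")
repos = list(folder_store.keys())
```

This sketch does not address concurrent appends to the same key from multiple workers; a real implementation would need a strategy for that, for example one file per worker per key.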
I am sorry, I am missing something here. What exactly are we trying to produce here?
@blublinsky There is a code transform requirement which runs a `groupby` on data with respect to the `repo_name` column and then runs a `sorting_algorithm` (`semantic_sort` or `sort_by_filename`) on the grouped data. It writes output data into 1 parquet per repo.
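The group-then-sort step could be sketched as follows. This is only an illustration in plain Python: rows are dicts, `path` is an assumed column name, and `example_sort_by_filename` is a hypothetical stand-in, since the thread does not say how `sort_by_filename` or `semantic_sort` actually behave. Writing each group to its own parquet file is left out.

```python
from itertools import groupby
from operator import itemgetter
from typing import Callable, Dict, List

Row = Dict[str, str]
SortFn = Callable[[List[Row]], List[Row]]


def example_sort_by_filename(rows: List[Row]) -> List[Row]:
    """Hypothetical sorting_algorithm: plain lexicographic order on `path`."""
    return sorted(rows, key=itemgetter("path"))


def order_by_repo(rows: List[Row], sorting_algorithm: SortFn) -> Dict[str, List[Row]]:
    """Group rows by `repo_name`, then apply the sorting algorithm to each group."""
    # itertools.groupby only merges adjacent items, so sort by the key first.
    keyed = sorted(rows, key=itemgetter("repo_name"))
    return {
        repo: sorting_algorithm(list(group))
        for repo, group in groupby(keyed, key=itemgetter("repo_name"))
    }


rows = [
    {"repo_name": "org/b", "path": "src/z.py"},
    {"repo_name": "org/a", "path": "setup.py"},
    {"repo_name": "org/b", "path": "src/a.py"},
    {"repo_name": "org/a", "path": "README.md"},
]
ordered = order_by_repo(rows, example_sort_by_filename)
```

Each value in `ordered` corresponds to the rows that would go into one output file for that repo.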
So basically it creates a single arrow table per repository, right? And how does it know the repository name? Is it a separate column? Final question: should it be part of code2parquet?
Yes, it needs a repository name, `repo_name`. It is expected to be in the data. As of now this feature is not in code2parquet, but I think it should somehow come from it.
The other issues linked to this development are:
Component: Transforms/Other
Feature: Add a new module to organise code files using information from the structure of a repo