IBM / data-prep-kit

Open source project for data preparation of LLM application builders
https://ibm.github.io/data-prep-kit/
Apache License 2.0

[Feature] need an example of using doc_quality plugin with installed pypi packages #575

Open sujee opened 2 months ago

sujee commented 2 months ago

Component

Transforms/Other

What happened + What you expected to happen

The current sample code looks for bad_word_filepath in the project directory (it assumes the code is run from the source tree).

Currently this file is in: transforms/language/doc_quality/ray/ldnoobw/en/

We need an example showing how to use this transform with the installed PyPI packages.

doc_quality_basedir = os.path.join(rootdir, "transforms", "language", "doc_quality", "python")
worker_options = {"num_cpus" : MY_CONFIG.RAY_NUM_CPUS}
params = {
    # where to run
    "run_locally": True,
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # orchestrator
    "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
    "runtime_num_workers": MY_CONFIG.RAY_RUNTIME_WORKERS,
    "runtime_pipeline_id": "pipeline_id",
    "runtime_job_id": "job_id",
    "runtime_creation_delay": 0,
    # doc quality configuration
    text_lang_cli_param: "en",
    doc_content_column_cli_param: "contents",
    bad_word_filepath_cli_param: os.path.join(doc_quality_basedir, "ldnoobw", "en"),
}

I have the following packages installed:

data_prep_toolkit                0.2.1.dev2
data_prep_toolkit_ray            0.2.1.dev2
data_prep_toolkit_transforms     0.2.1.dev2
data_prep_toolkit_transforms_ray 0.2.1.dev2

Reproduction script

https://github.com/IBM/data-prep-kit/blob/dev/examples/notebooks/rag/rag_1A_dpk_process_ray.ipynb

Step 7

Anything else

No response

OS

Ubuntu

Python

3.11.x

dtsuzuku-ibm commented 1 month ago

I might be misunderstanding something, but if the request is to include the bad-words file in the PyPI package, that sounds odd to me. The bad-words file is something the user of doc_quality should prepare, so it seems natural for the user to specify the path to that file in their own project directory.

sujee commented 2 weeks ago

No need to publish the 'bad word files' to PyPI. But can we give a URL to an accessible bad-words file? We could point to our example on GitHub (https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/doc_quality/ray/ldnoobw) or to any other open source one, so the user can download it and use it locally.

shahrokhDaijavad commented 1 week ago

@sujee I am not sure whether you are just asking a question or if you want @dtsuzuku-ibm to make any changes in his code. If it is the former, e.g., you want to use this transform in a Colab notebook and you have no access to the local directory, you can specify the filepath as a parameter and use what we have in the ldnoobw directory of our repo. The files in this directory are all publicly available, i.e., they are open source. Downloading them from our repo or other open-source URLs doesn't make a difference. If you are suggesting a code change, can you be more specific? Thanks.

sujee commented 1 week ago

no code change necessary, just to be clear :-)

I will work on an example showcasing:

  1. downloading the bad-words files from a location (could be ours or any other sources)
  2. using it with the transform.

For (1), are there published 'bad words files' we can access?
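
A minimal sketch of the two steps above, using only the Python standard library. The raw-content URL is an assumption derived from the ldnoobw directory linked earlier in this thread and has not been verified here. The helper names (bad_words_url, download_bad_words) are hypothetical and are not part of data-prep-kit.

```python
import os
import urllib.request

# Assumption: raw-content form of the ldnoobw directory linked in this thread.
# Not verified; replace it with whichever bad-words source you actually use.
LDNOOBW_BASE_URL = (
    "https://raw.githubusercontent.com/IBM/data-prep-kit/dev/"
    "transforms/language/doc_quality/ray/ldnoobw"
)


def bad_words_url(lang: str, base_url: str = LDNOOBW_BASE_URL) -> str:
    """Build the URL of the bad-words file for one language code, e.g. 'en'."""
    return f"{base_url.rstrip('/')}/{lang}"


def download_bad_words(url: str, dest_dir: str, filename: str) -> str:
    """Download a bad-words file into dest_dir and return its local path."""
    os.makedirs(dest_dir, exist_ok=True)
    dest_path = os.path.join(dest_dir, filename)
    # Reuse a previously downloaded copy instead of fetching it again.
    if not os.path.exists(dest_path):
        with urllib.request.urlopen(url) as response:
            data = response.read()
        with open(dest_path, "wb") as out_file:
            out_file.write(data)
    return dest_path
```

With this in place, step (2) would be to pass the returned path to the transform, for example by setting bad_word_filepath_cli_param to download_bad_words(bad_words_url("en"), "ldnoobw", "en") in the params dictionary shown in the issue description, instead of building a path under the source tree.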