Open sujee opened 2 months ago
I might be misunderstanding something, but if the request is to include the bad-word file in the PyPI package, that seems odd to me. The bad-word file is something the user of doc_quality should prepare, so it seems natural for the user to specify the path to that file in their own project directory.
No need to publish the bad-word files to PyPI. But can we provide a URL to an accessible bad-words file? We could point to our example on GitHub (https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/doc_quality/ray/ldnoobw) or to any other open-source one, so that users can download it and use it locally.
@sujee I am not sure whether you are just asking a question or if you want @dtsuzuku-ibm to make any changes in his code. If it is the former, e.g., you want to use this transform in a Colab notebook and you have no access to the local directory, you can specify the filepath as a parameter and use what we have in the ldnoobw directory of our repo. The files in this directory are all publicly available, i.e., they are open source. Downloading them from our repo or other open-source URLs doesn't make a difference. If you are suggesting a code change, can you be more specific? Thanks.
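A minimal sketch of that workflow, for anyone running outside the source tree: download a bad-words list once, cache it locally, and pass the local path to the transform. The helper names `fetch_badwords_file` and `load_badwords` are hypothetical and not part of data-prep-kit, and the URL is left for the user to supply (for example, the raw link of a file under the ldnoobw directory). The one-term-per-line format is also an assumption.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_badwords_file(url: str, dest_dir: str = "badwords") -> Path:
    """Download a bad-words list from `url` into `dest_dir`; return the local path.

    Hypothetical helper, not a data-prep-kit API.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Name the local copy after the last path segment of the URL.
    name = url.rstrip("/").rsplit("/", 1)[-1] or "badwords.txt"
    local_path = dest / name
    if not local_path.exists():  # reuse a previously downloaded copy
        with urllib.request.urlopen(url) as resp, open(local_path, "wb") as out:
            shutil.copyfileobj(resp, out)
    return local_path


def load_badwords(path) -> set[str]:
    """Read one term per line (assumed format), skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}
```

The path returned by `fetch_badwords_file` can then be supplied as the bad-word file path parameter of the transform (the sample code refers to it as bad_word_filepath); `load_badwords` is only there to sanity-check the downloaded file.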
no code change necessary, just to be clear :-)
I will work on an example showcasing:
For (1), are there published bad-word files we can access?
Component
Transforms/Other
What happened + What you expected to happen
The current sample code looks for bad_word_filepath in the project directory (it assumes the code is run from the source tree).
The file currently lives in:
transforms/language/doc_quality/ray/ldnoobw/en/
We need an example showing how to use this transform with the PyPI packages.
I have the following packages installed
Reproduction script
https://github.com/IBM/data-prep-kit/blob/dev/examples/notebooks/rag/rag_1A_dpk_process_ray.ipynb
Step 7
Anything else
No response
OS
Ubuntu
Python
3.11.x