IBM / data-prep-kit

Open source project for data preparation of LLM application builders
https://ibm.github.io/data-prep-kit/
Apache License 2.0

Restructure Html2Parquet with its own dpk_ namespace #809

Open touma-I opened 4 days ago

touma-I commented 4 days ago

Why are these changes needed?

This is the first in a series of restructuring changes intended to have each transform built as its own module (e.g. dpk_html2parquet) with a ray submodule (dpk_html2parquet.ray).

Related issue number (if any).

https://github.com/IBM/data-prep-kit/issues/774

roytman commented 3 days ago

If each transformer builds its own module, should we add __init__.py files and create a unified namespace? For example, instead of from dpk_html2parquet.transform import Html2ParquetTransformConfiguration, use from dpk_html2parquet import Html2ParquetTransformConfiguration.

The same applies to the ray runtime.
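A minimal sketch of the re-export being suggested, assuming a package laid out as dpk_html2parquet/__init__.py plus dpk_html2parquet/transform.py. The package is built in a temporary directory and the class body is a placeholder, not the real data-prep-kit code; only the module and class names come from the comment above.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk that mimics the proposed layout.
# The class body is a stand-in, not the real data-prep-kit implementation.
root = Path(tempfile.mkdtemp())
pkg = root / "dpk_html2parquet"
pkg.mkdir()

(pkg / "transform.py").write_text(
    "class Html2ParquetTransformConfiguration:\n"
    "    pass\n"
)

# __init__.py re-exports the public name so callers can skip the submodule.
(pkg / "__init__.py").write_text(
    "from .transform import Html2ParquetTransformConfiguration\n"
    "__all__ = ['Html2ParquetTransformConfiguration']\n"
)

# Make the temporary package importable for this demonstration only.
sys.path.insert(0, str(root))
importlib.invalidate_caches()

from dpk_html2parquet import Html2ParquetTransformConfiguration as short
from dpk_html2parquet.transform import Html2ParquetTransformConfiguration as long

# Both import paths resolve to the same class object.
same_class = short is long
```

With this arrangement the longer import path keeps working, so existing callers would not break while new code uses the shorter form.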

touma-I commented 3 days ago

If we are doing such a massive refactoring, should we combine test and test-data into one dir, e.g.

  • test
    • data
      • input
      • expected

@roytman Why not leave it to the transform owner/developer to decide whether they want to nest the test-data under test? All we care about is that there is a test folder for running pytest, no? Where the developer puts their data is up to them, no?
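As a rough illustration of why either layout can work: a test can resolve its data directory relative to its own file, so only the relative path changes between the two options. The helper name resolve_data_dir and the file name test_html2parquet.py are hypothetical, and the folders are created in a temporary directory purely for the demonstration.

```python
import tempfile
from pathlib import Path


def resolve_data_dir(test_file: Path, relative: str) -> Path:
    """Return the data directory located relative to the test module's folder."""
    return (test_file.parent / relative).resolve()


# Simulate both layouts inside a temporary transform folder.
root = Path(tempfile.mkdtemp())
test_file = root / "test" / "test_html2parquet.py"  # hypothetical test module
(root / "test" / "data" / "input").mkdir(parents=True)  # data nested under test
(root / "test-data" / "input").mkdir(parents=True)      # data kept as a sibling

nested = resolve_data_dir(test_file, "data/input")
sibling = resolve_data_dir(test_file, "../test-data/input")
```

Inside a real test module, Path(__file__) would take the place of the simulated test_file.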

touma-I commented 3 days ago

If each transformer builds its own module, should we add __init__.py files and create a unified namespace? For example, instead of from dpk_html2parquet.transform import Html2ParquetTransformConfiguration, use from dpk_html2parquet import Html2ParquetTransformConfiguration.

The same applies to the ray runtime.

How did I miss that? Done. Thanks @roytman