IBM / data-prep-kit

Open source project for data preparation for LLM application builders
https://ibm.github.io/data-prep-kit/
Apache License 2.0

[Feature] add an example of html2pq in the documentation #788

Open sujee opened 6 days ago

sujee commented 6 days ago

Search before asking

  • [x] I searched the issues and found no similar issues.

Component

Tools/ingest2parquet

Feature

https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/html2parquet/python/README.md

shows input HTML and output MD, but doesn't have any sample code :smile:

We should provide sample code.

Are you willing to submit a PR?

  • [x] Yes I am willing to submit a PR!

Bytes-Explorer commented 6 days ago

Sample code is available here https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/html2parquet/python/src/html2parquet_local.py

However, there seems to be some issue with it.

@sungeunan-ibm When I try to run the test files, I get the error below. Can you please check?

(data-prep-kit-html) himapatel@Himas-MacBook-Pro-2 src % python html2parquet_local.py
Traceback (most recent call last):
 File "/Users/himapatel/Work/Projects/MCD/OpenSource/html-test/data-prep-kit/transforms/language/html2parquet/python/src/html2parquet_local.py", line 15, in <module>
  from data_processing.data_access import DataAccessLocal
ModuleNotFoundError: No module named 'data_processing'

Bytes-Explorer commented 6 days ago

Please also see the issue in this notebook https://github.com/sujee/data-prep-kit/blob/html-processing-1/examples/notebooks/html-processing/2_process_html_python.ipynb

touma-I commented 6 days ago

@Bytes-Explorer It does not look like you set up the environment properly:

cd transforms/language/html2parquet/python
make venv
source venv/bin/activate
python src/html2parquet_local.py
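A quick way to tell whether the active environment is the one built by the steps above is to check that the modules the script imports can be found. This is only a minimal sketch using the standard library; the helper name `missing_modules` is hypothetical, and the two module names are the ones that appear in this thread.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Module names taken from the imports shown in this thread.
required = ["data_processing", "html2parquet_transform_python"]

missing = missing_modules(required)
if missing:
    print(f"Not importable: {missing}. Activate the transform's venv and try again.")
else:
    print("All required modules are importable.")
```

If `data_processing` is reported as not importable, running the script will fail with the same ModuleNotFoundError shown in the traceback.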

Bytes-Explorer commented 6 days ago

Yes, right. I will try again after building the environment.

This is another reason why we should simplify and everything should happen out of pip install. I am glad we are on that journey.

touma-I commented 6 days ago

@sujee you should be able to use this transform very much like you use pdf2parquet. The only caveat is that the two cannot be installed together: only one of pdf2parquet or html2parquet can be installed in a given environment.

from data_processing.runtime.pure_python import PythonTransformLauncher
from html2parquet_transform_python import Html2ParquetPythonTransformConfiguration

launcher = PythonTransformLauncher(Html2ParquetPythonTransformConfiguration())
launcher.launch()
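The launcher above takes its settings from the command line, so a notebook user would typically fill in `sys.argv` before calling it. The following is a rough sketch of that step, not a confirmed recipe: the `ParamsUtils` helper and the parameter names `data_local_config` and `data_files_to_use` are assumptions based on the pattern used by the repository's local examples and are not stated in this thread, and the function names are hypothetical.

```python
import sys


def local_data_config(input_folder, output_folder):
    """Build the local data-access settings: where HTML is read and Parquet is written."""
    return {"input_folder": input_folder, "output_folder": output_folder}


def run_html2parquet(input_folder, output_folder):
    """Run the transform over a local folder (requires the html2parquet environment)."""
    # Imported lazily so this file can be loaded where the transform is not installed.
    from data_processing.runtime.pure_python import PythonTransformLauncher
    from data_processing.utils import ParamsUtils  # assumed helper, see the note above
    from html2parquet_transform_python import Html2ParquetPythonTransformConfiguration

    params = {
        # Assumed parameter names; check them against the transform's README.
        "data_local_config": ParamsUtils.convert_to_ast(
            local_data_config(input_folder, output_folder)
        ),
        "data_files_to_use": [".html"],
    }
    # The launcher parses command-line options, so emulate a command line here.
    sys.argv = ParamsUtils.dict_to_req(d=params)
    launcher = PythonTransformLauncher(Html2ParquetPythonTransformConfiguration())
    return launcher.launch()
```

With the environment set up, a call such as `run_html2parquet("input", "output")` would be the entry point; both folder names are placeholders.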

daw3rd commented 6 days ago

We seem to be mixing use cases here.

  1. Notebook-based user of a transform
  2. transforms/html2parquet-based developer of a transform.

For 1, agreed pip install should be used. For 2, make venv should be used.

touma-I commented 6 days ago

You can use pip install if you want. I was simply responding to your comment, since you seemed to be trying to run the test example from the test folder.

touma-I commented 6 days ago

@sujee @Bytes-Explorer changing this from Bug to Documentation. @shahrokh Where are we capturing the documentation for something like this? Would it make sense to have it in the README.md for the transform?

touma-I commented 6 days ago

Agreed. @sungeunan-ibm, could you please update your README.md to include an example showing how a notebook user would use your transform? Please reach out if you need help with this. Thanks.

matouma commented 21 hours ago

@sujee Can you attach some sample HTML to this issue? Just one or two HTML files.

sujee commented 21 hours ago

@touma-I this is not tied to any particular HTML input.

I just need sample Python code to transform HTML --> MD.

matouma commented 21 hours ago

@sujee I understand. I just need any HTML file.

sujee commented 21 hours ago

Here is a sample HTML file:

ai-alliance-index.html.txt

(I had to add .txt extension to html file, so I can attach here)
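Since the attachment carries an extra .txt extension, it would need its original name back before being used as HTML input. A small sketch of that rename step, using only the standard library; the helper name `restore_html_name` is hypothetical.

```python
from pathlib import Path


def restore_html_name(path):
    """Drop a trailing .txt from names like 'page.html.txt'; leave other names unchanged."""
    p = Path(path)
    if p.name.lower().endswith(".html.txt"):
        return p.with_suffix("")  # removes only the last suffix (.txt)
    return p


# The file name attached to this issue.
print(restore_html_name("ai-alliance-index.html.txt"))
```

To rename the file on disk, something like `p.rename(restore_html_name(p))` on the downloaded `Path` should be enough.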

touma-I commented 20 hours ago

@sujee see here. https://github.com/touma-I/data-prep-kit-pkg/blob/html2parquet-example/transforms/language/html2parquet/notebooks/html2parquet.ipynb

touma-I commented 20 hours ago

@sungeunan-ibm When you are back, please see how I did it and let me know if we need to change anything. https://github.com/touma-I/data-prep-kit-pkg/blob/html2parquet-example/transforms/language/html2parquet/notebooks/html2parquet.ipynb cc: @shahrokhDaijavad

sujee commented 13 hours ago

Just add a link to this notebook in the html2pq README so it's easy to find.

Great work @touma-I :clap: