Open sujee opened 6 days ago
Sample code is available here https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/html2parquet/python/src/html2parquet_local.py
However, there seems to be some issue with it
@sungeunan-ibm When I try to run the test files, I get the error below. Can you please check?
(data-prep-kit-html) himapatel@Himas-MacBook-Pro-2 src % python html2parquet_local.py
Traceback (most recent call last):
  File "/Users/himapatel/Work/Projects/MCD/OpenSource/html-test/data-prep-kit/transforms/language/html2parquet/python/src/html2parquet_local.py", line 15, in <module>
    from data_processing.data_access import DataAccessLocal
ModuleNotFoundError: No module named 'data_processing'
Please also see the issue in this notebook https://github.com/sujee/data-prep-kit/blob/html-processing-1/examples/notebooks/html-processing/2_process_html_python.ipynb
@Bytes-Explorer It does not look like you set up the environment properly:
cd transforms/language/html2parquet/python
make venv
source venv/bin/activate
python src/html2parquet_local.py
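To confirm that the setup steps above took effect before re-running the script, a small check like the following can help. It is only a sketch: the function names are made up for illustration, and the only fact taken from this thread is that the script imports the `data_processing` package.

```python
import importlib.util
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the venv while sys.base_prefix
    # points at the interpreter the venv was created from.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def missing_modules(names):
    """Return the top-level module names that cannot be found by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def describe_environment(required=("data_processing",)):
    """Build a short, human-readable report about the current environment."""
    lines = [
        f"interpreter: {sys.executable}",
        f"virtualenv active: {in_virtualenv()}",
    ]
    missing = missing_modules(required)
    if missing:
        lines.append("missing modules: " + ", ".join(missing))
    else:
        lines.append("all required modules found")
    return "\n".join(lines)


if __name__ == "__main__":
    print(describe_environment())
```

If the report lists `data_processing` as missing, the interpreter in use is probably not the one from the `venv` created by `make venv`.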
Yes, right. I will try again after building the environment.
This is another reason why we should simplify and everything should happen out of pip install. I am glad we are on that journey.
@sujee you should be able to use this transform very much like you use pdf2parquet. The only caveat is that they cannot be installed together: either pdf2parquet or html2parquet can be installed in your environment, not both.
from data_processing.runtime.pure_python import PythonTransformLauncher
from html2parquet_transform_python import Html2ParquetPythonTransformConfiguration
launcher = PythonTransformLauncher(Html2ParquetPythonTransformConfiguration())
launcher.launch()
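The snippet above does not show how the launcher learns where the input and output folders are. One possible way, sketched below, is to pass them as command-line style arguments through `sys.argv` before calling `launch()`. The flag names `--data_local_config` and `--data_files_to_use`, and the assumption that `launch()` reads `sys.argv`, are not confirmed anywhere in this thread, so please check them against the transform's README or the local example script. The helper names are hypothetical.

```python
import sys


def build_launcher_argv(input_folder: str, output_folder: str) -> list[str]:
    """Build a sys.argv-style list for the launcher.

    The flag names used here are assumptions, not confirmed in this thread.
    """
    local_conf = {"input_folder": input_folder, "output_folder": output_folder}
    return [
        "html2parquet",  # placeholder program name
        f"--data_local_config={local_conf!r}",
        f"--data_files_to_use={['.html']!r}",
    ]


def run_html2parquet(input_folder: str, output_folder: str) -> None:
    """Launch the transform on a local folder (requires html2parquet to be installed)."""
    # Imported lazily so this file can be loaded without data-prep-kit installed.
    from data_processing.runtime.pure_python import PythonTransformLauncher
    from html2parquet_transform_python import Html2ParquetPythonTransformConfiguration

    # Assumption: the launcher parses its parameters from sys.argv.
    sys.argv = build_launcher_argv(input_folder, output_folder)
    launcher = PythonTransformLauncher(Html2ParquetPythonTransformConfiguration())
    launcher.launch()
```

With the transform installed, calling `run_html2parquet("input", "output")` would be the intended usage; the folder names are placeholders.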
We seem to be mixing use cases here.
For 1, agreed, pip install should be used. For 2, make venv should be used.
You can use pip install if you want. I was simply responding to your comment, as you seem to be trying to run the test example from the test folder.
@sujee @Bytes-Explorer changing this from Bug to Documentation. @shahrokh Where are we capturing the documentation for something like this? Would it make sense to have it in the Readme.md for the transform?
Search before asking
- [x] I searched the issues and found no similar issues.
Component
Tools/ingest2parquet
Feature
https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/html2parquet/python/README.md
shows input HTML and output MD. But doesn't have a sample code 😄
We should provide sample code
Are you willing to submit a PR?
- [x] Yes I am willing to submit a PR!
Agreed. @sungeunan-ibm could you please update your Readme.md with an example showing how a notebook user would use your transform? Please reach out if you need help with this. Thanks
@sujee Can you attach some sample html to this issue ? Just one or two html files.
@touma-I this is not tied to any particular html input.
Just need a sample python code to transform HTML --> MD.
@sujee I understand. I just need any html
Here is a sample html
(I had to add .txt extension to html file, so I can attach here)
@sungeunan-ibm When you are back, please see how I did it and let me know if we need to change anything. https://github.com/touma-I/data-prep-kit-pkg/blob/html2parquet-example/transforms/language/html2parquet/notebooks/html2parquet.ipynb cc: @shahrokhDaijavad
Just add a link to this notebook in html2pq README so it's linked.
Great work @touma-I :clap: