[Feature] Add support to process HTML file format

IBM / data-prep-kit

Open source project for data preparation of LLM application builders

https://ibm.github.io/data-prep-kit/

Apache License 2.0


[Feature] Add support to process HTML file format #161

Closed Bytes-Explorer closed 5 days ago

Bytes-Explorer commented 5 months ago

Search before asking

[X] I searched the issues and found no similar issues.

Component

Tools/ingest2parquet

Feature

We would like to add the ability to read HTML files and convert them to parquet files, so that they can go through other processing modules such as dedup and filtering.

A library that could be used for this: https://trafilatura.readthedocs.io/en/latest/
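To make the request concrete, here is a minimal sketch of the flow described above: read HTML files, extract their text, and collect one row per file for a parquet table. It is an illustration only, not the implementation of any existing transform. The column names `filename` and `contents` and all function names are hypothetical. The extractor uses the standard library `html.parser` as a stand-in; with trafilatura installed, `trafilatura.extract(html)` could be swapped in instead. Writing the table assumes `pyarrow` is available.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    _SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html: str) -> str:
    """Stand-in extractor (stdlib only).

    Assumption: a real pipeline would likely call trafilatura.extract(html)
    here, which also removes navigation and other page boilerplate.
    """
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def html_to_rows(paths):
    """Build one row per HTML file; the column names are hypothetical."""
    rows = []
    for path in paths:
        html = Path(path).read_text(encoding="utf-8", errors="replace")
        rows.append({"filename": Path(path).name, "contents": extract_text(html)})
    return rows


def write_parquet(rows, out_path):
    """Write the rows to a parquet file (requires pyarrow to be installed)."""
    import pyarrow as pa
    import pyarrow.parquet as pq

    pq.write_table(pa.Table.from_pylist(rows), out_path)
```

For example, `write_parquet(html_to_rows(["page1.html", "page2.html"]), "pages.parquet")` would produce a single parquet file that downstream modules could read. The file names here are placeholders.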

Are you willing to submit a PR?

[ ] Yes I am willing to submit a PR!

daw3rd commented 1 month ago

I believe this is satisfied by the html2parquet transform: https://github.com/IBM/data-prep-kit/tree/dev/transforms/universal/html2parquet. Although I wonder if it should be moved to language?

shahrokhDaijavad commented 1 month ago

@daw3rd The Python version of this has been merged, so we can close the issue. The Ray version is still in progress, and @sungeunan-ibm is working on it. As for moving it from universal to language, let's do that after the Ray version is finished.

Bytes-Explorer commented 1 month ago

@shahrokhDaijavad @daw3rd I would close this issue and open a new one for the Ray version. I agree with David that this should be moved to the language folder.