Closed Bytes-Explorer closed 5 days ago
I believe this is satisfied by the html2parquet transform: https://github.com/IBM/data-prep-kit/tree/dev/transforms/universal/html2parquet. I wonder, though, whether it should be moved to language?
@daw3rd The Python version of this has been merged, and we can close the issue. What is being done now is the Ray version that @sungeunan-ibm is working on. As for moving it from universal to language, let's do that after it is finished.
@shahrokhDaijavad @daw3rd I would close this issue and open a new one for the Ray version. I agree with David that this should be moved to the language folder.
Search before asking
Component
Tools/ingest2parquet
Feature
We would like to add the ability to read HTML files and convert them to Parquet files, which can then go through other processing modules such as dedup and filtering.
A library that can be used for this: https://trafilatura.readthedocs.io/en/latest/
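To make the request concrete, here is a rough sketch of the flow: extract the main text from each HTML document and collect one row per file, ready to be written to Parquet. It is not the html2parquet implementation. The column names `filename` and `contents`, the helper names, and the stdlib fallback extractor are all assumptions made for illustration. It uses `trafilatura.extract` when trafilatura is installed and `pyarrow` for the Parquet write.

```python
from html.parser import HTMLParser

try:
    import trafilatura  # optional third-party extractor suggested above
except ImportError:
    trafilatura = None


class _TextExtractor(HTMLParser):
    """Stdlib fallback: collect visible text, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the text of an HTML string (trafilatura first, fallback second)."""
    if trafilatura is not None:
        text = trafilatura.extract(html)  # returns None if nothing was extracted
        if text:
            return text
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def html_to_rows(docs):
    """Map {filename: html} to a list of row dicts (hypothetical schema)."""
    return [{"filename": name, "contents": extract_text(html)}
            for name, html in docs.items()]


def write_parquet(rows, path):
    """Write the rows to a Parquet file; requires pyarrow."""
    import pyarrow as pa
    import pyarrow.parquet as pq

    pq.write_table(pa.Table.from_pylist(rows), path)
```

Calling `write_parquet(html_to_rows({"a.html": html_string}), "out.parquet")` would produce a file that downstream modules such as dedup or filtering could read, assuming they expect a text column like `contents`.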
Are you willing to submit a PR?