Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

feat/Extract images in partition_html #3050

Open jiarongkoh opened 6 months ago

jiarongkoh commented 6 months ago

Is your feature request related to a problem? Please describe. I process HTML files using the partition_html function. However, I noticed that this function can extract Tables as elements, but not Images.

Describe the solution you'd like I would like partition_html to be able to extract Images, like how shared.PartitionParameters is able to.

Describe alternatives you've considered I have tried passing the same HTML file through shared.PartitionParameters, but this also does not extract Images. One alternative I explored was converting the HTML file to PDF. While this might be possible, there is no guarantee that the converted file will yield the same output as the original HTML.

Additional context nil

MthwRobinson commented 6 months ago

Hi @jiarongkoh - thanks for the issue! We haven't supported image extraction from HTML in the past because images in HTML are linked rather than embedded directly in the document. We'll revisit internally though and follow up.
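To illustrate the point about linked images: an HTML file usually carries only a reference to each image (the src attribute of an img tag), not the image bytes, unless the image is inlined as a data: URI. Below is a minimal sketch of collecting those references with Python's standard library only. It is not part of unstructured and does not use partition_html; the names ImageRefCollector and collect_image_refs, and the base_url parameter, are hypothetical and chosen for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageRefCollector(HTMLParser):
    """Hypothetical helper: records src/alt for every <img> tag seen."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag != "img":
            return
        attr_map = dict(attrs)
        src = attr_map.get("src")
        if not src:
            return  # an <img> without a src carries no image reference
        self.images.append({
            # Resolve relative links against the page URL, if one is given.
            "src": urljoin(self.base_url, src),
            "alt": attr_map.get("alt") or "",
            # data: URIs hold the image bytes inline; everything else is a link.
            "embedded": src.startswith("data:"),
        })


def collect_image_refs(html, base_url=""):
    """Return a list of dicts describing the images referenced in html."""
    parser = ImageRefCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.images
```

Calling collect_image_refs(html, base_url="https://example.com/docs/") returns one dict per image. Entries with embedded set to False would still need a separate download step, which is the part a pure HTML partitioner cannot do from the file alone.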

harshsavasil commented 4 months ago

@MthwRobinson do you plan to implement this feature anytime soon?