Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
7.49k stars 584 forks

Parsing HTML files #3150

Closed vinodhsiyer20 closed 2 weeks ago

vinodhsiyer20 commented 1 month ago

For HTML pages that consist mostly of JavaScript tags, partition_html returns an empty list of elements, so I am not able to parse the file. Is there a way to process such pages? Also, is there a way to get image information in the metadata, or a base64 encoding of the image?

scanny commented 1 month ago

@vinodhsiyer20 please provide a sample HTML file with the characteristics you mention along with an idea of what you would expect to see in the output.

vinodhsiyer20 commented 1 month ago

Hi, for certain pages like https://docs.unstructured.io/api-reference/api-services/document-elements, https://www.dell.com/en-in/work/lp/dt/multicloud-services, and https://support.hp.com/us-en/help/computer/windows-operating-system-issues I am able to get elements and metadata. But for some pages, like https://support.hp.com/us-en/document/c04678145/default.html and https://www.lenovo.com/in/en/, partition_html does not return any elements. I get an empty list.

Thanks Vinodh S



tbs17 commented 4 weeks ago

@vinodhsiyer20, thanks for sharing the examples. I'm able to reproduce your error. I'm checking with the internal team on whether there are known exceptions when parsing HTML.

tbs17 commented 4 weeks ago

@vinodhsiyer20, I got an answer from our internal team.

There's an exception: if a web page generates its content via JavaScript, our system cannot process it, because the downloaded HTML does not yet contain the rendered text.
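One rough way to spot such pages before partitioning is to count how much text sits outside script and style tags in the raw HTML. The sketch below uses only the Python standard library; the names `VisibleTextCounter` and `looks_javascript_rendered` and the 200-character threshold are assumptions made for illustration, not part of unstructured.

```python
from html.parser import HTMLParser


class VisibleTextCounter(HTMLParser):
    """Counts characters of text found outside script/style-like tags."""

    # Tags whose contents are not rendered as page text.
    SKIPPED = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.visible_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only count text that is not nested inside a skipped tag.
        if self._skip_depth == 0:
            self.visible_chars += len(data.strip())


def looks_javascript_rendered(html, min_visible_chars=200):
    """Heuristic: True when the static HTML carries almost no visible text.

    The threshold is an arbitrary assumption; tune it for your pages.
    """
    counter = VisibleTextCounter()
    counter.feed(html)
    counter.close()
    return counter.visible_chars < min_visible_chars
```

A page flagged by this check is a candidate for rendering in a headless browser first. The heuristic can misfire on very short static pages, so treat it as a hint rather than a definitive test.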

As a workaround, you could do the following:

You can use headless Chrome to render the page with some of its JavaScript run, and save the result locally: https://developer.chrome.com/blog/headless-chrome/

Example:

/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --headless 'https://www.hsdl.org/c/abstract/?docid=875199' --virtual-time-budget=5000 --dump-dom

The --virtual-time-budget=5000 flag gives the headless browser a budget of 5 seconds to load and run scripts before it freezes the page and dumps the contents of the headless browser window.

If you are comfortable with JavaScript, you can also use Puppeteer to grab content from the page: https://pptr.dev/

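import puppeteer from "puppeteer"; // run as an ES module so top-level await works
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto("https://www.hsdl.org/c/abstract/?docid=875199");
const bodyHTML = await page.evaluate(() => document.body.innerHTML);
await browser.close();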

With either approach, you will have to save the rendered HTML to disk and point partition_html at the local HTML file instead of the URL.
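For an automated pipeline, the render-then-partition steps can be chained from Python. This is a minimal sketch, assuming Chrome is installed at the macOS path used in the example command and that the unstructured package is available; the helper names (`build_dump_dom_command`, `render_to_file`, `partition_rendered_page`) and the 120-second timeout are hypothetical choices, not part of any library.

```python
import subprocess
from pathlib import Path

# Assumption: macOS install path taken from the example command; adjust per platform.
CHROME = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"


def build_dump_dom_command(url, budget_ms=5000, chrome=CHROME):
    """Build the headless Chrome command line that prints the rendered DOM."""
    return [
        chrome,
        "--headless",
        f"--virtual-time-budget={budget_ms}",
        "--dump-dom",
        url,
    ]


def render_to_file(url, out_path, budget_ms=5000, chrome=CHROME):
    """Run headless Chrome and save the dumped DOM to out_path."""
    result = subprocess.run(
        build_dump_dom_command(url, budget_ms, chrome),
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,   # raise if Chrome exits with an error
        timeout=120,  # assumption: generous upper bound on wall-clock time
    )
    out_path = Path(out_path)
    out_path.write_text(result.stdout, encoding="utf-8")
    return out_path


def partition_rendered_page(url, out_path="rendered.html"):
    """Render the page with headless Chrome, then partition the saved file."""
    # Imported here so the helpers above work without unstructured installed.
    from unstructured.partition.html import partition_html

    html_file = render_to_file(url, out_path)
    return partition_html(filename=str(html_file))
```

Calling `partition_rendered_page("https://www.hsdl.org/c/abstract/?docid=875199")` would then return elements from the rendered page, provided the page finishes loading within the time budget.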

Hope this helps!

vinodhsiyer20 commented 3 weeks ago

Hi, thanks, I'll check it out. I am looking for an automated pipeline. Currently I have some alternatives using GPT-4o or agents.

Thanks Vinodh S

