extractus / article-extractor

To extract main article from given URL with Node.js
https://extractor-demos.pages.dev/article-extractor
MIT License
1.46k stars, 132 forks

Using Playwright or Puppeteer does not work for me with extractFromHtml() #389

Closed onigetoc closed 1 month ago

onigetoc commented 2 months ago

Using Playwright or Puppeteer does not work for me with extractFromHtml().

I tried everything, such as:

To get the HTML to send to article-extractor, I used: const contentHTML = await page.locator('html').innerHTML(); OR const htmlPageContent = await page.content();

For testing, where I use Playwright, I tried: const contentExtractHtmlArticle = await articleExtractor(htmlPageContent);

I created an extract.ts file with an exported function.

I tried this in my export file: const { content } = await extractFromHtml(html); OR const { content } = await extractFromHtml(String(html));

I import my function where I use Playwright, just to test whether I got an async/await error. If I extract from a URL in my exported function instead, which I tried in the section below with extract(url), it does work.

But it did not work when sending HTML:

import { extractFromHtml } from '@extractus/article-extractor';

export async function articleExtractor(html) {
  try {
    const { content } = await extractFromHtml(html);
    // GC: isHTML is my own helper in utils.js. NOTE: it has nothing to do with the error; I removed this isHTML part to test further.
    if (isHTML(content)) {
      console.log('HTML found');
      return content;
    } else {
      console.log('HTML NOT found');
      return 'text/html not found';
    }
  } catch (error) {
    console.error('Error extracting text from HTML:', error);
    return 'Error extracting text from HTML';
  }
}

ndaidong commented 1 month ago

@onigetoc I've added an example with Puppeteer; please refer to it and try to apply it to your scenario.
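
For readers following the thread, here is a minimal sketch of the pattern being discussed: read the rendered HTML from a browser page and hand it to extractFromHtml(). It is not the maintainer's example. The names extractFromPage, PageLike and ExtractFn are hypothetical. The sketch assumes that extractFromHtml accepts the page URL as an optional second argument and may resolve to null when no article is found, in which case destructuring { content } directly would throw. The extractor is passed in as a parameter so the helper can be tried with a stub, without a browser.

```typescript
// Minimal shape shared by Playwright's and Puppeteer's Page objects
// (only the two methods this sketch relies on).
interface PageLike {
  content(): Promise<string>; // full serialized HTML of the page
  url(): string;              // URL the page is currently showing
}

// Assumed shape of extractFromHtml: (html, url?) => Promise<article | null>.
type ExtractFn = (
  html: string,
  url?: string
) => Promise<{ content?: string } | null>;

// Hypothetical helper: returns the extracted article HTML, or null when
// the extractor finds nothing, instead of throwing on a null result.
async function extractFromPage(
  page: PageLike,
  extractFromHtml: ExtractFn
): Promise<string | null> {
  const html = await page.content();
  // Passing the URL gives the extractor a base for resolving relative links.
  const article = await extractFromHtml(html, page.url());
  return article?.content ?? null;
}
```

In a Playwright or Puppeteer script this could be called as await extractFromPage(page, extractFromHtml) once page.goto(...) has resolved, with extractFromHtml imported from @extractus/article-extractor. A null return then points to the extractor not recognizing an article in the HTML, which is a different problem from an async/await mistake.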