extractus / article-extractor

Extract the main article from a given URL with Node.js
https://extractor-demos.pages.dev/article-extractor
MIT License

`extractFromHtml` missed an `<h1>` in the `content` json result. #396

Open bryantwilliam opened 1 month ago

bryantwilliam commented 1 month ago

Using this code:

// A template literal (backticks) is used here because ''' is not valid JavaScript.
const html = `
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Sample Article Page</title>
</head>
<body>
  <article>
    <h1>Sample Article</h1>
    <p>This is a paragraph with some sample content.<br>The Next Line</p>
    <h2>List Example</h2>
    <ul>
      <li>List item 1</li>
      <li>List item 2</li>
      <li>List item 3</li>
    </ul>
    <h2>Table Example</h2>
    <table border="1">
      <tr>
        <th>Header 1</th>
        <th>Header 2</th>
      </tr>
      <tr>
        <td>Data 1</td>
        <td>Data 2</td>
      </tr>
    </table>
    <h2>Image Example</h2>
    <p><img src="https://hips.hearstapps.com/hmg-prod/images/bright-forget-me-nots-royalty-free-image-1677788394.jpg" alt="Flowers image"></p>
    <h2>IFrame Example</h2>
    <p><iframe width="520" height="300" src="https://www.youtube.com/embed/dQw4w9WgXcQ"></iframe></p>
    <h2>Video Example</h2>
    <video width="320" height="240" controls>
      <source src="http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerFun.mp4" type="video/mp4">
      Your browser does not support the video tag.
      <figcaption>Hello World</figcaption>
    </video>
    <h2>Another Video Example</h2>
    <video width="320" height="240" controls>
      <source src="http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ElephantsDream.mp4" type="video/mp4">
      Your browser does not support the video tag.
      <figcaption>Hello World</figcaption>
    </video>
    <h2>Another Random Image</h2>
    <p><img src="https://www.shutterstock.com/shutterstock/photos/2056485080/display_1500/stock-vector-address-and-navigation-bar-icon-business-concept-search-www-http-pictogram-d-concept-2056485080.jpg" alt="Flowers image"></p>
  </article>
</body>
</html>
`;
const url = "https://www.goodreads.com/book/show/58612786-100m-offers";

const { extractFromHtml } = await import('@extractus/article-extractor');
return await extractFromHtml(html, url);

Returns this:

{
  "url": "https://www.example.com/the-page-i-got-the-source-from",
  "title": "Sample Article Page",
  "description": "This is a paragraph with some sample content.The Next Line    List Example          List item 1      List item 2      List item 3        Table Example                  Header 1        Header...",
  "links": [
    "https://www.example.com/the-page-i-got-the-source-from"
  ],
  "image": "",
  "content": "<article>\n
    <p>This is a paragraph with some sample content.<br />The Next Line</p>\n
    <h2>List Example</h2>\n
    <ul>\n
      <li>List item 1</li>\n
      <li>List item 2</li>\n
      <li>List item 3</li>\n
    </ul>\n
    <h2>Table Example</h2>\n
    <table>\n
      <tr>\n
        <th>Header 1</th>\n
        <th>Header 2</th>\n
      </tr>\n
      <tr>\n
        <td>Data 1</td>\n
        <td>Data 2</td>\n
      </tr>\n
    </table>\n
    <h2>Image Example</h2>\n
    <p><img src=\"https://hips.hearstapps.com/hmg-prod/images/bright-forget-me-nots-royalty-free-image-1677788394.jpg\" alt=\"Flowers image\" /></p>\n
    <h2>IFrame Example</h2>\n
    <p><iframe width=\"520\" height=\"300\" src=\"https://www.youtube.com/embed/dQw4w9WgXcQ\"></iframe></p>\n
    <h2>Video Example</h2>\n
    <video width=\"320\" height=\"240\" controls>\n
      <source src=\"http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerFun.mp4\" type=\"video/mp4\"></source>\n
      Your browser does not support the video tag.\n
      <figcaption>Hello World</figcaption>\n
    </video>\n
    <h2>Another Video Example</h2>\n
    <video width=\"320\" height=\"240\" controls>\n
      <source src=\"http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ElephantsDream.mp4\" type=\"video/mp4\"></source>\n
      Your browser does not support the video tag.\n
      <figcaption>Hello World</figcaption>\n
    </video>\n
    <h2>Another Random Image</h2>\n
    <p><img src=\"https://www.shutterstock.com/shutterstock/photos/2056485080/display_1500/stock-vector-address-and-navigation-bar-icon-business-concept-search-www-http-pictogram-d-concept-2056485080.jpg\" alt=\"Flowers image\" /></p>\n
  </article>",
  "author": "",
  "favicon": "",
  "source": "example.com",
  "published": "",
  "ttr": 13,
  "type": ""
}

As you can see in the content, it's missing the <h1>Sample Article</h1> part at the top. And I can't see it anywhere else in the JSON.

I'm not sure whether this is expected behaviour, but I would like the <h1> to stay in the content, or at least be exposed as a separate field in the JSON.
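One way to get the heading as a separate field, without touching the extractor itself, is to read the first <h1> from the raw HTML and merge it into the result afterwards. The sketch below is only an illustration: `extractFirstHeading`, `withHeading`, and the `heading` field are hypothetical names, not part of @extractus/article-extractor, and the regex is a deliberately simple stand-in for a real HTML parser.

```javascript
// Hypothetical helper: returns the plain text of the first <h1> in an HTML
// string, or '' when there is none. Regex-based, so it is only suitable for
// simple, well-formed markup like the sample above.
function extractFirstHeading(html) {
  const match = /<h1\b[^>]*>([\s\S]*?)<\/h1>/i.exec(html)
  if (!match) return ''
  return match[1]
    .replace(/<[^>]+>/g, '') // drop inline tags such as <em> inside the heading
    .replace(/\s+/g, ' ')    // collapse runs of whitespace
    .trim()
}

// Hypothetical helper: copies the extractor's result and adds a `heading`
// field (an assumed name). A falsy result (e.g. null) is passed through as-is.
function withHeading(article, html) {
  if (!article) return article
  return { ...article, heading: extractFirstHeading(html) }
}
```

With the code from this issue, that would be `withHeading(await extractFromHtml(html, url), html)`, which for the sample page should give `heading: "Sample Article"` next to the existing fields.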

ndaidong commented 1 month ago

@bryantwilliam yep, the default algorithm has its limitations. This is a case where you need transformations.
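For reference, a transformation in this library is an object with `patterns` (regexes matched against the URL) and optional `pre` / `post` hooks that receive a document and return it, registered through `addTransformations`. Below is a rough sketch of how one could put the heading back. `makeKeepHeadingTransformation` is a hypothetical helper, and the sketch assumes that the `post` document exposes the extracted `<article>` (or a `body`) and that a re-inserted <h1> survives later clean-up; neither is confirmed in this thread.

```javascript
// Hypothetical factory: builds a transformation that remembers the first <h1>
// seen in `pre` and re-inserts it in `post` if extraction dropped it.
// Caveat: the captured text lives in a closure, so concurrent extractions
// that share one transformation could overwrite each other's heading.
function makeKeepHeadingTransformation(patterns) {
  let headingText = ''

  return {
    patterns,
    // `pre` sees the raw document before the extraction algorithm runs.
    pre: (document) => {
      const h1 = document.querySelector('h1')
      headingText = h1 ? h1.textContent.trim() : ''
      return document
    },
    // `post` sees the extracted content; add the heading back if it is gone.
    post: (document) => {
      if (headingText && !document.querySelector('h1')) {
        // Assumption: the extracted content is wrapped in <article> or <body>.
        const root = document.querySelector('article') || document.body
        if (root) {
          const h1 = document.createElement('h1')
          h1.textContent = headingText
          root.insertBefore(h1, root.firstChild)
        }
      }
      return document
    }
  }
}
```

Registering it would look like `addTransformations(makeKeepHeadingTransformation([/example\.com/]))` before calling `extractFromHtml(html, url)`, with the pattern adjusted to the URL that is actually passed in.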