adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

Is there a way to extract a top image from an article? #57

Closed ArturasDruteika closed 10 months ago

ArturasDruteika commented 3 years ago
article_trafilatura = trafilatura.bare_extraction(trafilatura.fetch_url(url), include_images=True)

When I set include_images=True, I get the src of images in the article, but the src of the top article image seems to be missing. Is this by design, is it not implemented yet, or am I doing something wrong? I have tried URLs from several different websites, but I still cannot find a way to extract that image.

adbar commented 3 years ago

Hi, trafilatura is optimized for text retrieval: if the images you're looking for are not within the main text, they will not appear in the processed document. Can you please give me an example of a webpage where the top image is missing?
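
Since images outside the main text are not part of the processed document, one possible workaround is to read the lead image from the page metadata instead. The following is a minimal sketch using only the Python standard library. It assumes the page declares its top image in an og:image or twitter:image meta tag, which many news sites do but not all; TopImageParser and find_top_image are hypothetical names, not part of trafilatura.

```python
from html.parser import HTMLParser
from typing import Optional


class TopImageParser(HTMLParser):
    """Collect the first og:image / twitter:image URL declared in <meta> tags."""

    # Meta keys that commonly carry a page's lead image. This is an assumption
    # about the target pages, not something trafilatura guarantees.
    IMAGE_KEYS = ("og:image", "twitter:image")

    def __init__(self) -> None:
        super().__init__()
        self.image_url: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only <meta> tags are relevant, and the first match wins.
        if tag != "meta" or self.image_url is not None:
            return
        attributes = dict(attrs)
        key = attributes.get("property") or attributes.get("name")
        if key in self.IMAGE_KEYS and attributes.get("content"):
            self.image_url = attributes["content"]


def find_top_image(html: str) -> Optional[str]:
    """Return the lead image URL from the page metadata, or None if absent."""
    parser = TopImageParser()
    parser.feed(html)
    return parser.image_url


# Hypothetical sample page used only for illustration.
sample = (
    '<html><head>'
    '<meta property="og:image" content="https://example.org/lead.jpg">'
    '</head><body><p>Text</p></body></html>'
)
print(find_top_image(sample))
```

In practice the HTML string could be the one returned by trafilatura.fetch_url(url), so the page is downloaded only once and the metadata lookup runs alongside the text extraction.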

ArturasDruteika commented 3 years ago

@adbar For example, this article: https://edition.cnn.com/2021/03/08/uk/meghan-harry-oprah-interview-recap-scli-gbr-intl/index.html. Maybe I am missing something, but aren't all the article photos inside the main text frame?

adbar commented 3 years ago

Thanks for the input. This is indeed a bug and I'll look into it. Do you have other similar examples?

adbar commented 3 years ago

@ArturasDruteika Do you have other similar examples?

AliaHamwi commented 3 years ago

Hello, I have these examples: https://thepopinsider.com/news/shang-chi-trailer/ and https://www.cnet.com/news/explosive-marvel-shang-chi-trailer-reveals-the-new-mandarin/

adbar commented 3 years ago

The second one now works but not this one: https://thepopinsider.com/news/shang-chi-trailer/

phongtnit commented 3 years ago

Hi @adbar, please check this URL: https://www.scmp.com/lifestyle/families/article/2160779/three-reasons-more-chinese-indonesians-are-learning-mandarin-and . When I fetch the article with include_images=True, the graphic tags have no src attribute. The result looks like this:

<graphic alt="Kevin Mardhi and his wife, Martha Tanudjaya, with their five-year-old daughter Ziva in Jakarta. Mardhi speaks Bahasa Indonesia and English, while Tanudjaya also speaks Chinese. Ziva is learning Chinese and English. Photo: Andra Fembriarto"/>
<graphic alt="Kevin Mardhi and his wife, Martha Tanudjaya, with their five-year-old daughter Ziva in Jakarta. Mardhi speaks Bahasa Indonesia and English, while Tanudjaya also speaks Chinese. Ziva is learning Chinese and English. Photo: Andra Fembriarto"/>
<p>Kevin Mardhi and his wife, Martha Tanudjaya, with their five-year-old daughter Ziva in Jakarta. Mardhi speaks Bahasa Indonesia and English, while Tanudjaya also speaks Chinese. Ziva is learning Chinese and English. Photo: Andra Fembriarto</p>
<head rend="h1">Why more Chinese Indonesians are learning Mandarin, and nurturing their children&#8217;s sense of belonging to Chinese culture</head>
<p>Their culture was repressed for decades under Suharto&#8217;s anti-Chinese policy, but nowadays Chinese Indonesians are learning Mandarin and educating their children in the language. While most identify as Indonesian, China&#8217;s rise makes them proud</p>
<graphic alt="Kevin Mardhi and his wife, Martha Tanudjaya, with their five-year-old daughter Ziva in Jakarta. Mardhi speaks Bahasa Indonesia and English, while Tanudjaya also speaks Chinese. Ziva is learning Chinese and English. Photo: Andra Fembriarto"/>
<graphic alt="Kevin Mardhi and his wife, Martha Tanudjaya, with their five-year-old daughter Ziva in Jakarta. Mardhi speaks Bahasa Indonesia and English, while Tanudjaya also speaks Chinese. Ziva is learning Chinese and English. Photo: Andra Fembriarto"/>
<p>Kevin Mardhi and his wife, Martha Tanudjaya, with their five-year-old daughter Ziva in Jakarta. Mardhi speaks Bahasa Indonesia and English, while Tanudjaya also speaks Chinese. Ziva is learning Chinese and English. Photo: Andra Fembriarto</p>

The same happens with https://scrapingant.com/blog/scrape-dynamic-website-with-python:

...
<graphic alt="Oleg Kulyk"/>
...
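
To check how many images lose their source in a given result, the output can be scanned for graphic elements without a src attribute. This is a rough diagnostic sketch using the standard library only. It assumes the XML output is a well-formed document with a single root element; the doc/main wrapper in the sample is a simplified stand-in, and graphics_missing_src is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from typing import List


def graphics_missing_src(xml_output: str) -> List[str]:
    """Return the alt text of every <graphic> element that has no src attribute."""
    root = ET.fromstring(xml_output)
    return [
        graphic.get("alt", "")
        for graphic in root.iter("graphic")
        if not graphic.get("src")
    ]


# Simplified, hypothetical output used only for illustration.
sample_output = (
    '<doc><main>'
    '<graphic alt="Oleg Kulyk"/>'
    '<graphic src="https://example.org/a.png" alt="With source"/>'
    '</main></doc>'
)
print(graphics_missing_src(sample_output))
```

Running this over the XML produced for several URLs makes it easier to tell pages where images are dropped entirely from pages where the element survives but its src does not.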

adbar commented 3 years ago

Hi @phongtnit, the SCMP article you mention is a different bug; I opened another thread for it.

The second link above and the others mentioned earlier are now fixed in a4d0bd1dc903d79b812f897d9e7cfe07ba6550f6. @phongtnit @ArturasDruteika @AliaHamwi

pranav-profjim commented 1 year ago

Hi, I know this is an old thread, but I have several examples where it fails to fetch images:

  1. https://openstax.org/books/anatomy-and-physiology/pages/6-1-the-functions-of-the-skeletal-system#fig-ch06_01_04
  2. https://openstax.org/books/anatomy-and-physiology/pages/1-1-overview-of-anatomy-and-physiology

Basically, all images on that domain seem to be missing from the output XML.

adbar commented 1 year ago

Hi @pranav-profjim, images are not my priority, but I am reopening the thread and will have a look at your cases later.

pofey commented 1 year ago

<p><img src="/assets/dashboard_apikeys.fff14e02.jpg" alt="API Keys"></p>

Test page: https://docs.aiproxy.io/guide/quick-start

This type of link also cannot be resolved. Is it possible to fix it, or can you provide suggestions for improvement?

adbar commented 1 year ago

@pofey Images are not my priority and I'm not sure how to fix it, but let's keep the issue open to have a look at it.
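
If the difficulty with the src reported above is that it is a site-relative path, one possible workaround outside the library is to resolve it against the page URL after extraction. The sketch below uses urllib.parse.urljoin from the standard library; resolve_image_src is a hypothetical helper name, and whether this matches the actual cause of the failure is an assumption.

```python
from urllib.parse import urljoin


def resolve_image_src(page_url: str, src: str) -> str:
    """Turn a relative image src into an absolute URL, using the page URL as base."""
    return urljoin(page_url, src)


print(resolve_image_src(
    "https://docs.aiproxy.io/guide/quick-start",
    "/assets/dashboard_apikeys.fff14e02.jpg",
))
# https://docs.aiproxy.io/assets/dashboard_apikeys.fff14e02.jpg
```

Absolute src values pass through urljoin unchanged, so the same helper can be applied to every image found on a page.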