Closed ArturasDruteika closed 10 months ago
Hi, trafilatura is optimized for text retrieval: if the images you're looking for are not within the main text, they will not appear in the processed document. Can you please give me an example of a webpage where the top image is missing?
@adbar For example, this article: https://edition.cnn.com/2021/03/08/uk/meghan-harry-oprah-interview-recap-scli-gbr-intl/index.html. Maybe I am missing something, but aren't all the article photos inside the main text frame?
Thanks for the input, this is indeed a bug, I'll look into it. Do you have other similar examples?
@ArturasDruteika Do you have other similar examples?
The second one now works but not this one: https://thepopinsider.com/news/shang-chi-trailer/
Hi @adbar, please check this URL: https://www.scmp.com/lifestyle/families/article/2160779/three-reasons-more-chinese-indonesians-are-learning-mandarin-and
When I fetch the article with include_images=True, the graphic tags have no src attribute. In fact, the result looks like this:
<graphic alt="Kevin Mardhi and his wife, Martha Tanudjaya, with their five-year-old daughter Ziva in Jakarta. Mardhi speaks Bahasa Indonesia and English, while Tanudjaya also speaks Chinese. Ziva is learning Chinese and English. Photo: Andra Fembriarto"/>
<graphic alt="Kevin Mardhi and his wife, Martha Tanudjaya, with their five-year-old daughter Ziva in Jakarta. Mardhi speaks Bahasa Indonesia and English, while Tanudjaya also speaks Chinese. Ziva is learning Chinese and English. Photo: Andra Fembriarto"/>
<p>Kevin Mardhi and his wife, Martha Tanudjaya, with their five-year-old daughter Ziva in Jakarta. Mardhi speaks Bahasa Indonesia and English, while Tanudjaya also speaks Chinese. Ziva is learning Chinese and English. Photo: Andra Fembriarto</p>
<head rend="h1">Why more Chinese Indonesians are learning Mandarin, and nurturing their children’s sense of belonging to Chinese culture</head>
<p>Their culture was repressed for decades under Suharto’s anti-Chinese policy, but nowadays Chinese Indonesians are learning Mandarin and educating their children in the language. While most identify as Indonesian, China’s rise makes them proud</p>
<graphic alt="Kevin Mardhi and his wife, Martha Tanudjaya, with their five-year-old daughter Ziva in Jakarta. Mardhi speaks Bahasa Indonesia and English, while Tanudjaya also speaks Chinese. Ziva is learning Chinese and English. Photo: Andra Fembriarto"/>
<graphic alt="Kevin Mardhi and his wife, Martha Tanudjaya, with their five-year-old daughter Ziva in Jakarta. Mardhi speaks Bahasa Indonesia and English, while Tanudjaya also speaks Chinese. Ziva is learning Chinese and English. Photo: Andra Fembriarto"/>
<p>Kevin Mardhi and his wife, Martha Tanudjaya, with their five-year-old daughter Ziva in Jakarta. Mardhi speaks Bahasa Indonesia and English, while Tanudjaya also speaks Chinese. Ziva is learning Chinese and English. Photo: Andra Fembriarto</p>
The same happens with this link: https://scrapingant.com/blog/scrape-dynamic-website-with-python
...
<graphic alt="Oleg Kulyk"/>
...
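For anyone who wants to check their own pages for this symptom, here is a minimal sketch that lists graphic elements lacking a src attribute. It assumes the extraction result is a well-formed XML string; the helper name graphics_missing_src and the sample document are made up for illustration and are not part of trafilatura.

```python
import xml.etree.ElementTree as ET
from typing import List


def graphics_missing_src(xml_output: str) -> List[str]:
    """Return the alt text of every <graphic> element that has no src attribute.

    Hypothetical helper for diagnosing the symptom reported in this thread.
    """
    root = ET.fromstring(xml_output)
    missing = []
    for graphic in root.iter("graphic"):
        # An absent or empty src is treated as missing.
        if not graphic.get("src"):
            missing.append(graphic.get("alt", ""))
    return missing


# Made-up sample mimicking the shape of the output quoted above.
sample = (
    "<doc><main>"
    '<graphic alt="Oleg Kulyk"/>'
    '<graphic src="https://example.org/a.jpg" alt="ok"/>'
    "</main></doc>"
)
print(graphics_missing_src(sample))  # ['Oleg Kulyk']
```

An empty list means every graphic element in the output carries a src.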
Hi @phongtnit, the SCMP article you mention is a different bug, I opened another thread.
The second link above and the others mentioned earlier are now fixed in a4d0bd1dc903d79b812f897d9e7cfe07ba6550f6. @phongtnit @ArturasDruteika @AliaHamwi
Hi, I know this is an old thread, but I have several examples where it fails to fetch images:
Basically, all images on that domain seem to be missing from the output XML.
Hi @pranav-profjim, images are not my priority, but I'm reopening the thread and will have a look at your cases later.
<p><img src="/assets/dashboard_apikeys.fff14e02.jpg" alt="API Keys"></p>
test page: https://docs.aiproxy.io/guide/quick-start
This type of link also cannot be resolved. Is it possible to fix it, or can you provide suggestions for improvement?
@pofey Images are not my priority, I'm not sure how to fix it but let's keep the issue open to have a look at it.
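If the difficulty is that the src is relative (as in the /assets/... example above), one possible workaround is to resolve it against the page URL in a post-processing step. This is only a sketch using the standard library; the helper name resolve_image_src is hypothetical and not part of trafilatura.

```python
from urllib.parse import urljoin


def resolve_image_src(page_url: str, src: str) -> str:
    """Resolve a possibly relative image src against the URL of the page.

    Absolute URLs are returned unchanged by urljoin.
    """
    return urljoin(page_url, src)


page = "https://docs.aiproxy.io/guide/quick-start"
print(resolve_image_src(page, "/assets/dashboard_apikeys.fff14e02.jpg"))
# https://docs.aiproxy.io/assets/dashboard_apikeys.fff14e02.jpg
```

This does not address why the image is dropped during extraction; it only helps once a relative src has been obtained.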
When I set include_images=True, I get the src of images in the article, but the src of the top article image seems to be missing. Is it meant to be this way, is it not implemented yet, or am I doing something wrong? I have tried passing different URLs from different websites, but I still cannot find a way to extract that image.
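A possible fallback for the top image, independent of trafilatura's own extraction: many news sites declare their lead image in an Open Graph og:image meta tag, which can be read straight from the raw HTML. The sketch below uses only the standard library; the names _OgImageParser and top_image_from_metadata are hypothetical, and it assumes the page actually provides such a tag.

```python
from html.parser import HTMLParser
from typing import Optional


class _OgImageParser(HTMLParser):
    """Collects the content of the first <meta property="og:image"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.image: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only the first og:image is kept; later ones are ignored.
        if tag != "meta" or self.image is not None:
            return
        attr = dict(attrs)
        if attr.get("property") == "og:image" and attr.get("content"):
            self.image = attr["content"]


def top_image_from_metadata(html: str) -> Optional[str]:
    """Return the og:image URL declared in the HTML, or None if absent."""
    parser = _OgImageParser()
    parser.feed(html)
    return parser.image


# Made-up HTML for illustration.
html_doc = (
    "<html><head>"
    '<meta property="og:image" content="https://example.org/lead.jpg"/>'
    "</head><body><p>Text</p></body></html>"
)
print(top_image_from_metadata(html_doc))  # https://example.org/lead.jpg
```

Pages without the tag yield None, so the result should be checked before use.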