adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

It's set include_images=True, but there is no picture #610

Open dark2star opened 3 months ago

dark2star commented 3 months ago

That's my code:

from trafilatura import fetch_url, extract

url = 'https://shumeipai.nxez.com/2020/06/11/stanford-pupper-assembly-tutorial.html'
downloaded = fetch_url(url)
result = extract(downloaded, output_format='markdown', favor_recall=True, include_images=True, include_links=True)

adbar commented 3 months ago

I can indeed reproduce the bug. Images are not my priority; the corresponding code mostly consists of a series of contributions, and it's not perfect. Let's see if someone can improve on this.

dark2star commented 3 months ago

Thank you very much. I found that images are missing from the output for most of the sites I process; this page is just one example.
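To check how widespread the problem is, one option is to compare the number of images in the raw HTML with the number that survive in the extracted Markdown. The sketch below is only an illustration: the helper names `count_html_images` and `count_markdown_images` are hypothetical, and it assumes that extracted images appear in the output in Markdown image syntax (`![alt](src)`).

```python
import re
from html.parser import HTMLParser


class ImgCounter(HTMLParser):
    """Counts <img> tags that carry a non-empty src attribute."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; self-closing <img/> tags also end up here.
        if tag == "img" and dict(attrs).get("src"):
            self.count += 1


def count_html_images(html: str) -> int:
    """Number of <img src=...> tags in the downloaded page (hypothetical helper)."""
    parser = ImgCounter()
    parser.feed(html)
    return parser.count


def count_markdown_images(markdown: str) -> int:
    """Number of ![alt](src) image references in the extracted text (hypothetical helper)."""
    return len(re.findall(r"!\[[^\]]*\]\([^)]+\)", markdown))
```

Calling `count_html_images(downloaded)` and `count_markdown_images(result)` on the values from the snippet above gives a rough per-page comparison. The HTML count includes images in navigation or ads, so the two numbers are not expected to match exactly.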

altblog commented 3 months ago

Try it:

from trafilatura import fetch_url, extract
import re

url = 'https://shumeipai.nxez.com/2020/06/11/stanford-pupper-assembly-tutorial.html'
downloaded = fetch_url(url)

# Replace each <img> tag with a plain-text placeholder that carries its src,
# so the image location survives extraction as ordinary text.
img_src_regex = r'<img[^>]+src="([^"]+)"[^>]*>'
def replace_img_tags(match):
    src = match.group(1)
    return f'111222333000-{src}-000333222111'

downloaded = re.sub(img_src_regex, replace_img_tags, downloaded)

result = extract(downloaded, output_format='markdown', favor_recall=True, include_images=True, include_links=True)

# Turn the placeholders back into <img> tags in the extracted output.
result = result.replace('111222333000-', '<img src="')
result = result.replace('-000333222111', '">')
print(result)

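
The same placeholder idea can be wrapped in small helpers so that it can be tried on an HTML string without downloading anything. This is a minimal sketch, not part of trafilatura: the names `protect_images`, `restore_images` and `extract_with_images` and the marker strings are hypothetical, and restoring to Markdown image syntax rather than `<img>` tags is a choice made here. It also accepts single-quoted `src` attributes and skips `data-src`. Placeholders inside elements that the extractor discards are still lost.

```python
import re

# Hypothetical marker strings; any text unlikely to occur in a page would do.
START, END = "IMGSTART-", "-IMGEND"

# <img> tags with a double- or single-quoted src attribute (data-src is skipped).
IMG_TAG = re.compile(
    r'<img\b[^>]*?(?<![\w-])src\s*=\s*(["\'])(.*?)\1[^>]*>',
    re.IGNORECASE | re.DOTALL,
)
PLACEHOLDER = re.compile(re.escape(START) + r"(.*?)" + re.escape(END))


def protect_images(html: str) -> str:
    """Replace each <img> tag with a plain-text placeholder carrying its src."""
    return IMG_TAG.sub(lambda m: f"{START}{m.group(2)}{END}", html)


def restore_images(text: str) -> str:
    """Turn the placeholders back into Markdown image references."""
    return PLACEHOLDER.sub(lambda m: f"![]({m.group(1)})", text)


def extract_with_images(url: str):
    """Fetch a page and extract Markdown, keeping image sources via placeholders."""
    # Imported here so the two helpers above can be used without trafilatura.
    from trafilatura import fetch_url, extract

    downloaded = fetch_url(url)
    if downloaded is None:  # the download failed
        return None
    result = extract(protect_images(downloaded), output_format="markdown", include_links=True)
    return restore_images(result) if result else result
```

With these helpers, `print(extract_with_images(url))` replaces the script above. `fetch_url` returns `None` when a download fails, which the original snippet does not check for.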
dark2star commented 3 months ago

Thanks, it worked. I also modified the trafilatura source code and was able to solve part of the problem, but as I kept using it I realized that most URLs still didn't work well. Too many site-specific adaptations were needed, so I gave up.

adbar commented 1 month ago

For further reference: see also https://github.com/adbar/trafilatura/issues/662.