It's set include_images=True, but there is no picture

dark2star commented 3 months ago

That's my code:

from trafilatura import fetch_url, extract

url = 'https://shumeipai.nxez.com/2020/06/11/stanford-pupper-assembly-tutorial.html'
downloaded = fetch_url(url)
result = extract(downloaded, output_format='markdown', favor_recall=True, include_images=True, include_links=True)

adbar commented 3 months ago

I can indeed reproduce the bug. Images are not my priority; the corresponding code mostly consists of a series of contributions, and it's not perfect. Let's see if someone can improve on this.

dark2star commented 3 months ago

Thank you very much. I found that images are missing for most of the sites I process; this is just one such case.

altblog commented 3 months ago

Try it:

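from trafilatura import fetch_url, extract
import re

url = 'https://shumeipai.nxez.com/2020/06/11/stanford-pupper-assembly-tutorial.html'
downloaded = fetch_url(url)  # returns None if the download fails
if downloaded is None:
    raise SystemExit(f'Could not download {url}')

# Swap each <img> tag for a plain-text placeholder that survives extraction.
img_src_regex = r'<img[^>]+src="([^"]+)"[^>]*>'
def replace_img_tags(match):
    src = match.group(1)
    return f'111222333000-{src}-000333222111'

downloaded = re.sub(img_src_regex, replace_img_tags, downloaded)

# extract() returns None when nothing is extracted, so fall back to an empty string.
result = extract(downloaded, output_format='markdown', favor_recall=True, include_images=True, include_links=True) or ''

# Turn the placeholders back into <img> tags in the extracted text.
result = result.replace('111222333000-', '<img src="').replace('-000333222111', '">')
print(result)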
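
The placeholder trick above can also be split into two small helpers, which makes it possible to check on a sample string without downloading anything. This is only a sketch of the same idea, not part of trafilatura: the function names `protect_images` and `restore_images` are hypothetical, the regex additionally accepts single-quoted `src` values, and the restore step emits Markdown image syntax instead of an `<img>` tag (an assumption about what is wanted in Markdown output).

```python
import re

# Marker strings reused from the workaround above; any text that is
# unlikely to occur in the page would do.
OPEN_MARK = "111222333000-"
CLOSE_MARK = "-000333222111"

# Matches an <img> tag and captures its src value (single or double quoted).
# The \s before "src" keeps attributes such as data-src from matching.
IMG_TAG = re.compile(
    r'<img\b[^>]*?\ssrc\s*=\s*(["\'])(.*?)\1[^>]*>',
    re.IGNORECASE | re.DOTALL,
)
PLACEHOLDER = re.compile(re.escape(OPEN_MARK) + r"(.+?)" + re.escape(CLOSE_MARK))


def protect_images(html: str) -> str:
    """Replace every <img> tag with a plain-text placeholder holding its src."""
    return IMG_TAG.sub(lambda m: f"{OPEN_MARK}{m.group(2)}{CLOSE_MARK}", html)


def restore_images(text: str) -> str:
    """Turn placeholders back into Markdown image references."""
    return PLACEHOLDER.sub(lambda m: f"![]({m.group(1)})", text)


if __name__ == "__main__":
    sample = '<p>Intro</p><img class="x" src="https://example.org/a.png" alt="A">'
    print(restore_images(protect_images(sample)))
```

In the script above, `protect_images(downloaded)` would be called before `extract(...)` and `restore_images(result)` afterwards. Whether the placeholders always pass through extraction unchanged is not guaranteed by anything in this thread.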

dark2star commented 3 months ago

Thanks, it worked. I also modified the trafilatura source code and was able to solve part of the problem, but while using it I realized that most URLs still didn't work properly. Too many adaptations were needed, so I gave up.
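
The thread does not say which adaptations were needed. One pattern that commonly defeats a `src`-only regex is lazy loading, where the real image URL sits in an attribute such as `data-src` and `src` holds a blank placeholder or is missing. A minimal sketch of a preprocessing step for that case follows. It assumes the attribute names listed in `LAZY_ATTRS` (they vary by site), and `promote_lazy_src` is a hypothetical helper, not a trafilatura function.

```python
import re

# Assumed lazy-loading attribute names; real sites may use others.
LAZY_ATTRS = ("data-src", "data-original", "data-lazy-src")

IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
# Matches a plain src attribute; the leading \s excludes data-src and similar.
SRC_ATTR = re.compile(r'\ssrc\s*=\s*(["\']).*?\1', re.IGNORECASE | re.DOTALL)


def promote_lazy_src(html: str) -> str:
    """Copy the first lazy-loading attribute found on each <img> into src."""

    def fix(match: re.Match) -> str:
        tag = match.group(0)
        for attr in LAZY_ATTRS:
            found = re.search(
                rf'\s{re.escape(attr)}\s*=\s*(["\'])(.*?)\1',
                tag,
                re.IGNORECASE | re.DOTALL,
            )
            if found:
                url = found.group(2)
                # Drop any existing src, then insert the real URL after "<img".
                tag = SRC_ATTR.sub("", tag)
                return f'{tag[:4]} src="{url}"{tag[4:]}'
        return tag  # no lazy attribute: leave the tag untouched

    return IMG_TAG.sub(fix, html)


if __name__ == "__main__":
    print(promote_lazy_src('<img data-src="real.jpg" src="blank.gif" class="lazy">'))
```

Running this on the downloaded HTML before any other image handling would give a `src`-based approach something to match. It does nothing for images injected by JavaScript after page load.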

adbar commented 1 month ago

For further reference: see also https://github.com/adbar/trafilatura/issues/662.

adbar / trafilatura

It's set include_images=True, but there is no picture #610