feediron / ttrss_plugin-feediron

Evolution of ttrss_plugin-af_feedmod
https://discourse.tt-rss.org/t/plugin-update-feediron-v1-2-0/2018
MIT License

New York Times Images Not Loading #181

Closed andonagio closed 2 years ago

andonagio commented 2 years ago

Expected Behavior

When getting the content of a New York Times article, the entire article should load, including any images.

Current Behavior

Everything but the images loads.

Steps to Reproduce

  1. Create a filter for nytimes.com. A basic one looks like this:

     "nytimes.com": {
         "type": "xpath",
         "xpath": "article"
     }

  2. Put it in FeedIron.
  3. Run FeedIron against any nytimes.com RSS feed.

When I check the XML content of the resulting RSS feed, the picture and img tags from the website are not present in the feed's contents. If I wget the article's URL, those tags are missing from the result as well. The issue seems to be that the New York Times website doesn't deliver the images together with the article. Perhaps it's doing some kind of lazy loading? I can't tell, but I'd like to know if there is a way around this problem.
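
For anyone who wants to repeat that check without reading the source by eye, here is a minimal sketch that counts the img and picture tags in a saved copy of the page. It uses only Python's standard library; the names `ImageTagCounter` and `count_image_tags` are made up for this example and are not part of FeedIron.

```python
from html.parser import HTMLParser


class ImageTagCounter(HTMLParser):
    """Counts <img> and <picture> start tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = {"img": 0, "picture": 0}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag in self.counts:
            self.counts[tag] += 1


def count_image_tags(html_text):
    """Return a dict with the number of <img> and <picture> tags found."""
    parser = ImageTagCounter()
    parser.feed(html_text)
    parser.close()
    return dict(parser.counts)
```

Running `count_image_tags` on the text of a file saved with wget (the file name is whatever you chose when saving) and getting zero for both tags would confirm that the images are not in the HTML the server sends.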

dugite-code commented 2 years ago

I believe the issue is the presence of the srcset attribute. This is a common problem when fetching images. Try using the modify block to rename that attribute to null; this appears to work for me.

{
    "type": "xpath",
    "xpath": "article",
    "modify": [
        {
            "type": "replace",
            "search": [
                "srcset="
            ],
            "replace": [
                "null="
            ]
        }
    ]
}
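
To illustrate the effect of that modify block, here is a rough sketch in Python. It assumes the "replace" type behaves like a plain substring substitution applied pairwise to the search and replace lists; `apply_replace` is a hypothetical helper for this example, not FeedIron's actual code.

```python
def apply_replace(html_text, search, replace):
    """Apply each search/replace pair in order as plain substring replacement."""
    # Assumption: pairs are matched by position, search[i] -> replace[i].
    for old, new in zip(search, replace):
        html_text = html_text.replace(old, new)
    return html_text


sample = '<img src="small.jpg" srcset="large.jpg 2x">'
print(apply_replace(sample, ["srcset="], ["null="]))
# prints: <img src="small.jpg" null="large.jpg 2x">
```

With srcset renamed to an attribute that means nothing to the reader, only the plain src attribute is left to load the image from. This only helps when the img tag is already in the fetched HTML.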

andonagio commented 2 years ago

I'm able to get the article header image and the author bio image. However, any images embedded within the article are missing, and the above filter doesn't seem to help:

[Image: NYT_MissingImage]

I think the issue is that there is no image data for FeedIron to modify. I can see the image tags in my browser's inspection tool:

[Image: NYT_MissingImage_Website]

However, when I then look at the page's source, there are no picture or img tags to be found:

[Image: NYT_MissingImage_SourceCode]

This is also what I see when I try to download the source via wget or lynx. I'm confused as to how the site is handling images and how I can get around this.

Is anybody able to successfully get embedded images in NYT articles? If so, how?

dugite-code commented 2 years ago

So I did a wget on the page you used as an example. It looks like they are using a bunch of JavaScript to lazy-load the images. This used to be very common prior to the introduction of the loading="lazy" attribute.

There isn't a great method to get the rendered webpage outside of something like FlareSolverr or another headless browser setup. Currently there isn't anything built into FeedIron, although I have been planning to introduce modular fetching, similar to the filters, for a while now.
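
As a rough idea of what fetching through a headless browser could look like outside of FeedIron, here is a minimal Python sketch that asks a FlareSolverr instance for the rendered page. It assumes FlareSolverr is running locally on its default port and accepts a `request.get` command on its `/v1` endpoint; `build_request` and `fetch_rendered_html` are hypothetical helper names. Whether the lazy-loaded NYT images actually end up in the returned HTML is untested.

```python
import json
import urllib.request

# Assumption: a FlareSolverr instance is listening at this address.
FLARESOLVERR_URL = "http://localhost:8191/v1"


def build_request(page_url, endpoint=FLARESOLVERR_URL, max_timeout_ms=60000):
    """Build (but do not send) the POST request asking for a rendered page."""
    payload = {"cmd": "request.get", "url": page_url, "maxTimeout": max_timeout_ms}
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_rendered_html(page_url, endpoint=FLARESOLVERR_URL):
    """Send the request and return the HTML reported by the headless browser."""
    with urllib.request.urlopen(build_request(page_url, endpoint), timeout=90) as resp:
        body = json.load(resp)
    # Assumption: the rendered page is returned under solution.response.
    return body["solution"]["response"]
```

The returned HTML would still have to be fed into the feed by some other means, since FeedIron has no hook for an external fetcher yet.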