experius / SeoSnap

Server Side Rendering (SSR) for javascript applications
GNU General Public License v3.0

Caching partially rendered pages #4

Closed habitatcreative closed 2 years ago

habitatcreative commented 4 years ago

Hello,

I was able to install and run everything correctly, but I found something that I can't figure out how to solve.

It turns out that a lot of the pages that are being cached are not fully loaded. Some products have just the app shell and a loading indicator for the product itself, some have the main product details rendered, but other parts of the page showing the loading component.

Is there a time limit after which the page is saved, even if it's not finished loading or some kind of other setting? I was browsing the files, but couldn't find anything (I guess I am not looking in the right place)

Thanks in advance :)

ghost commented 4 years ago

Hello @habitatcreative

Can you check how Rendertron renders your page when you call it directly? http://127.0.0.1:3000/render/

Port 3000 is Rendertron directly; port 5000 is the cache layer in front of Rendertron.

It's also critical to implement the following status meta tags in your PWA:

<meta name="render:status_code" content="404" />
<meta name="render:status_code" content="500" />

etc...

These tell Rendertron that a page is not fully rendered because of an error in one of the components. @Jordaneisenburger knows how to implement this in PWA Studio.
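A minimal sketch of how a component could set this tag at runtime when it hits an error. The helper name `setRenderStatusCode` and the `DocLike`/`MetaLike` types are assumptions made for illustration; they are not part of Rendertron or PWA Studio. In a browser you would most likely pass `document` as the first argument.

```typescript
// Minimal structural types so the sketch runs without a DOM library.
interface MetaLike {
  setAttribute(name: string, value: string): void;
}
interface DocLike {
  querySelector(selector: string): MetaLike | null;
  createElement(tag: string): MetaLike;
  head: { appendChild(node: MetaLike): unknown };
}

// Hypothetical helper: create the render:status_code meta tag if it is
// missing, otherwise update the existing one, so only one tag is present.
function setRenderStatusCode(doc: DocLike, statusCode: number): void {
  const selector = 'meta[name="render:status_code"]';
  let meta = doc.querySelector(selector);
  if (meta === null) {
    meta = doc.createElement("meta");
    meta.setAttribute("name", "render:status_code");
    doc.head.appendChild(meta);
  }
  meta.setAttribute("content", String(statusCode));
}
```

An error boundary or error view could call `setRenderStatusCode(document, 500)` when it mounts, so the rendered HTML carries the status even though the app shell itself loaded fine.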

habitatcreative commented 4 years ago

Hello @dheesbeen

I have added the headers, thank you for pointing this out. At least now the error pages won't be saved.

I see two issues:

  1. My initial problem was that Rendertron saved pages that were still waiting for a GraphQL response and displaying a loading component. This is driving me crazy, since the Rendertron docs say:

"Auto detecting loading function: The service detects when a page has loaded by looking at the page load event, ensuring there are no outstanding network requests and that the page has had ample time to render."

They also say: "There is a hard limit of 10 seconds for rendering." Since I have no other explanation for why the page is saved half-rendered without any errors, I think this might be the cause, although 10 seconds is far too long for a GraphQL response.

Is there a way to not save pages that hit this render limit? I couldn't find anything on this matter, but I am kind of new to this, as you may have noticed.

  2. This is what I noticed today. I left the Cachewarmer to crawl the site overnight, and this morning I found that a lot of pages had been cached with the PWA Studio error page. Not sure if you are familiar with PWA Studio, but this is the ErrorView component with the InternalError message (not 404 or Out of Stock). Since there is no output showing what caused this, it is still a mystery 😆. Accessing Rendertron directly returned the same error page, while the real site was OK. What fixed this for me was restarting Rendertron, after which all pages rendered OK. I guess at some point during the night something went wrong, and from that point on all pages were saved as the error page.

I've used Puppeteer before, and I think browser.newPage() actually gets rid of the cache from the previous request. Correct me if I am wrong. Is there a way to use something like this here?
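For reference, as far as I know pages opened with browser.newPage() share the default browser context, so cache, cookies and storage carry over between them. One way to get a clean slate per render is a separate browser context. Below is a rough sketch under that assumption: `renderIsolated` is a hypothetical helper, the interfaces are structural stand-ins for the Puppeteer objects, and the method is called `createIncognitoBrowserContext()` in older Puppeteer releases (newer ones name it `createBrowserContext()`). Whether this would actually prevent the error pages is untested.

```typescript
// Structural stand-ins for the Puppeteer objects used here (assumed shapes).
interface PageLike {
  goto(url: string): Promise<unknown>;
  content(): Promise<string>;
}
interface ContextLike {
  newPage(): Promise<PageLike>;
  close(): Promise<void>;
}
interface BrowserLike {
  createIncognitoBrowserContext(): Promise<ContextLike>;
}

// Hypothetical helper: render one URL in a throwaway context so that cache,
// cookies and storage from earlier renders are not reused.
async function renderIsolated(browser: BrowserLike, url: string): Promise<string> {
  const context = await browser.createIncognitoBrowserContext();
  try {
    const page = await context.newPage();
    await page.goto(url);
    return await page.content();
  } finally {
    // Closing the context also closes the pages that belong to it.
    await context.close();
  }
}
```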

Sorry for the long reply and thanks for your time :)

habitatcreative commented 4 years ago

Hi @dheesbeen,

Ignore that first part, I cloned Rendertron separately and was able to edit the Puppeteer behaviour. Will try to use page.waitForSelector(). 👍
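Roughly what I have in mind, as a sketch rather than the actual Rendertron code: wait for a selector that only exists once the product has loaded, and report whether it appeared so the caller can skip caching half-rendered pages. `renderWhenReady`, `ReadyPage` and the `.product-ready` selector in the usage note are made-up names; the two methods mirror Puppeteer's `page.waitForSelector(selector, { timeout })` and `page.content()`.

```typescript
// Structural stand-in for the parts of a Puppeteer Page used below.
interface ReadyPage {
  waitForSelector(selector: string, options?: { timeout?: number }): Promise<unknown>;
  content(): Promise<string>;
}

interface RenderResult {
  html: string;
  complete: boolean; // false when the selector never appeared in time
}

// Hypothetical helper: serialize the page, flagging whether it finished loading.
async function renderWhenReady(
  page: ReadyPage,
  readySelector: string,
  timeoutMs: number
): Promise<RenderResult> {
  let complete = true;
  try {
    await page.waitForSelector(readySelector, { timeout: timeoutMs });
  } catch {
    // waitForSelector rejects on timeout; treat the page as not fully rendered.
    complete = false;
  }
  return { html: await page.content(), complete };
}
```

The caller could then do something like `const result = await renderWhenReady(page, ".product-ready", 10000)` and only store `result.html` when `result.complete` is true, or return a 5xx status so the cache layer skips it.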

But on the second issue: after crawling about 800 pages without any issues, Rendertron suddenly stops working correctly. I will include the output, since it looks a bit strange to me. I checked with the hosting provider and everything is OK: no drops, hangs, etc.

The excerpt below starts just before the moment Rendertron stops responding correctly.

`2020-03-05 19:33:02 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cacheserver:5000/render/https://www.opti-wohnwelt.de/armlehnstuhl-bilbao-172236-geflecht-halbrund-farbe-stone-grey-gest-alu-ca-62-87-59cm.html> {'address': '/armlehnstuhl-bilbao-172236-geflecht-halbrund-farbe-stone-grey-gest-alu-ca-62-87-59cm.html', 'content_type': 'text/html; charset=utf-8', 'status_code': 200, 'cache_status': 'cached', 'cached_at': '2020-03-05T19:33:02', 'extract_fields': {'Title': 'Opti-Wohnwelt | Möbelhaus & Onlineshop - Entdecken & Shoppen!'}}

2020-03-05 19:33:05 [scrapy.core.engine] DEBUG: Crawled (200) <PUT http://cacheserver:5000/render/https://www.opti-wohnwelt.de/wangentisch-bilbao-172206-geflecht-halbrund-farbe-stone-grey-gest-alu-ca-220-76-100cm.html> (referer: http://m2opti.habitatmade.com/product_sitemap.xml)

2020-03-05 19:33:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cacheserver:5000/render/https://www.opti-wohnwelt.de/wangentisch-bilbao-172206-geflecht-halbrund-farbe-stone-grey-gest-alu-ca-220-76-100cm.html> {'address': '/wangentisch-bilbao-172206-geflecht-halbrund-farbe-stone-grey-gest-alu-ca-220-76-100cm.html', 'content_type': 'text/html; charset=utf-8', 'status_code': 200, 'cache_status': 'cached', 'cached_at': '2020-03-05T19:33:05', 'extract_fields': {'Title': 'Opti-Wohnwelt | Möbelhaus & Onlineshop - Entdecken & Shoppen!'}}

2020-03-05 19:33:07 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <PUT http://cacheserver:5000/render/https://www.opti-wohnwelt.de/positionsstuhl-bilbao-172204-geflecht-halbrund-farbe-stone-grey-gest-alu-ca-68-109-70cm.html> (failed 1 times): 500 Internal Server Error

2020-03-05 19:33:09 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <PUT http://cacheserver:5000/render/https://www.opti-wohnwelt.de/stapelstuhl-saigon-199374-geflecht-flach-gestell-stahl-geflechtfarbe-coffee-ca-57-94-64cm.html> (failed 1 times): 500 Internal Server Error

2020-03-05 19:33:12 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <PUT http://cacheserver:5000/render/https://www.opti-wohnwelt.de/sitzbank-saigon-199372-geflecht-flach-gestell-stahl-geflechtfarbe-coffee-ca-115-94-64cm.html> (failed 1 times): 500 Internal Server Error

2020-03-05 19:33:14 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <PUT http://cacheserver:5000/render/https://www.opti-wohnwelt.de/esstisch-saigon-199370-geflecht-flach-gestell-stahl-geflechtfarbe-coffee-ca-150-75-90cm.html> (failed 1 times): 500 Internal Server Error

2020-03-05 19:33:15 [scrapy.extensions.logstats] INFO: Crawled 580 pages (at 15 pages/min), scraped 579 items (at 15 items/min)`

I will not flood the post with this, but after some additional retries it reports: 2020-03-05 19:34:15 [scrapy.extensions.logstats] INFO: Crawled 581 pages (at 0 pages/min), scraped 580 items (at 0 items/min)

As you can see, 15 pages per minute is not much, so the crawl should not be overloading the server.

Have you experienced something like that? I tried changing the pages in the sitemap, but got the same result.

lewisvoncken commented 2 years ago

@habitatcreative

In the latest experius/rendertron image we use a specific selector.

If you need additional selectors, let me know.

https://github.com/experius/rendertron/blob/docker/src/renderer.ts#L169

For now I will close the issue.

borey88 commented 2 years ago

@habitatcreative Hello. Did you find a solution for your second problem? I have the same problems. I fixed problem #1, but I'm stuck on issue #2. I believe the cause is in the Docker container, but I have no idea how to fix it. In my case Rendertron started caching pages with errors much earlier (after about 100 pages, I think). I would be grateful for a hint. Thanks in advance.