Website down when Tweakwise can't be reached

DerFrenk commented 3 years ago

Issue Brief

Recently we had a routing issue in our datacenter. Due to this issue, our Magento 2 installation couldn't connect to Tweakwise. This resulted in product detail pages that took over 2 minutes to load. Eventually it could have brought down the whole website, because of the amount of (long running) PHP processes.

Environment

PHP Version: 7.3
Magento Version: 2.3.6
Tweakwise Version: 3.2.0
Magento Deploy Mode: (production)
Third party modules: not relevant

Steps to reproduce

Install Magento and Tweakwise
Make sure your product detail pages contain an 'upsell' and a 'related' products block
Put a var_dump or some other debug in \Emico\Tweakwise\Model\Client::doRequest
Open a PDP, you'll probably see two var_dumps, because two requests were executed (1 for upsell, 1 for related)
Open Model/Config.php and set the REQUEST_TIMEOUT to 1. Set the SERVER_URL and FALLBACK_SERVER_URL to https://blackhole.webpagetest.org. This endpoint drops all packages it receives, so you can simulate an unreachable API.
Refresh the PDP. On my development environment I now see 8 debug statements instead of 2. On our production environment, even 14 API-calls were executed (!!) according to NewRelic. Since the timout is normally 5 seconds instead of 1, this is a serious issue.

afbeelding

As you can see, it took over 2 minutes to process this page request (due to routing issues, on the production env!). All Curl-calls were made by Tweakwise. I think the issue comes from module-catalog/view/frontend/templates/product/list/items.phtml. See the case 'upsell':. In if ($exist = count($block->getItemCollection()->getItems())) { a collection is retrieved. Tweakwise overrides this call with a plugin. This call fails, it'll try the fallback and also that one fails. The plugin won't modify the collection, so the 'normal' Magento is used, and the code will proceed inside the if-block. In that block it'll get the items again ($items = $block->getItemCollection()->getItems();). Also $limit = $block->getItemLimit('upsell'); triggers the two requests via a Tweakwise plugin.

The plugin tries to get the data (+does a fallback call) at each time the plugin gets called. This made all our PDP's unreachable.

Actual result

8 debug statements instead of 2 on my dev environment
14 API-calls instead of 2 on production environment

Expected result

If the Tweakwise endpoint and the fallback are unreachable, subsequent calls to the plugin in the same request should not do a call to Tweakwise. The orginal ItemCollection, or something like an empty array, should be used instead.
Even better would be that if Tweakwise and the fallback are unreachable, subsequent page requests won't even do a call to Tweakwise for some period.
Now there's only the default 5s timeout. Guzzle has an option connect_timeout (https://docs.guzzlephp.org/en/stable/request-options.html#connect-timeout). This option should be used and when connecting takes longer than something like 250ms or 500ms, the request should fail.

I hope I explained it well. If you need any more details, or want more detailed New Relic graphs, please let me know.

Frank

DerFrenk commented 2 years ago

The Tweakwise Navigator was temporary unreachable this evening and that caused some serious spikes on our webshops. Here's an example:

Multiple of our webshops were down for a couple of minutes because of the problem I described in this ticket.

remcoselles commented 2 years ago

Hi Frenk,

First of all, excuse us for the delayed response. We looked into this issue and there are a couple of things to say about this case.

Fallback to magento default

We currently do no support a failover to the magento default filtering solution in case of Tweakwise downtime. At Tweakwise we put a lot of effort in keeping Tweakwise online and prevent any downtime, so the chances of Tweakwise being down are very low. In this specific case not Tweakwise experienced downtime, but we've measured network connectivity issues at Hetzner Neurenburg, Germany. This datacenter is also hosting keessmit.nl. So even if we would have implemented the magento fallback in this specific case, likely keessmit would experience other connectivity issues as well.

High load generated by Tweakwise extension

I can see how the connectivity issues generated a cascading effect on the performance. But I don't think we figured out the root cause yet. Because when the fallback is triggered, it will remain active for some time. So after the connectivity issues would have been resolved, we should still see some traffic at our fallback endpoint (gateway.tweakwisenavigator.com). But we haven't seen any. So I think the fallback has never been active during the downtime. I think the Guzzle timeout triggered (5 seconds), but the IF statement returned false (if ($e instanceof ConnectException && !$this->endpointManager->isFallback()) {) because it was not caused by a connectexception but a regular timeout. So the fallback logic (based on connectivity issues) is not in sync with timeout logic (regular timeout). In that case your suggestion of replacing the Guzzle timeout with a connect_timeout would make sense, but then a failover would not be triggered on other kind of failures.

Anyway, we will take another look into failover logic improvements! If you have any other insights regarding these issues, please feel free to share.

DerFrenk commented 2 years ago

Hi Remco,

No problem! Thanks for your reply.

the chances of Tweakwise being down are very low

There is always a small change that things go wrong. Recently, a major customer of ours had the entire company (>1 billion yearly revenue) down because of a ransomware attack. When the whole Facebook network can be unreachable for 7 hours like two days ago, a similar downtime could happen to Tweakwise.

likely keessmit would experience other connectivity issues as well

That's right, we had routing issues in our datacenter. But the fact that we use third parties, doesn't mean we want to depend on it. An upsell or related products block should never make a PDP fail, since it's non critical content.

when the fallback is triggered, it will remain active for some time. So after the connectivity issues would have been resolved, we should still see some traffic at our fallback endpoint

We currently use version 3.2 on production which doesn't contain the new fallback logic (3.3). We didn't knew that some fallback logic was already built. During the routing issues we couldn't connect to the outside world, so we also couldn't connect to the fallback url. Our version always tries the primary URL first, and after network activity was restored it was able to connect to the primary URL, so the fallback was never actually visited.

because it was not caused by a connectexception but a regular timeout

Just updated my local version to 3.3.8 and tested it by setting the URLs in Config.php and EndpointManager.php to https://blackhole.webpagetest.org, I do get a ConnectException which triggers handleConnectException().

Every PDP still needs >10s to load, since it contains an upsell and a related products block. Ideally the plugin shouldn't even make an external call during a page request, instead something like a cronjob should be used to retrieve the data async. As long as that's not the case, at least RequestOptions::CONNECT_TIMEOUT => 0.5, would help I think.

===

I just discovered a big issue in the fallback mechanism. When the fallback URL is unreachable, doRequest ends in a loop until DOWN_PERIOD expires. In version 3.2 there was a variable $isFallback which prevented this, but this mechanism was removed in 3.3.0.

This code in the EndpointManager:

if ($downUntil && abs($now - $downUntil) < self::DOWN_PERIOD) {
    return;
}

$twPrimaryLastDown->setData('plain_value', $now + self::DOWN_PERIOD);

should be:

if ($downUntil && $now < $downUntil) {
    return;
}

$twPrimaryLastDown->setData('plain_value', $now + self::DOWN_PERIOD);

abs($now - $downUntil) < self::DOWN_PERIOD means 'down until $downUntil PLUS down_period'. When I open open my DB and set the __tw_primary_down_timer variable_value to the current unix timestamp, it takes 5 minutes to load a PDP. It will do hundreds of requests until $downUntil+DOWN_PERIOD.

It's bit tough to describe, so I hope this is clear. If not, please let me know and we can set up a call so I can show it.

remcoselles commented 2 years ago

Hi Frank,

We discussed your concerns internally and decided to move forward in the following way:

Stability regarding failover logic

You mention a big issue in the fallback mechanism, we want to take that seriously. Aside from the Magento extension we've implemented the fallback logic multiple times in multiple programming languages. To increase the stability of the failover mechanism, we will create a single failover best practice and rollout this mechanism in al our packages.

Fallback to Magento default

For now, we've decided not to implement a failover mechanism to fallback to Magento default behavior when both the primary and fallback endpoints are offline. We would expect this would introduce more problems than it would solve. If you would like to further discuss this matter feel free to contact me. We can then schedule a video call.

EmicoEcommerce / Magento2Tweakwise-archived