Open DerFrenk opened 3 years ago
The Tweakwise Navigator was temporary unreachable this evening and that caused some serious spikes on our webshops. Here's an example:
Multiple of our webshops were down for a couple of minutes because of the problem I described in this ticket.
Hi Frenk,
First of all, excuse us for the delayed response. We looked into this issue and there are a couple of things to say about this case.
We currently do no support a failover to the magento default filtering solution in case of Tweakwise downtime. At Tweakwise we put a lot of effort in keeping Tweakwise online and prevent any downtime, so the chances of Tweakwise being down are very low. In this specific case not Tweakwise experienced downtime, but we've measured network connectivity issues at Hetzner Neurenburg, Germany. This datacenter is also hosting keessmit.nl. So even if we would have implemented the magento fallback in this specific case, likely keessmit would experience other connectivity issues as well.
I can see how the connectivity issues generated a cascading effect on the performance. But I don't think we figured out the root cause yet. Because when the fallback is triggered, it will remain active for some time. So after the connectivity issues would have been resolved, we should still see some traffic at our fallback endpoint (gateway.tweakwisenavigator.com). But we haven't seen any. So I think the fallback has never been active during the downtime. I think the Guzzle timeout triggered (5 seconds), but the IF statement returned false (if ($e instanceof ConnectException && !$this->endpointManager->isFallback()) {
) because it was not caused by a connectexception but a regular timeout.
So the fallback logic (based on connectivity issues) is not in sync with timeout logic (regular timeout).
In that case your suggestion of replacing the Guzzle timeout with a connect_timeout would make sense, but then a failover would not be triggered on other kind of failures.
Anyway, we will take another look into failover logic improvements! If you have any other insights regarding these issues, please feel free to share.
Hi Remco,
No problem! Thanks for your reply.
the chances of Tweakwise being down are very low
There is always a small change that things go wrong. Recently, a major customer of ours had the entire company (>1 billion yearly revenue) down because of a ransomware attack. When the whole Facebook network can be unreachable for 7 hours like two days ago, a similar downtime could happen to Tweakwise.
likely keessmit would experience other connectivity issues as well
That's right, we had routing issues in our datacenter. But the fact that we use third parties, doesn't mean we want to depend on it. An upsell or related products block should never make a PDP fail, since it's non critical content.
when the fallback is triggered, it will remain active for some time. So after the connectivity issues would have been resolved, we should still see some traffic at our fallback endpoint
We currently use version 3.2 on production which doesn't contain the new fallback logic (3.3). We didn't knew that some fallback logic was already built. During the routing issues we couldn't connect to the outside world, so we also couldn't connect to the fallback url. Our version always tries the primary URL first, and after network activity was restored it was able to connect to the primary URL, so the fallback was never actually visited.
because it was not caused by a connectexception but a regular timeout
Just updated my local version to 3.3.8 and tested it by setting the URLs in Config.php and EndpointManager.php to https://blackhole.webpagetest.org
, I do get a ConnectException which triggers handleConnectException()
.
Every PDP still needs >10s to load, since it contains an upsell and a related products block. Ideally the plugin shouldn't even make an external call during a page request, instead something like a cronjob should be used to retrieve the data async. As long as that's not the case, at least RequestOptions::CONNECT_TIMEOUT => 0.5,
would help I think.
===
I just discovered a big issue in the fallback mechanism. When the fallback URL is unreachable, doRequest ends in a loop until DOWN_PERIOD
expires. In version 3.2 there was a variable $isFallback which prevented this, but this mechanism was removed in 3.3.0.
This code in the EndpointManager:
if ($downUntil && abs($now - $downUntil) < self::DOWN_PERIOD) {
return;
}
$twPrimaryLastDown->setData('plain_value', $now + self::DOWN_PERIOD);
should be:
if ($downUntil && $now < $downUntil) {
return;
}
$twPrimaryLastDown->setData('plain_value', $now + self::DOWN_PERIOD);
abs($now - $downUntil) < self::DOWN_PERIOD
means 'down until $downUntil PLUS down_period'. When I open open my DB and set the __tw_primary_down_timer
variable_value to the current unix timestamp, it takes 5 minutes to load a PDP. It will do hundreds of requests until $downUntil+DOWN_PERIOD.
It's bit tough to describe, so I hope this is clear. If not, please let me know and we can set up a call so I can show it.
Hi Frank,
We discussed your concerns internally and decided to move forward in the following way:
You mention a big issue in the fallback mechanism, we want to take that seriously. Aside from the Magento extension we've implemented the fallback logic multiple times in multiple programming languages. To increase the stability of the failover mechanism, we will create a single failover best practice and rollout this mechanism in al our packages.
For now, we've decided not to implement a failover mechanism to fallback to Magento default behavior when both the primary and fallback endpoints are offline. We would expect this would introduce more problems than it would solve. If you would like to further discuss this matter feel free to contact me. We can then schedule a video call.
Issue Brief
Recently we had a routing issue in our datacenter. Due to this issue, our Magento 2 installation couldn't connect to Tweakwise. This resulted in product detail pages that took over 2 minutes to load. Eventually it could have brought down the whole website, because of the amount of (long running) PHP processes.
Environment
Steps to reproduce
\Emico\Tweakwise\Model\Client::doRequest
REQUEST_TIMEOUT
to1
. Set theSERVER_URL
andFALLBACK_SERVER_URL
tohttps://blackhole.webpagetest.org
. This endpoint drops all packages it receives, so you can simulate an unreachable API.As you can see, it took over 2 minutes to process this page request (due to routing issues, on the production env!). All Curl-calls were made by Tweakwise. I think the issue comes from module-catalog/view/frontend/templates/product/list/items.phtml. See the
case 'upsell':
. Inif ($exist = count($block->getItemCollection()->getItems())) {
a collection is retrieved. Tweakwise overrides this call with a plugin. This call fails, it'll try the fallback and also that one fails. The plugin won't modify the collection, so the 'normal' Magento is used, and the code will proceed inside the if-block. In that block it'll get the items again ($items = $block->getItemCollection()->getItems();
). Also$limit = $block->getItemLimit('upsell');
triggers the two requests via a Tweakwise plugin.The plugin tries to get the data (+does a fallback call) at each time the plugin gets called. This made all our PDP's unreachable.
Actual result
Expected result
connect_timeout
(https://docs.guzzlephp.org/en/stable/request-options.html#connect-timeout). This option should be used and when connecting takes longer than something like 250ms or 500ms, the request should fail.I hope I explained it well. If you need any more details, or want more detailed New Relic graphs, please let me know.
Frank