Picolab / Manifold

An IoT application to manage your Pico Device Shadows
MIT License
4 stars 1 forks source link

Manifold silent crash #192

Open b1conrad opened 3 years ago

b1conrad commented 3 years ago

Recently, Manifold stopped working. It would log the owner in, but show none of her thing cards. Errors were thrown in the console log of the browser, but no other indication was given. The screen was indistinguishable from that of an owner who had not yet created any things.

It appears that an event (manifold:update_app_list) sent from Manifold was the culprit. The event stopped the pico engine for two minutes, as shown by this extract from the logs.

2021-05-19T15:16:00.363Z episode_start event/YGqfdevaDDtgU85acjzzKM/eid/manifold/update_version
2021-05-19T15:16:00.364Z event received: manifold/update_version
2021-05-19T15:16:00.364Z event added to pico queue: cjkvb40p701sghtlsrrput24q
2021-05-19T15:16:00.365Z event being processed
2021-05-19T15:16:00.366Z rule added to schedule: io.picolabs.manifold_pico -> updateManifoldVersion
2021-05-19T15:16:00.366Z rule added to schedule: io.picolabs.notifications -> updateManifoldAppList
2021-05-19T15:16:00.367Z rule selected: io.picolabs.manifold_pico -> updateManifoldVersion
2021-05-19T15:16:00.368Z klog needed []
2021-05-19T15:16:00.368Z not fired
2021-05-19T15:16:00.368Z rule selected: io.picolabs.notifications -> updateManifoldAppList
2021-05-19T15:16:00.415Z klog host "https://manifold.picolabs.io:9090"
2021-05-19T15:18:07.708Z ETIMEDOUT
2021-05-19T15:18:07.708Z episode_stop

The updateManifoldAppList rule is selected and after its first klog the pico-engine is busy for two minutes until a timeout wakes it up again. A couple of queries come in, but they are not processed until after the timeout (and that was too late, because the nginx gateway had already timed out by then).

A temporary patch was made by changing the io.picolabs.notifications ruleset to not select on that event. See the #manifold channel in our Pico Labs slack organization for details.

This situation needs to be investigated, the root cause identified, and the problem fixed properly.

b1conrad commented 3 years ago

It is unclear what the purpose of the updateManifoldAppList rule is. It loops over all of the things, and sends each one an event that will be selected by every application ruleset installed in it. Then the results of all this is cached in an entity variable which doesn't seem to be used.

windley commented 3 years ago

What raises the manifold:update_app_list? Cron?

b1conrad commented 3 years ago

The event is raised by the Manifold SPA here, called by React as soon as this component has been rendered into the DOM tree.

Ah, wait, I see the point of your question. The manifold:update_app_list should exhibit the same symptom, since is selects the rule which fails. And it appears to do so. When the list of applications is brought up when setting up notifications for an app in my thing, everything seems to work fine. Expect that I begin getting the error pop-up, and the console log shows the same kind of failure.

So, the problem is the rule fails during evaluation. Avoiding selecting that rule with manifold:update_app_list allows Manifold to present the thing cards. Getting the cards (picos) for this owner's things happens here so we would always select the problematic rule when getting those picos.

We need to understand why this rule fails in line 73, and fix that.

b1conrad commented 3 years ago

Another question is why did the failure begin to appear sometime after IIW (April 20-22) and before the certificate expired (May 16)?

I don't see evidence of code changes anywhere. The latest commit to the Manifold repo was March 5, when I changed the Google profile keys. Looking at the rulesets of the instance... There are the changes Beto and I made to the Manifold app io.picolabs.manifold_cloud_agent on April 21, but these helped the app, and shouldn't have impacted notifications. Another innocuous change was made to io.picolabs.notifications on May 13.

windley commented 3 years ago

Shouldn't line 73 be using wrangler:pico_query() so that it doesn't go over HTTP?

b1conrad commented 3 years ago

That's was my first reaction, but it is an event and not a query, that is being sent to one of the things. Also, the v0 pico engine doesn't have a primitive for calling a function in another ruleset without using HTTP.

b1conrad commented 3 years ago

Manifold sends an event and then a query in quick succession, to get the things. The logging for the event is shown in the first issue comment, and has a two minute gap. The logging for the query fits into that two minute gap:

2021-05-19T15:16:43.727Z episode_start query/YGqfdevaDDtgU85acjzzKM/io.picolabs.manifold_pico/getManifoldInfo
2021-05-19T15:16:43.727Z query received: io.picolabs.manifold_pico/getManifoldInfo
2021-05-19T15:16:43.728Z query added to pico queue: cjkvb40p701sghtlsrrput24q
2021-05-19T15:18:07.741Z {"things":{...},"communities":{}}
2021-05-19T15:18:07.741Z episode_stop

The query correctly puts together the information and returns it to the browser, but meanwhile nginx has timed out and the browser thinks there is a CORS error because of a missing header (which, of course, nginx has not provided).

windley commented 3 years ago

OK, so that begs the question why is it using http:get() instead of event:send()?

b1conrad commented 3 years ago

Because it is gathering all the directives that come in from all the rules (in all the rulesets installed in a thing pico) which select on the manifold:apps event. Perhaps this would be a good place to use the scatter-gather pattern instead. That is, instead of returning the results in directives, each rule would use event:send to supply the information.

windley commented 3 years ago

This is why event:send() in the Classic engine had a synchronous mode. You could add the attribute _async with value 0 to the attributes and then gather results (from a system:send_complete event)

As we discovered, having people using HTTP to query picos on the same engine often results in deadlock situations. That seems to be what's happening here, no?

But, as you say, using scatter-gather is probably the right solution since someone with lots of things is going to have a long wait for synchronous interactions to finish.