Uninett / Howitz

1 stars 3 forks source link

[BUG]: Connection loss to Zino server is silently ignored on NTIE refresh #116

Closed podliashanyk closed 3 weeks ago

podliashanyk commented 1 month ago

Seems like connection loss to Zino server is not "picked up" by Howitz when requesting for NTIE updates. Further investigation is needed.

To reproduce:

  1. Run howitz, browse
  2. Turn off the internet where you run howitz
  3. Wait at least longer than refresh_interval
  4. There is nothing that complains about the connection being gone

Curitz will eventually quit with the error message "Connection lost with the zino server...". Howitz does nothing.

hmpf commented 4 weeks ago

The code in zinolib.ritz.notifier.poll either:

None is returned if an OSError with errno.EAGAIN or errno.EWOULDBLOCK was raised while there was no newline in notifier._buff. EAGAIN/EWOULDBLOCK on linux means "temporary failure, please feel free to try again".

This code never raises RetryError.

hmpf commented 4 weeks ago

The code in zinolib.controllers.zino1.UpdateHandler.get_event_update calls zinolib.ritz.notifier.poll.

It does not attempt to catch any exceptions.

If .poll returns None (after OSError errno.EAGAIN or errno.EWOULDBLOCK) it returns False, otherwise it handles the update.

The actual handler either returns the id of an updated event or None if it's a new, unfinished event or False if it was an unknown/unhandled update type.

hmpf commented 4 weeks ago

Howitz's howitz.endpoints.update_events runs zinolib.controllers.zino1.UpdateHandler.get_event_update in a while-loop.

It catches RetryError, which will never be raised.

If the event updater returns None, it breaks the while loop. This is what should trigger a new attempt and is the reason I suspect for the sudden loss of updates.

We should try:

if updated:
    updated_ids.add(updated)

instead of

if not updated:                                                                                                                                                                    
    break                                                               
updated_ids.add(updated) 

and stop catching RetryError.

hmpf commented 4 weeks ago

Some polish might be to have a counter for how many times we have seen updated == None and raise a more loud error/check the connection if it is above the magic number. On every successful update the counter is set to zero.

hmpf commented 4 weeks ago

Also, might be smart to have time.sleep(1) if updated == None.

hmpf commented 4 weeks ago

I'll make a patch.

hmpf commented 3 weeks ago

Curitz aborts because it attempts to receive from the API socket and the API socket times out. It does not fail because the NTIE socket dies.

hmpf commented 3 weeks ago

This is fixed in zinolib 1.2.0, that is: there is a new method zinolib.controllers.zino1.Zino1EventManager.test_connection() that will raise a TimeoutError if the server is not reachable. Howitz will have to run the method and do something with the timeout as per #115.

Curitz checks the connection every 60 seconds, this is hard coded.

hmpf commented 3 weeks ago

As per previous comment, closing this. When to run the test-method belongs in a separate issue.

hmpf commented 3 weeks ago

See also Uninett/zinolib#63