freedomofpress / securedrop-client

a Qt-based GUI for SecureDrop journalists 📰🗞️
GNU Affero General Public License v3.0
41 stars 40 forks source link

During reply download, seeing a lot of "Could not emit reply_download_failed: 'NoResultFound' object has no attribute 'uuid'" errors #1095

Open eloquence opened 4 years ago

eloquence commented 4 years ago

I encountered this during QA and I don't have clear steps to reproduce yet. What I've seen is basically the following:

  1. Populate Qubes 1.30 staging environment (server) with ~350 sources using qa_loader.py
  2. Sync using latest SecureDrop Client nightly
  3. While sync is running, perform some actions in both the Client and the Journalist Interface (delete, starring, replies)
  4. Towards the end of all messages/replies being decrypted, observed lots of log lines like the below in client.log, plus very high CPU usage for sd-app and temporary sluggishness of the Client.
  5. Ultimately, CPU usage and responsiveness return to normal.

Example log line:

securedrop_client.logic:705(on_reply_download_failure) ERROR: Could not emit reply_download_failed: 'NoResultFound' object has no attribute 'uuid

Before those log lines, I see groups of lines like this:

2020-05-22 14:52:22,239 - securedrop_client.queue:131(process) ERROR: NoResultFound: No row was found for one()

Environment type: Qubes staging Client version: 20200522 nightly Workstation version: 0.3.0rc1 Server version: Qubes staging 1.3.0

eloquence commented 4 years ago

I'm now doing the following to investigate:

1) Removed ~/.securedrop_client/data in sd-app 2) Removed ~/.securedrop_client/svs.sqlite in sd-app 3) Did not alter server contents 4) Did not attempt to reset sd-gpg keyring contents 5) Re-run client via its launcher script at DEBUG level 6) Not interacting with the client. 7) Monitoring logs

to see if the problem arises again at any point during the sync.

eloquence commented 4 years ago

Well, it must have either had something to do with the server state or with what I did during the sync, because on a fresh sync without interacting with the client, I'm not seeing those errors in the logs.

sssoleileraaa commented 4 years ago

Do you see more than one securedrop_client.queue:131(process) ERROR: NoResultFound: No row was found for one() log line per api job in the queue? Also I'm curious which jobs were the ones seeing this error. In the code it looks like we will only make an api call once if this is the error we get back, then we emit the failure signal and log the error - we don't retry a bunch of times for these kinds of errors.

This error message is a little confusing but basically it's saying that we can't tell the GUI that the reply failed because the reply no longer exists. So the reply widget will remain in the GUI until we see that the source no longer exists on the server (this will happen within a sync period).

securedrop_client.logic:705(on_reply_download_failure) ERROR: Could not emit reply_download_failed: 'NoResultFound' object has no attribute 'uuid

I still don't know why this would cause a CPU spike and make the client sluggish. If you have the full logs, that'll help, and we can try to repro next week.

redshiftzero commented 4 years ago

I think the following duplicate job scenario would describe the behavior:

  1. Sync starts, jobs corresponding to source A added to queue. They start processing but won't finish since there are so many sources.
  2. Sync happens again, more duplicate jobs added.
  3. Step 2 repeats every sync, adding more and more duplicate jobs until...
  4. User deletes source A. Any jobs that were enqueued for source A to download its replies and messages will now start failing (expected). Since lots of duplicated jobs can be added for each item to download, that would explain the many log messages.
eloquence commented 4 years ago

That reasoning makes sense to me; I can re-test the behavior with deletion during large syncs once #975 lands. Note that I didn't notice any user impact other than temporary sluggishness & CPU spike, which I'm not sure was related to this issue. I'll do another re-sync of the 350+ sources without doing anything else, just to observe the performance characteristics a bit more closely.

eloquence commented 4 years ago

Ran another 350+ sync, CPU for sd-app tends to be at ~5% during the message/reply download & decrypt operations, and spikes to ~20% after a sync (presumably when the sync results are processed by the client), and the client felt never sluggish during the entire operation. So, something -- the deletion or other operations -- in my previous run may have caused the sluggishness.

I'm curious if, in addition to the fix in #975, it may be worth removing jobs that have no prospect of succeeding (because a source has been deleted)? Both to avoid unnecessary network traffic, and any unwanted side effects from those jobs running.

redshiftzero commented 4 years ago

I agree, I think it is worthwhile, we don't have a way to cancel a job yet but I think that'd be a nice architectural improvement after #975

sssoleileraaa commented 4 years ago

if a source is deleted client-side, we already prevent any new jobs on that source. if a sync sees that something server-side has changed for that source (before the job+request to delete it finishes), then those jobs will be added to the main priority queue, after the deletion job. This is handled gracefully, but we could check before adding a job to the queue if the source exists locally first. if it doesn't exist, then we can assume the server just needs to catch up and can drop any new jobs for that source.

i think where we might see benefits from this architectural change is if the source is deleted server side. When the client sees that a source has been deleted during a sync, we could iterate through the queue to drop any jobs tied to that source. However a job could be in progress and in that case we'll still log an error.

redshiftzero commented 4 years ago

if a source is deleted client-side, we already prevent any new jobs on that source

true, but since any client-triggered deletion jobs will run before any already enqueued message or reply download jobs, even for deletion triggered locally we can run into the situation where there are pending message/reply downloads left around in the queue for deleted sources

sssoleileraaa commented 4 years ago

yup youre right - we could also see benefits from queue cleanup after a source is deleted locally in the case that you are deleting a source when there are a bunch of reply/message downloads in the queue for that source. it's great that we prevent new source jobs from being added, but any existing jobs in the queue will remain until they are processed (where they can decide to skip making the api request).

i just want to take a look back at https://github.com/freedomofpress/securedrop-client/pull/975 and how we can continue to make improvements in a similar way that keeps our architecture simple and coherent. the PR implements a way to check if a task is already queued up so that we don't enqueue it again. this was relevant because the client enqueues download jobs at the end of a metadata sync for any local db messages/replies where is_downloaded == False or is_decrypted == False or is_decrypted == None. before this PR change, we would have to wait until this criteria was no longer true before we knew not to create another download job. now we can know sooner because we'll see that a download job is already enqueued.

we did discuss at one point waiting until a metadata sync completes all sync tasks before enqueuing another metadata sync job, which would have also solved this issue. the path we're taking with PR 975 is more complicated but it allows us to continue to pick up any new source(s) and show that there are new encrypted messages to download.

the other benefit of PR 975 is that we can prevent the user from submitting a job more than once if they click on a button repeatedly. we currently don't have a need for this, but i could see it maybe coming up in the future if we added a refresh button or something like that.

so now we're talking about what to do about jobs that are no longer relevant in the queue. what makes a job no longer relevant? if we delete a source, then we know that any jobs in the queue that act on that source (e.g. starring a source) will gracefully fail and the gui will respond accordingly. another scenario might come up if say we allow deletion of messages and also have a "mark as read" feature or something like that. We could drop the "mark as read" job if the message is deleted before it is processed. but how useful is this queue-cleanup feature? the examples i've provided so far are not really an issue because processing them will quickly resolve things, and we can already avoid making api requests if a source/reply/message no longer exists. i think the benefit will only be seen for a very specific edge case: you are syncing the client for the first time and there are hundreds of message and reply download jobs enqueued for that source, and then the source is deleted. also we could prevent a bunch of failure signals from returning to the gui because the jobs would never be processed (we actually might want these failure signals to occur for any user-action jobs that were dropped though).

even though i think there is limited value in queue cleanup (please provide more examples if you think i'm wrong) there is an elegant way we can implement this by creating a new high-priority job, perhaps called QueueCleanup, that we enqueue whenever we delete a source from the database. That job could iterate through the queue and remove jobs associated with that source. This would also require that we keep source uuids on the download jobs as well as reply/file/message uuids.

zenmonkeykstop commented 1 month ago

This is still a potential issue.