SeattleTestbed / softwareupdater

Software updater daemon
MIT License
1 stars 6 forks source link

Software updater CPU use at 100[pct] #9

Closed choksi81 closed 10 years ago

choksi81 commented 10 years ago

We had a bug report from Justin P. Rohrer today that said the following:

I've been running Seattle on my mac laptop since GEC 4 with no issues, until today when the softwareupdater.py script started using 100% of my CPU. If I kill it it just restarts after a few minutes and does the same thing. Do you know what could be causing this? Coincidentally I was also pushing Seattle out to the GpENI nodes today via planetlab slices, but I didn't touch the installation on my machine.

This is a critical problem which we need to address.

choksi81 commented 10 years ago

Author: jsamuel The user reported that the problem went away on its own overnight (they hadn't used stop_seattle.sh so the updater started up again when they weren't killing the process). They now report that they have version 0.1j (the new version that would have been being updated to).

justinc responded to the user and said one possibility is that it is just high CPU usage during the update. Testing locally on my laptop, it takes between 15 and 20 seconds of high CPU usage for signature checking and hash checking to be done, most of the time being in the hash checking (for each file listed in the metainfo, two hashes are taken: one of the downloaded file to make sure it matches the hash in the metainfo, and one of the local file to see if it is already the same as the downloaded file).

Of course, it's likely that on systems slower than my laptop that the update could take longer than 20 seconds. My laptop fan becomes loud for the period of the update, so in the case of a user with a laptop it is quite noticeable that something unusual is happening.

choksi81 commented 10 years ago

Author: justinc Let's call this an invalid ticket for now. If we find more information some other time, we can reopen this ticket.

choksi81 commented 10 years ago

Author: jsamuel Re-opening as the user ran into the same problem with the update to 0.1k.

I am now back to the earlier problem, where softwareupdater is using 100% of my CPU, only now it is running two instances, both trying to use 100% of one CPU. I let them run overnight, but had to kill them to use my laptop this morning. After killing both instances, one gets restarted every few minutes.

Having to run stop_seattle.sh to prevent the software updater from restarting is expected, but it is definitely not expected to have two instances running for more than a few minutes at a time (when the old instance starts a new one and waits for it to take over after an update of the software updater, or when the cron'd start_seattle.sh starts a new copy that then notices there is already a copy running).

It is also, of course, not expected that the CPU usage would be continuously high. Even a slow system should probably take no more than a few minutes to calculate hashes and check signatures.

We've asked this user to provide the software updater logs. The current logging done by the software updater may or may not provide the information we need (see #551).

choksi81 commented 10 years ago

Author: jsamuel The user provided their softwareupdater.old log, which showed two different periods of behavior. The logs start at 1247398801 (which was a few days before the 0.1k update was pushed) and then a switch to the normal log messages which appears to happen this morning, probably at the time the user killed both of the running software updater processes.

1247540401.24:PID-11752:I/O operation on closed file
1247541001.15:PID-11772:I/O operation on closed file
1247541601.75:PID-11795:I/O operation on closed file
1247542201.43:PID-11825:I/O operation on closed file
1247542801.26:PID-11848:I/O operation on closed file
1247543401.09:PID-11874:I/O operation on closed file
1247544001.72:PID-11905:I/O operation on closed file
1247544601.48:PID-11928:I/O operation on closed file
1247545201.12:PID-11954:I/O operation on closed file
1247545800.85:PID-11986:I/O operation on closed file
1247546401.55:PID-12010:I/O operation on closed file
1247547001.29:PID-12037:I/O operation on closed file
1247547658.83:PID-12066:I/O operation on closed file
1247547934.22:PID-12082:[match']('Timestamps)
1247548210.99:PID-12102:[match']('Timestamps)
1247548210.99:PID-12102:Another software updater old process (pid: 12082) is running
1247548813.39:PID-12130:[match']('Timestamps)
1247548813.39:PID-12130:Another software updater old process (pid: 12082) is running
1247549411.15:PID-12162:[match']('Timestamps)
1247549411.16:PID-12162:Another software updater old process (pid: 12082) is running
1247549918.6:PID-12082:[match']('Timestamps)
1247550011.24:PID-12188:[match']('Timestamps)
1247550011.26:PID-12188:Another software updater old process (pid: 12082) is running
1247550618.36:PID-12216:[match']('Timestamps)
1247550618.39:PID-12216:Another software updater old process (pid: 12082) is running
1247550972.0:PID-12082:[match']('Timestamps)
1247551210.94:PID-12243:[match']('Timestamps)
1247551210.94:PID-12243:Another software updater old process (pid: 12082) is running

I'm not sure what the cause of the "I/O operation on closed file" message is.

The reason that there are 'Timestamps match' messages right before messages about another process already running is that the init() function in softwareupdater.py does a dry run including a do_rsync() before checking if there is another process that already is running. In normal operation when there is not an update, this would mean a single file download and signature check (the metainfo) every time start_seattle.sh is run by cron.

choksi81 commented 10 years ago

Author: jsamuel The "I/O operation on closed file" was #550, already fixed.

choksi81 commented 10 years ago

Author: justinc What is the status of this? Do we have an update?

choksi81 commented 10 years ago

Author: jsamuel Justin Rohrer took the time at GEC5 to look at this with me and the problem couldn't be repeated. What I told him we'd do is warn him before we launch our next version so that he can keep an eye out for the issue, as the issue seems to come up during an update. If the problem doesn't happen with the next update, then we can close the ticket.

choksi81 commented 10 years ago

Author: jsamuel Haven't had any further reports of this. Closing ticket until/unless reported again.