OctoPrint / OctoPrint

OctoPrint is the snappy web interface for your 3D printer!
https://octoprint.org
GNU Affero General Public License v3.0

HTTP server dying on multiple server instances #236

Closed tenaciousRas closed 11 years ago

tenaciousRas commented 11 years ago

Branch: master, Commit: b168ef7e7451bb12eea36bd8376bc7edd5243bc7

Terawatt is running multiple instances of the Octoprint server from one RPi in order to control multiple machines. Usually things work OK, but there have been random problems with the HTTP service dying on specific instances.

The Octoprint server stops responding to HTTP requests at apparently random times during sessions which consist of (at least): logging into one Octoprint server instance from a remote browser, connecting to the printer in Octo, then remaining connected and acting idle or by sending commands to the printer. The connection lost screen will appear, and the client cannot reconnect. Server logs provide no useful info.

When this happens:

.octoprint/logs/octoprint.log has little logging info. It shows this:

2013-08-21 13:03:32,181 - octoprint.server - INFO - Listening on http://0.0.0.0:5000
2013-08-21 13:03:35,973 - octoprint.server - INFO - New connection from client
2013-08-21 13:12:57,230 - octoprint.server - INFO - Listening on http://0.0.0.0:5000
2013-08-21 13:13:06,747 - octoprint.server - INFO - New connection from client

When serial.log is enabled on the server, the tail from the last occurrence of the issue looks like this:

2013-08-21 13:22:20,130 - SERIAL - DEBUG - Send: M105
2013-08-21 13:22:20,173 - SERIAL - DEBUG - Recv: ok T:62.5 /0.0 B:28.0 /0.0 @:0 B@:0
2013-08-21 13:22:22,195 - SERIAL - DEBUG - Send: M105
2013-08-21 13:22:22,262 - SERIAL - DEBUG - Recv: ok T:62.3 /0.0 B:28.5 /0.0 @:0 B@:0
2013-08-21 13:22:24,353 - SERIAL - DEBUG - Send: M105
2013-08-21 13:22:24,887 - SERIAL - DEBUG - Recv: ok T:62.0 /0.0 B:28.0 /0.0 @:0 B@:0
2013-08-21 13:22:27,832 - SERIAL - DEBUG - Send: M105
2013-08-21 13:22:30,125 - SERIAL - DEBUG - Recv: ok T:61.7 /0.0 B:28.2 /0.0 @:0 B@:0

We use commands like this to run multiple instances of the server:

# start the webui
$OCTOPRINT_HOME/run &
$OCTOPRINT_HOME/run --port=5001 --basedir=~/.octoprint5001 --pid=/tmp/octoprint5001.pid &
$OCTOPRINT_HOME/run --port=5002 --basedir=~/.octoprint5002 --pid=/tmp/octoprint5002.pid &

The PID parameter is there for running as a service - but sometimes we execute the same command (for a given instance) from the CLI to restart the server when this problem occurs.
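For completeness, a rough Python equivalent of that launch pattern would look something like the sketch below. This is just an illustration; the OCTOPRINT_HOME fallback, ports and paths are assumptions carried over from the commands above, not anything OctoPrint itself provides.

import os
import subprocess

OCTOPRINT_HOME = os.environ.get("OCTOPRINT_HOME", os.path.expanduser("~/OctoPrint"))

def start_instance(port):
    # one port, basedir and pid file per instance, mirroring the shell commands above
    basedir = os.path.expanduser("~/.octoprint%d" % port)
    return subprocess.Popen([
        os.path.join(OCTOPRINT_HOME, "run"),
        "--port=%d" % port,
        "--basedir=%s" % basedir,
        "--pid=/tmp/octoprint%d.pid" % port,
    ])

processes = [start_instance(port) for port in (5001, 5002)]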

foosel commented 11 years ago

My guess is that something makes the tornado I/O loop die. I recently (in the devel branch, as part of this commit) added some code that logs any errors causing that to octoprint.log. Could you maybe try whether you can reproduce the issue with the devel branch (see here for switching instructions)? We should then hopefully be able to get an idea of what is making the server crash. With three concurrent instances on a Pi I wouldn't rule out memory issues, to be honest, but let's save the guesswork until we have a stack trace :)
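Roughly, the idea is to wrap the start of the I/O loop in a guard along these lines. This is just a sketch of the concept, not the actual commit, and it assumes the IOLoop.instance() API tornado had at the time:

import logging
import tornado.ioloop

logger = logging.getLogger("octoprint.server")

# ... application setup and http server binding happen before this point ...
try:
    # blocks until the I/O loop exits, which it normally never should
    tornado.ioloop.IOLoop.instance().start()
except Exception:
    # whatever killed the loop ends up in octoprint.log instead of vanishing silently
    logger.exception("The tornado I/O loop died unexpectedly")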

foosel commented 11 years ago

Btw, thanks for the very comprehensive bug report! :+1:

tenaciousRas commented 11 years ago

Thanks for the prompt response. That's exactly what we were hoping for -- enhanced logging will hopefully get us a stack trace. I'll get our RPi setup switched to the devel branch shortly.

Running multiple instances seems fine. There are other issues to investigate, but this report is the most critical in my opinion. I can't say whether 2, 3, or more instances is the usable limit, either. A fresh OctoPrint instance looks like it takes just over 50MB on our RPi; during usage it probably takes 50-100MB. I haven't looked at the impact of the visualizer, and I don't know how much RAM is allocated to shared objects that could be used across multiple instances. There's also the whole socket.io framing when the client disconnects, as per the protocol spec. We'll see... like you said, we need a trace for this report.

tenaciousRas commented 11 years ago

OK - I installed OctoPrint-Devel.

I'm using the pmap -x command on the RPi to look at the memory footprint of the python/OctoPrint process for each instance. I'm now running either OctoPrint-Devel or OctoPrint (master) on port 5000, but each instance has its own basedir. I have OctoPrint (master) running on 5001, which is running an SD print. By the way, we're running a Marlin fork that is essentially identical to stock Marlin.

Here's what's up. I can reproduce the issue with OctoPrint (master). This is done by 1) starting the server on port 5000 with the OctoPrint (master) code and basedir; 2) opening a browser to the port 5000 server; 3) monitoring RAM usage on the RPi with "pmap -x". The server starts and keeps eating RAM. After a couple of minutes it approaches 100MB. After several minutes it approaches 175MB. A few minutes later the process dies and the server has vanished.
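For anyone repeating this, here is a small sketch of how that monitoring could be automated on Linux, reading VmRSS from /proc instead of running pmap -x by hand. The pid file path is hypothetical; substitute the pid of whichever instance you're watching.

import time

def rss_kb(pid):
    # resident set size of a process in kB, read from /proc (Linux only)
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return None

# hypothetical pid file path, as written by one of the launch commands above
with open("/tmp/octoprint5001.pid") as f:
    pid = int(f.read().strip())

while True:
    print("RSS: %s kB" % rss_kb(pid))
    time.sleep(60)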

Running the OctoPrint-Devel code on port 5000 with its own basedir doesn't do this. The basedir for my OctoPrint-Devel code was just copied from the original ".octoprint" folder, with log files deleted. The OctoPrint-Devel instance starts at ~52MB and stays there. When I connect to the printer and monitor its state "pmap -x" reports 58.6MB and stays there.

The server on port 5000 can crash while the server on port 5001 keeps running. Running an SD print on it also seems to keep its memory footprint stable(?). The reason the servers haven't been constantly crashing in the middle of prints probably comes down to other details of memory and process management on the RPi/Debian setup.

Could it be a memory leak on the server in "master" that's fixed in "devel"? By the way, how do you pull upstream from Cura now that you've rearranged the files? Not that you should pull from upstream more than rarely, but just wondering.

foosel commented 11 years ago

Hm... It could of course be a memory leak, however I'd be a bit surprised that master and devel behave so differently here, since the changes devel currently holds over master shouldn't actually influence printing. There were some memory footprint optimizations for the gcode interpreter though; it might be that your port 5000 basedir holds some extra large/complex gcode file that causes master to crash whereas devel can handle it just fine (since gcode analysis starts when the server starts but is always interrupted during prints, that might actually be a pretty fitting explanation).

We could test whether this is the cause with the following modifications to the code, if you feel comfortable fumbling around with that (btw, I couldn't test this yet and won't be able to until tonight, but it might help you mitigate the problem if I managed to get it right without a test run):

The idea is that the whole block of code should look something like this:

    #~~ analysis control

    def pauseAnalysis(self):
        # analysis is disabled for this test, so there is nothing to pause
        # self._metadataAnalyzer.pause()
        pass

    def resumeAnalysis(self):
        # likewise never resume it automatically (e.g. after a print finishes)
        # self._metadataAnalyzer.resume()
        pass

class MetadataAnalyzer:
    def __init__(self, getPathCallback, loadedCallback):
        self._logger = logging.getLogger(__name__)

        self._getPathCallback = getPathCallback
        self._loadedCallback = loadedCallback

        self._active = threading.Event()
        # left unset so the analyzer never starts working on server startup
        # self._active.set()

        self._currentFile = None
        self._currentProgress = None
This should prevent the gcode analysis from starting up on server start and also stop it from being automatically started when the printer is idle again after a print job.

tenaciousRas commented 11 years ago

I wrote my prior response after only limited usage. The OctoPrint-Devel branch's code crashes too. I'll check that out, since it could still be something environment-specific. All I can say right now is that there are two things I see with pmap -x that don't look right: a) what I described above; b) a server instance seems to run normally at around 130MB during a print, and I have yet to see its memory footprint shrink -- it always grows and sometimes levels off, then maybe grows further and levels again, or not. I'm not sure what's going on. I don't intend to derail this bug report into a discussion about programming languages or memory leaks, and I'm not a Python programmer, although it's just another (arcane) programming language. Still, do you think the material at links like http://www.lshift.net/blog/2008/11/14/tracing-python-memory-leaks and http://stackoverflow.com/questions/1435415/python-memory-leaks has any merit in relation to the behavior I've described?

foosel commented 11 years ago

I'll look into it. It would still be great if you could post any stack trace you come across in octoprint.log on the devel branch when the server crashes -- assuming it gets logged at all.

I'm btw currently printing on a somewhat older install (haven't yet come around to updating that specific installation) for ~4 hours now and memory usage has stayed quite consistently at 112MB. It might of course be something that got introduced between the last time I updated that installation and today; I'll update tomorrow and then check whether I can see the dreaded saw pattern. Also looking into profilers.

tenaciousRas commented 11 years ago

Sorry, I keep forgetting to mention that the logs still show no stacktrace, even after seeing the issue happen on OctoPrint-Devel as well. Is there a general log file for Python?

Yes, the pattern I reported two days ago was rather dreaded. I didn't try to delete any files (and edit metadata) from uploads/ in that situation, and of course I hadn't yet tried your suggestion to comment out the analyzer. It may have been analyzing a bunch of gcode and hitting a limit along the way rather than leaking.

I'm trying the code patch. It needs more tests. After initial runs I think this patch might run stably as long as I don't upload files to any instances; uploading a file to any instance causes problems. Is the gcode analyzer+parser invoked on a high-priority queue when a file is uploaded?

I don't see RAM just growing on OctoPrint-Devel like I did before. I think the footprint is generally stable. The analyzer and parser seem heavy. Do they load the entire file and the parsed gcode into the heap? I'm also wondering how long objects from the gcode analysis stay around in the app's lifecycle...

foosel commented 11 years ago

The gcode analyzer and parser are invoked in a regular-priority thread and stopped when a print is started. The gcode to be analyzed isn't loaded into the heap as a whole but processed line by line. Doing so stores the information gained from the processing in memory (stuff such as extrusion time and extruded filament length), but that's not much: something like 10 floats, a couple of integers and -- this is the only thing really depending on the gcode file itself -- a list of the unknown G and M codes encountered. Looking at the code I btw just realized a couple more things I could get rid of... In any case, this is the devel version we are talking about; master still contains a bit more irrelevant stuff that has since been removed.
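To illustrate why that kind of pass shouldn't need much memory, here's a stripped-down sketch of a line-by-line analysis that only keeps small aggregates. This is not OctoPrint's actual implementation; the field names and the set of "known" codes are made up for the example.

def analyze_gcode(path):
    # stream the file line by line and keep only a handful of aggregates,
    # so memory use does not grow with the size of the file
    move_commands = 0
    max_e = 0.0               # highest E value seen; rough proxy for extruded length
    unknown_codes = set()     # G/M codes the sketch doesn't recognize

    known = {"G0", "G1", "G28", "G90", "G91", "G92", "M105", "M109"}

    with open(path) as f:
        for raw in f:
            line = raw.split(";", 1)[0].strip()   # drop comments and whitespace
            if not line:
                continue
            tokens = line.split()
            code = tokens[0].upper()
            if code in ("G0", "G1"):
                move_commands += 1
                for token in tokens[1:]:
                    if token[:1].upper() == "E":
                        try:
                            max_e = max(max_e, float(token[1:]))
                        except ValueError:
                            pass
            elif code not in known:
                unknown_codes.add(code)

    return {"moves": move_commands,
            "max_e": max_e,
            "unknown_codes": sorted(unknown_codes)}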

Would you happen to have at hand a gcode file with which you can reliably crash the server by uploading it? I'd like to see whether I can reproduce the issue here and then trigger a couple of memory dumps to see what's happening.

My guess is the server crashes in such a way that it can't log any exceptions any more. You could maybe run it from the console (not with --daemon), perhaps inside a screen session; it might spit out something on stderr before it dies.

foosel commented 11 years ago

@tenaciousRas just to make sure, the problems didn't manifest while downloading a timelapse, right? Because I just fixed something there, see #193

tenaciousRas commented 11 years ago

It's been a couple of weeks since I was 'in the trench' with this issue. I'm sorry, I only just read your message requesting a gcode file to reproduce the issue -- gcode sliced from any of the STLs at https://github.com/iquizzle/uconduit should work. I'm often working with gcode that's 1-2MB or ~5MB.

I haven't pulled the update from #193 but I'll do that asap. Also I've run the app from the console without --daemon and there's no dump.

At this point I haven't been able to run multiple instances reliably, as the above issue always appears; so it's just one instance for now, and that's still great, since OctoPrint is slick! It would be far above the call of duty, but I'd be curious to know whether you were able to set up a test fixture and had difficulty reproducing the issue.

All I know is that the app clearly runs out of memory and bails. The memory footprint doesn't quite correspond with some of the things you said above. Loading a file causes the memory footprint to grow significantly with or without the viewer, and -- I'd have to get back into the details -- the footprint keeps growing through the print, which raises questions about garbage collection, or something...

In my humble opinion, the best way to troubleshoot this would be with a memory analysis tool. I thought the RPi would have a rather large heap and be able to use SD card space as swap, making out-of-memory a rare problem. I've looked at Python memory analysis tools and they don't look promising; maybe Dowser stands out?
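For what it's worth, a crude standard-library-only snapshot like the sketch below can at least show which object types keep growing between two points in time. It's no substitute for a real memory profiler and isn't tied to OctoPrint in any way.

import gc
from collections import Counter

def type_census():
    # count live objects per type; comparing two snapshots taken a few
    # minutes apart shows which types keep accumulating
    gc.collect()
    return Counter(type(obj).__name__ for obj in gc.get_objects())

before = type_census()
# ... let the server run for a while, upload a file, start a print, etc. ...
after = type_census()

growth = sorted(((name, after[name] - before[name]) for name in after),
                key=lambda item: item[1], reverse=True)
for name, delta in growth[:20]:
    print("%-30s %+d" % (name, delta))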

I expect to revisit this issue soon but don't have a timeframe. Let's close it for now and when we make further discoveries we can update its status if necessary.

foosel commented 11 years ago

I've tried dozer, but it isn't compatible with tornado. Then I managed to get RunSnakeRun to work and also added a REST resource to trigger memory dumps. But holy cow, those visualization tools are a mess (at least for someone coming from the Java world and used to something like YourKit). So after spending most of the weekend trying to get anything useful out of the dumps (but without being able to trigger the issue), I gave up until more information (or better tools...) is available.