szanati opened this issue 8 years ago (status: Open)
You will need to compare the memory after the ingest finishes. I don't know how large a file you can upload to Ripple, but that would be a good way to test without impacting production.
I suspect that when the ingest is finished the memory will go back to normal - a lot of space is going to be used while the package is being processed and sent to the archive. 5 GB uploads are pretty wild. :)
In the case of large downloads, Passenger Phusion was buffering the whole response and then sending it - very expensive on memory.
I have noticed that the load average goes up as well, but it comes back down to a normal level without having to restart DAITSS; the memory does not. I haven't let the memory stay at the high level too long because I am afraid it will become unresponsive. Yesterday I did leave it alone to see what would happen after the package causing the memory spike was archived. The memory was still at the same spiked level a while after the package was done, with only fixity running. I will try testing a small package on ripple, since we limited the GUI package upload limit to something like 1 MB or so.
Sounds like cruft in the upload directories. Restarting DAITSS should clear up some of that stuff. If you have had any uploads since your last restart, check whether anything is lying around that shouldn't be. Search for any of those package names in the DAITSS temporary folders. Look in /var/daitss/tmp.
There have been problems with submission clean up in the past. Here is an old but open ticket: https://github.com/daitss/core/issues/509
Now that you mention the /var/daitss/tmp directory, Xymon has also been complaining about that directory for the past day or two. It gives me: "Filesystems NOT ok &red /var/daitss/tmp (92% used) has reached the PANIC level (90%)". There are files named, for today for example, RackMultipart20151204-14530-1crr8c, and there seem to be a lot of these RackMultipart-type files going back to 2013. I never messed with that directory; it was always a developer who would touch it if needed. I did notice that after a while the usage would go back below the panic level by itself.
It can be dangerous to mess around with since it's used in the ingest process. A safe bet is to disable GUI submission, allow packages to finish ingesting, and then stop DAITSS and check the directory to see if it gets cleaned up.
If the usable space looks adequate I wouldn't mess with it any further. Otherwise, check for any rackfiles and package cruft and clean them up manually. I think that should be okay.
It could also be necessary to increase the available space of /var/daitss/tmp, especially in light of affiliates uploading such massive packages.
I knew there was a reason DAITSS operators didn't mess with that tmp directory. I have just left it alone and it's back down to 71% full. I guess it recovers by itself. If it continues I will get with a sysadmin to add more space. Thanks for the help.
DAITSS crashed sometime Sunday due to this problem. One of our affiliates had uploaded 4 packages via the GUI; they were between 5 and 7 GB each. DAITSS was brought back up on Monday. Memory started to spike again around noon today. There were only 2 packages left that were ingesting; they totaled 12 GB. Fixity was not running. I did notice the same affiliate was accessing the GUI just before that time.
I was looking at emails from October 2014 about a similar issue, but with downloads. Here are some of the emails (subject: "darchive web server bloat", 10/6/2014): "Is there any way we can limit the memory use on the web service that caused the server to halt on Friday? I assume that it was Passenger Phusion?"
"The process that shows the memory usage is httpd. I do not think we want to stop that process. Is there anything in the DAITSS logs that shows what the issue is?"
"I’m feeling less confident about this but the only thing I can see that may have caused the problem is an 11gb package download through the GUI from an affiliate. The user double clicked the download link at 3:33. Stephen restarted daitss at 3:47 – I believe because the service was locking up. I don’t think the double clicking would have caused the issue but I am noting it because it looks to have been handled by a separate process. I have seen this before.
Would it make sense to have apache self-destruct before it locks up? The worst case scenario is snafued packages which can be restarted."
"So, we did a test. We successfully downloaded a 1.5 GB file through the GUI with no problems yesterday. We tried to download a 11 GB file with no other packages running and it seems to have sent the memory on the server into a death spiral. We closed the file download process but it seemed to do no good.
I need to do some more research but do you have any thoughts on why a single resource request (albeit a large one) would cause such a negative response?"
"A quick note before I leave for the day. The route we use to send the file to the browser is very straight forward. But the problem is essentially just that. Any time we do a ‘send_file’ to the browser Rack sends the entire file in one go. There are ways to ‘stream’ the files – Passenger Phusion is supposed to help us do just that. I have a hunch something broke – or we failed to implement this properly.
Even if using Passenger Phusion we are supposed to hook into an Event Machine using some form of Streaming API. There are a handful to choose from. I don’t see us doing that sort of call anywhere in the code. This will take a good deal of investigation and testing."
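For reference, here is a minimal sketch of what a chunked download route could look like in a Sinatra-style app. The route path, package directory, and naming scheme below are hypothetical, not the actual core code, and under Passenger the chunks will still be held back unless response buffering is turned off (see further down):

```ruby
require 'sinatra'

PACKAGE_DIR = '/path/to/disseminated/packages'  # placeholder location
CHUNK_SIZE  = 64 * 1024                         # read 64 KB at a time

# Sketch only: stream a disseminated package to the browser in chunks
# instead of handing the whole file to send_file at once.
get '/packages/:id/download' do |id|
  # A real route would sanitize id; this naming scheme is hypothetical.
  path = File.join(PACKAGE_DIR, "#{id}.tar")
  halt 404 unless File.exist?(path)

  headers 'Content-Type'        => 'application/octet-stream',
          'Content-Disposition' => %(attachment; filename="#{File.basename(path)}")

  # Sinatra's stream helper hands each chunk to Rack as it is read, so the
  # whole file never has to sit in the Ruby process at once.
  stream do |out|
    File.open(path, 'rb') do |file|
      while (chunk = file.read(CHUNK_SIZE))
        out << chunk
      end
    end
  end
end
```

Whether the chunks actually reach the client incrementally still depends on the app server: under Thin the stream helper goes through EventMachine, and under Passenger it only helps if response buffering is disabled.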
"We just finished disseminating a 1 GB package on ripple. I am going to run some tests to see what kind of memory usage is needed for the package. Then I will begin experimenting with some code to stream the file download process and see how this impacts memory usage. In short – we are looking at a way to stream package downloads for disseminated packages through the GUI in order to reduce the load on memory and prevent rack and apache from killing the server due to large file downloads."
"I’ve done a lot of research into the problem and ultimately ended up having to run some basic tests on ripple versus my VM. We disseminated a package on ripple and noticed that every time we downloaded the package more and more memory would be used up. Eventually it would begin to eat into Swap memory and gradually eat up swap memory with each new download. My VM did not have this effect – but my VM is running web services under Thin. Hrmmm….
The attempted solution… https://www.phusionpassenger.com/documentation/Users%20guide%20Apache.html#PassengerBufferResponse
This is set to ‘on’ by default.
I set this to ‘off’ in core and voila – CPU cycles go up but memory does not. It idles at a lovely 0.1% of memory. I believe we’ve been buffering all of the user responses and never letting them go. There may be a way to clear the buffer manually from Ruby, but it would require taking advantage of the Passenger Phusion gem, which we do not currently use. Also, this would make DAITSS less web-server agnostic.
This brings up a good question – is this the root cause of all of our recent memory problems with daitss? During my recent tests I never saw the memory return to previous levels after having downloaded a package. So over a long period of time the server would run out of memory and crash leaving us scratching our heads. More recently we’ve seen affiliates downloading their own disseminations which can explain why the memory issues have been coming up a lot more frequently.
Let me know what you think. Stephen will run some benchmarks to see if this configuration has any impact on processing times."
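For context, the change described above is a single Apache directive in the Passenger configuration for core. A sketch of where it would live, assuming core is served from its own vhost (the layout here is assumed, not the actual darchive configuration):

```apache
# Sketch only: disable Phusion Passenger's response buffering for the
# core GUI so large downloads are not accumulated in memory by httpd.
<VirtualHost *:80>
    # ... existing core / DAITSS GUI Passenger configuration ...
    PassengerBufferResponse off
</VirtualHost>
```

The trade-off noted in the emails is more CPU per request, since data is handed to the client as it is produced instead of being dumped into a buffer.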
"Just for a bit of clarification – the documentation is a bit confusing and says it defaults to off. I am pretty sure this is wrong based on what I’ve tested. It could also be different for our version of phusion. "
"The documentation suggests this value is ‘off’ by default. Do you mean our configuration has it enabled?
I’ve also been reading on various methods to enable ruby process profiling, which is complicated by our mod_passenger setup. Do we want to look at enabling one of the profiling gems on ripple to see how it works with normal packages?"
"Well, we weren’t setting this option in the configuration file for core. But setting the option to off seems to have the desired effect. I wonder if it is getting set ‘on’ somewhere else or is just wrong in the documentation."
"Well – let me enable the value to ‘on’ explicitly and see if it has the negative effect."
"I set the value to on and am downloading the 1gb file. Look at httpd memory consumption. See how it jumped up? "
"As per discussion with Jonathan, I agree that it appears that mod_passenger is buffering by default on our system. Disabling it on ripple causes the processes to use much less memory, but more CPU time. This is going to be a trade-off that Jonathan and Stephen will benchmark before going forward.
Setting the option in the core config should only impact core (e.g., the GUI) and leave the other Ruby processes unaffected.
We may still want to look at improved profiling gems, setting per-process memory limits, and/or lowering the Apache MaxClients and MaxRequestsPerChild values, but not until the buffering-change benchmarks are complete."
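If those follow-up knobs are ever revisited, they are also plain Apache directives. The values below are placeholders to show the shape of the change, not benchmarked recommendations:

```apache
# Illustrative values only -- real numbers should come out of the benchmarks.
<IfModule prefork.c>
    MaxClients           64     # cap concurrent httpd workers
    MaxRequestsPerChild  500    # recycle workers so slow leaks cannot grow forever
</IfModule>

# Open-source Passenger can likewise recycle application processes after a
# fixed number of requests, which bounds per-process memory growth.
PassengerMaxRequests 1000
```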
Could the issue in the emails above be the cause of the current memory problem with uploads?
I don't think it's related. Passenger Phusion was improperly configured for dealing with resource requests (downloads) and was buffering and holding on to those responses over time. For example, an 11 GB package download resulted in an 11 GB temp file being created and left on the system.
I think what DAITSS is experiencing right now is a case of too much load on the system. To my knowledge there is no way to throttle how much a user uploads, other than limiting how large a package can be submitted through the GUI. You could lower that size limit, which would relieve the server somewhat; the limit is located in the daitss config, and one of the sysadmins should be able to help with that change. Unfortunately the end user will just upload more packages, but it may allow the server to catch up between each package.
What needs to be done would require a decent bit of development. The system needs to throttle how much data a particular user or organization can upload and process at a time.
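To make that concrete, here is a rough sketch of the kind of Rack middleware such throttling might start from. The class, the /submit path, the account lookup, and the limit are all hypothetical; nothing like this exists in core today:

```ruby
# Rough sketch of per-account upload throttling as Rack middleware.
# The /submit path, the limit, and the account lookup are hypothetical.
class UploadThrottle
  MAX_BYTES_IN_FLIGHT = 10 * 1024**3   # e.g. allow 10 GB in flight per account

  def initialize(app)
    @app       = app
    @in_flight = Hash.new(0)           # account => bytes currently being processed
    @lock      = Mutex.new
  end

  def call(env)
    return @app.call(env) unless upload?(env)

    account = env['REMOTE_USER'] || 'anonymous'
    size    = env['CONTENT_LENGTH'].to_i

    if over_limit?(account, size)
      return [503,
              { 'Content-Type' => 'text/plain', 'Retry-After' => '600' },
              ["Too much data in flight for this account, try again later\n"]]
    end

    begin
      @lock.synchronize { @in_flight[account] += size }
      @app.call(env)
    ensure
      @lock.synchronize { @in_flight[account] -= size }
    end
  end

  private

  def upload?(env)
    env['REQUEST_METHOD'] == 'POST' && env['PATH_INFO'].start_with?('/submit')
  end

  def over_limit?(account, size)
    @lock.synchronize { @in_flight[account] + size > MAX_BYTES_IN_FLIGHT }
  end
end
```

It would be mounted in front of the GUI (e.g. `use UploadThrottle` in the rackup file), and a real version would have to keep its state somewhere shared, such as the database, since each Passenger process gets its own copy of in-memory state.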
Thanks Jonathan. The quickest solution would be to have this particular affiliate start using the FTP again. They have had issues in the past with the FTP and also with the turnaround time, but we are working on trying to get them to use it again. The problem started about a month ago when they started sending us more files. The packages that I submit don't give me nearly the same memory issues as the GUI submissions of late.
The only other thing to do is continue to monitor the memory usage. Things to watch for: how much time elapsed between each of the affiliate's uploads? Any correlation between memory usage and the number of uploads? Did any package finish before the next one started? Were any FTP packages running along with the GUI submissions?
To answer the last question: I submit the FTP packages at the end of the day, well after the GUI packages were done uploading and almost done archiving. The only other thing running last week was fixity, but I had stopped it during one of the memory issues and the memory still stayed high.
I have noticed that there is a memory issue while affiliates are uploading packages via the GUI. It seems really noticeable when the package is 5 GB or greater. Xymon starts to send warnings, for example:
red Fri Dec 4 10:01:23 EST 2015 - Memory CRITICAL
Memory / Used / Total / Percentage
&green Physical 40025M 40217M 99%
&green Actual 22929M 40217M 57%
&red Swap 14659M 16383M 89%
At the time of the above alert no other packages were ingesting, though fixity was running. The memory seems to stay that way even after the package has finished uploading and is ingesting, and even yesterday after the package was archived. I have to stop and start DAITSS to get the memory back to normal. We had a similar issue with downloads that Jonathan fixed, but now it seems to be the other way around. It might have been happening all along, but most of the packages uploaded via the GUI have been well under 100 MB. For the past week or so one of our affiliates has been uploading packages of 5 GB or greater, which causes the memory issue above. It seems like maybe a process is sticking?