NAVADMC / ADSM

A simulation of disease spread in livestock populations. Includes detection and containment simulation.

How to handle out of memory #941

Closed missyschoenbaum closed 4 years ago

missyschoenbaum commented 5 years ago

I've finally done it. I've run out of memory. My PC gave me a separate message, so I had a clue that something was going to go wrong. However, the app keeps chugging along. It is throwing memory errors into the logs. Eventually it finished, and will even show some results.

Is there any way we can grab this error and let users know? What can I give you to help? I have logs, stackdump.

ConradSelig commented 5 years ago

The logs and the stackdump would be great. If they are too large to be uploaded here then can you put them on Google Drive?

missyschoenbaum commented 5 years ago

Do you want only the logs that failed? Some didn't.

missyschoenbaum commented 5 years ago

This is the stackdump, zipped: adsm_simulation.exe.zip

ConradSelig commented 5 years ago

If you have all the logs from the run where it ran out of memory that would be awesome.

missyschoenbaum commented 5 years ago

MemoryError_matchesStackdump.zip Bryan gave us these tools, which are coming in handy.

ConradSelig commented 5 years ago

First off, these logs are both incredibly useful and incredibly frustrating. The amount of information you can pack into ~2662 columns is unbelievable (actually totally believable, iteration1.log has ~105,033 values!). On the other hand, making sense of that data is very difficult when the information you are looking for is not in said data. There is no column titled "hasNoMemory" that tells you the exact point in the program that memory issues were encountered. However, I did find a few things that raised some red flags for me. @missyschoenbaum, maybe you can provide some insight for us on what this means to you? Here is one case:

"expcAIndSwine(L)" is a column that the output wiki describes as describing the cumulative exposed units (in this case Latent swine). In iteration1.log there is no activity in this column (meaning the count is 0) for day 1 and 2, but on day 3 the count jumps to 100,077 and by the last day the count is 101,300. Looks like you've either got a HUGE simulation (I suspect this because it is a memory issue) or multiple units are being exposed multiple times on the same day (unlikely, I'm not even sure if this is possible). In iteration2.log however, the count in this column remains 0 for days 1-10 before jumping up to 264 on day 11, 20,702 on day 12, and 21,011 from day 15-end. These numbers are VERY different. I would expect similarities between iterations, some difference yes, but this big a difference? I'm not the expert.

I'm not sure - using this data - that we could detect memory issues. We might be able to predict memory errors or even warn the user that memory errors could/might be occurring, but I'm not sure we would ever be able to be 100% certain. We MIGHT be able to probe the user's system for memory-usage related data, but I'm not sure we want our program to do that or if that sort of thing is even possible. Bryan would have to advise.

missyschoenbaum commented 5 years ago

I was trying to run a big outbreak, because I wanted to test #455, which works great on tiny outputs. I think now I am in a spot that even a small outbreak causes an out of memory. I can capture one of those instead if needed.

Also, on this big population I have about 20 farms that are bigger than 100,000 units, so that count you see just means we hit the jackpot and got one of them early on.

BryanHurst commented 5 years ago

This looks like a reappearance of the memory issue in #841

It is solidly in the CEngine and how it is handling arrays and can't easily be caught or handled in the frontend.

I'll think a bit on this one again.

BryanHurst commented 5 years ago

We should be able to watch the return code of the CEngine; if it crashed, run the Abort function. Maybe close application after.
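A minimal sketch of the return-code idea, assuming the engine is launched as a child process. The function names `run_engine`, `abort_run`, and `run_iteration` are hypothetical and are not taken from the ADSM code base; `abort_run` only stands in for whatever the real Abort function does.

```python
import subprocess
import sys


def run_engine(cmd):
    """Run one engine process and return its exit code and stderr text."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode, result.stderr


def abort_run(reason):
    # Hypothetical stand-in for the application's abort logic.
    return f"aborted: {reason}"


def run_iteration(cmd):
    """Run a single iteration and abort if the engine did not exit cleanly."""
    code, _stderr = run_engine(cmd)
    if code != 0:
        # A non-zero exit code is treated here as "the engine crashed".
        return abort_run(f"engine exited with code {code}")
    return "ok"


if __name__ == "__main__":
    # Simulate a crashing engine with a tiny Python child process.
    print(run_iteration([sys.executable, "-c", "import sys; sys.exit(3)"]))
```

Whether a memory failure in the real engine actually surfaces as a non-zero exit code would need to be confirmed against the engine itself.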

ConradSelig commented 5 years ago

Some good news and some bad news.

Bad news first: automatically aborting the scenario without the button press does not seem possible. This is due to how the abort button is interrupting the simulation and how it cannot be replicated in the same place the memory error interrupt needs to happen.

Good news: I can detect a memory error occurring and display this message on the results screen:

image

That entire red banner will only appear if a memory error occurred in the last run, meaning the results will need to be deleted and the scenario re-run before it will go away.

@missyschoenbaum if you are happy with this solution and the wording, go ahead and send to testing; otherwise we can make the changes needed.

missyschoenbaum commented 5 years ago

Unless @BryanHurst thinks of something better, this may be the best we can manage. Before I rework the error message, is this the array failing, or would something with more memory manage it? Please give easy explanation, as I am not good at this stuff. I know I am running an Intel i7 with 16GB of RAM, and 4 cores. I know the GIS lab has 32GB of RAM and 32 cores, so in theory I could go in there and run OK?

BryanHurst commented 5 years ago

@ConradSelig I can't do much testing on this because the migration for the new had_memory_error isn't committed.

However, I'm pretty sure there is a way to abort everything as soon as we detect the first memory error.

I'll poke at this one once the migrations are in the repo.

ConradSelig commented 5 years ago

Whoops, my bad. Here you go!

BryanHurst commented 5 years ago

@conrad I'm going to change the migration that you just pushed. Since the two of us are the only ones that have that migration yet, I'm just going to revert it. So either migrate your ADSMSettings back to 0009 or delete any scenarios that have that migration.

BryanHurst commented 5 years ago

@ConradSelig I added a new "crash_text" field to the SmSession model that can have a more detailed error specific message.

Can you take that, pass it into the results template file, and have some sort of "Click for more information" link in the big red error box that displays that text in a modal, popup, or hover text (whatever you see fit).

ConradSelig commented 5 years ago

@BryanHurst I like the changes, definitely more robust and flexible. Were you not encountering issues with simulation_status.json when aborting the simulation? Just want to make sure as that was what I was running into when trying to use the abort_simulation() function in simulation.py. If not, move to testing!

BryanHurst commented 5 years ago

No, the simulation_status returns properly. There is a large error dump of "broken pipes" when the simulation aborts, but I think this is fine.

There is a problem with the population map continually trying to calculate. I'll look into that before pushing again.

BryanHurst commented 5 years ago

@ConradSelig I've cleaned up my error handling portion of this, and we are now properly aborting the simulation at the first failed iteration.
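As an illustration only, stopping at the first failed iteration and recording a message for the banner could look like the sketch below. `SessionState` is a hypothetical stand-in that mirrors the `had_memory_error` and `crash_text` fields mentioned in this thread; it is not the real SmSession model, and a Python `MemoryError` is used here in place of whatever signal the engine actually gives.

```python
from dataclasses import dataclass


@dataclass
class SessionState:
    # Hypothetical stand-in for the fields discussed in this thread.
    had_memory_error: bool = False
    crash_text: str = ""


def run_iterations(iteration_fns, session):
    """Run iterations in order and stop at the first one that fails.

    Returns the number of iterations that completed successfully.
    """
    completed = 0
    for number, fn in enumerate(iteration_fns, start=1):
        try:
            fn()
        except MemoryError as exc:
            # Record the failure so the results page can show it, then stop.
            session.had_memory_error = True
            session.crash_text = f"Iteration {number} ran out of memory: {exc}"
            break
        completed += 1
    return completed
```

Later iterations are never started once a failure is recorded, which matches the "abort at the first failed iteration" behavior described above.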

The big error box isn't quite working for me, so I'll pass this one back to you.
Currently, the bar spans the whole screen and the text starts out centered. However, after the page does some more loading/rendering, the bar still spans the whole screen but the text is squished over to the right.
I also cannot click to open up the extra error message.

Let me know if you need a screenshot.

ConradSelig commented 5 years ago

The sizing of the error banner is intentional: sometimes when the simulation is aborted, the "simulation running" screen continues to display for longer than expected. The banner, however, should display on this screen much faster than the entire page update. This is so the user knows why the simulation is aborting.

Once the final results screen is rendered the text is moved over to the right so it is not hidden under the map that is generated. The banner continues to span across the entire page but is hidden under the map. I could compress down the width of the banner, but this would require another block of code - and I believe it is good programming to avoid mostly duplicate code.

I can see that the link is not working however, so I am going to go ahead and work on fixing that.

ConradSelig commented 5 years ago

@BryanHurst Give some thought to my explanation on why the banner is being displayed like it is. If you are OK with that move to testing - otherwise we can look at possible changes.

BryanHurst commented 5 years ago

@ConradSelig for me, the banner doesn't get covered up by the map (probably because I'm aborting early in my tests and no map is generated).

So it would be best to either make the banner always on top of everything or resize the banner over to the right so it doesn't look odd with a lot of blank red space once the text is moved over.

missyschoenbaum commented 5 years ago

Question here. This is what the performance looks like when running a scenario that fails on a machine with 32GB of memory. I understand that the app uses most of the CPU, because it grabs whatever it can and multi-threads it. I don't understand why memory is not used in excess. 5it This was with 5 iterations completed.

This is what it looks like when scenario runs long, but doesn't fail.

2ndround5done

missyschoenbaum commented 5 years ago

Also on banner, it says click here for more info. Nothing happens when you click. Mine keeps spinning until I return to inputs, and change delete and save changes.

missyschoenbaum commented 5 years ago

I got multiple images while I was on 32GB machine. Let me know if I need to post more.

ConradSelig commented 5 years ago

The memory usage is quite interesting... I've admittedly never watched task manager very closely while running simulations, I'll give it a go next time I'm writing code and see what my computer is doing.

Looks like we still have a missing popup error - hmmm. I'll look into it. For reference, the entire banner should be the button - so it's not like you're just not clicking on it or anything.

ConradSelig commented 5 years ago

@missyschoenbaum I am not having any issues clicking on the banner and getting the popup so I've added some stylizing to the banner to hopefully make it a little more clear what is going on on your end. Can you post a screenshot of your mouse hovering over the banner? Just want to sanity check both of us! Obviously no need if things start working.

ConradSelig commented 5 years ago

@BryanHurst Can you put this change in the next build for Missy?

ConradSelig commented 5 years ago

Here is a link to a video Missy recorded of her computer encountering a natural memory error.

https://photos.app.goo.gl/x3JsE5KChgPRN9NU8

ConradSelig commented 5 years ago

Scratch the screenshot request @missyschoenbaum - It doesn't look like any of the popups are working, which also means replacing the population in the current build is not possible. I'll come back to this once I have #959 fixed.

ConradSelig commented 5 years ago

@missyschoenbaum I don't really expect this change to work - but I'm at my wits' end for why this isn't working. I think it might be a Chromium issue, which means that fixes made as part of #961 could fix issues here as well.

I'm moving this to testing with the full expectation to be seeing it again, but I don't think I can make any more progress on it before confirming that it's not a viewer issue.

missyschoenbaum commented 4 years ago

OK, I am getting the finger like I can click anywhere, but click does nothing. Where are you thinking we should send them? @ConradSelig

ConradSelig commented 4 years ago

@missyschoenbaum I think this might actually be something to do with pop-ups in the viewer. Are you having issues with any other pop-ups in the program? Try the "Replace Population" button on the Population screen - if that one is not working then we know that the pop-ups are the problem and not the memory message.

ConradSelig commented 4 years ago

Looks like red boxes are also having their text get cut off at small view window sizes. Here is a screenshot: image

missyschoenbaum commented 4 years ago

@ConradSelig Replace population seems to work OK. It was visually OK, and the buttons also worked as expected. popupreplacepop

ConradSelig commented 4 years ago

I'm going to look at this again after #961 as that might change action.

ConradSelig commented 4 years ago

I found an inconsistency in the code. In addition to the new viewer application - this should be tested again.

missyschoenbaum commented 4 years ago

Will do.

missyschoenbaum commented 4 years ago

Managed to blow out memory today. Have run updates, so current. Still an issue that the red box doesn't respond at all; I have to close from the X. If I close CMD first, it doesn't close the front end.

BryanHurst commented 4 years ago

@ConradSelig Just a couple of things on this one still.

When the window is small, text in the red box is cut off.
image
Can this have scroll capability?

When you click on the red box to see more (which is working now), the popup is a little odd.
The top right close button is in a slightly weird spot (low priority) and the Close button looks like it isn't clickable though it is.
image

ConradSelig commented 4 years ago

@BryanHurst You said the button is hard to see, and I agree.

However, at least on my build, it also matches all of the other popups we have.

Do yours look different? Could I get a screenshot?

BryanHurst commented 4 years ago

@conrad, this is the latest code pulled and the two issues above still exist.

image

BryanHurst commented 4 years ago

I tweaked the button to make it look clickable.

As for the 'x' button, it is that way on all modals, so let's ignore it for now.

BryanHurst commented 4 years ago

Doing some testing in the Viewer, it looks like this is opening now.

missyschoenbaum commented 4 years ago

Yeah! Red box responded. 2 more tweaks, the error message says either add more memory or decrease size of scenario. So, as I understand it, this is an array size problem not a memory size problem, correct? Could we reword error to something like "To avoid this error in the future, decrease the spread in your scenario."

Also, could we make the Close button on the error close the app?

missyschoenbaum commented 4 years ago

RedError

ConradSelig commented 4 years ago

@missyschoenbaum You are sure you want the action of closing the app when they press close? We could also redirect them to the scenario creator and not the results page.

ConradSelig commented 4 years ago

I did some digging into both options - turns out closing the program altogether would be more difficult to do than I initially expected. If that really is the path you want to take, however, we can figure it out.

I did get it to work that the dialog will return the user to the Scenario Creator, note that this button works exactly like the "< Back to Inputs" button (seen in the very bottom left of the screen).

@missyschoenbaum thoughts on this? I'll wait to push my code until I hear back from you.

image

missyschoenbaum commented 4 years ago

@ConradSelig I agree back to Inputs is a good action. Go ahead with that.

ConradSelig commented 4 years ago

@missyschoenbaum What exactly do you want the simulation error pop-up's wording to be?

missyschoenbaum commented 4 years ago

"This scenario has exceeded the limits of ADSM. It will be necessary to modify your parameters to reduce disease spread to execute this scenario."

Sound reasonable? We can't guess what they need to change. @ConradSelig