CVNRneuroimaging / infrastructure

Issue tracking, system documentation and configs for operations side of the neuroimaging core @ Atlanta VA CVNR / Emory University

Rama Problems/Space #163

Open simonero opened 8 years ago

simonero commented 8 years ago

@kmcgregor123456 & @rrmm ,

Few things. Help would be greatly appreciated.

  1. Rama is intermittently slow again.
  2. I ran out of space on localdatarama1. I noticed that Stephen still has half a TB on localdatarama1. I moved this, and a few of my folders, to localdatarama2.
     2a. Can we do something about Stephen's image of pano, or store it elsewhere?
     2b. Is localdatarama2 fixed? I want to know whether it's OK to use it exclusively going forward for one of my studies and to use localdatarama1 for my thesis.
  3. I keep getting memory errors while trying to debug the program I wrote. Is it possible that this is due to something with rama? I've been trying to find the source in my code, and it still could be my code, but I'm confused because my code seems fine and it was working perfectly before. The error is associated with a specific program that I execute serially many times. I also don't get the error for small datasets, and it only happens sometimes within a single run of my program (it works 20-80% of the time). I've been working on this for weeks now to no avail, so I figured it's time to ask for help. (Examples of the errors are below.)
     3a. If these errors are more likely code-based, any insight as to what generally could cause these problems would be appreciated and would help me figure out the source. The program I'm calling that produces this error is written in C++.
     3b. If I can test this script on another machine to see what happens, that might be helpful. It runs for days and takes up a lot of CPU/memory/temp space, so let me know.

Thank you!!

[Screenshots of the error messages.]

rrmm commented 8 years ago

@kmcgregor123456 & @rrmm ,

Few things. Help would be greatly appreciated.

  1. Rama is intermittently slow again.

Try running 'w' at the command line. There seem to be a lot of session logins; make sure you log out from any that are not in use and close any x2go desktop sessions that are not actively in use.
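For reference, a minimal sketch of what that might look like (pts/7 is a placeholder; use whatever the TTY column actually reports):

    # list all logged-in sessions; the IDLE column shows how long each has been inactive
    w

    # show only your own sessions
    w $USER

    # end one of your own idle sessions by terminating the shell attached to that terminal
    # (pts/7 is a placeholder taken from the TTY column)
    pkill -u "$USER" -t pts/7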

  2. I ran out of space on localdatarama1. I noticed that Stephen still has half a TB on localdatarama1. I moved this, and a few of my folders, to localdatarama2.
     2a. Can we do something about Stephen's image of pano, or store it elsewhere?
     2b. Is localdatarama2 fixed? I want to know whether it's OK to use it exclusively going forward for one of my studies and to use localdatarama1 for my thesis.

I think you should be able to use it, but get an answer from Keith directly.

  3. I keep getting memory errors while trying to debug the program I wrote. Is it possible that this is due to something with rama? I've been trying to find the source in my code, and it still could be my code, but I'm confused because my code seems fine and it was working perfectly before. The error is associated with a specific program that I execute serially many times. I also don't get the error for small datasets, and it only happens sometimes within a single run of my program (it works 20-80% of the time). I've been working on this for weeks now to no avail, so I figured it's time to ask for help. (Examples of the errors are below.)

It seems more likely that the errors are stemming from the program. If several other unrelated programs were dying at random, it might point towards a more general problem on the machine.

3a. If these errors are more likely code-based, any insight as to what generally could cause these problems would be appreciated and would help me figure out the source. The program I'm calling that produces this error is written in C++.

Generally:
  - double-freeing memory (deleting or freeing an already freed allocation)
  - using previously freed/deleted memory
  - corrupting memory by overwriting it due to a bug
  - using a pointer that doesn't point to anything valid (after freeing/deleting an allocation, make sure nothing is still holding a reference to it; also make sure the allocation actually succeeds)

If your program is actually running out of memory, some allocations will fail. If you are not catching those failures immediately, then when you try to use that memory the program will likely crash or corrupt other memory.

I would also monitor the RAM usage of the program during the run and ensure the program is acting as you expect.

3b. If I can test this script on another machine to see what happens, that might be helpful. It runs for days and takes up a lot of CPU/memory/temp space, so let me know.

You might try it on a system with less memory (in a virtual machine or whatever) and see if it crashes sooner. You might also check how available pano is.
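Another rough sketch, as an alternative to a separate machine: cap the memory available to the process from the shell that launches it, so allocation failures show up sooner (./myprogram and the limit are placeholders):

    # limit this shell's virtual memory to ~2 GB (ulimit -v takes kilobytes),
    # then run the program under that cap; do this in a throwaway shell
    ulimit -v 2097152
    ./myprogram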

simonero commented 8 years ago

@rrmm Thank you so much for all of this useful information!!!!! I may enquire about some more of your comments but as a starting point I've picked a couple to ask about:

Try running 'w' at the command line. There seem to be a lot of session logins; make sure you log out from any that are not in use and close any x2go desktop sessions that are not actively in use.

Wow, I didn't realize I had so many windows open. Most of them are idle but tracking progress on things that aren't going as smoothly as I had hoped, for example this program!!

I closed a bunch down, but it appears that there are more users/logins than I have windows, after excluding the "--init user" processes for Jonathan and me (I know nothing about them, but they sound very fundamental). Specifically, there are 33 "users", 29 terminal windows, and 2 "--init user"s... Does that mean something?

(Yeah, yeah, I'll close more down; I just need a chance to make a detailed record of what I've done first!)

I would also monitor the RAM usage of the program during the run and ensure the program is acting as you expect.

I've tried my very best to do this manually (literally sitting at my computer for a few hours watching the screen and checking the memory while I multitask on other work... I live an exciting life), but the lowest I've seen it go is ~500MB free. The error has also occurred when I have over 1.2GB free. Do you know of a way to record the memory usage at a set interval, so I can scroll through the output and look for memory spikes?

@kmcgregor123456, question #2 up top is still for you. Thanks in advance for your help. Also, how is pano lookin' these days? Would it be alright if I took her for a spin on PanTrack?

kmcgregor123456 commented 8 years ago

Go ahead and try on Pano. I'm just doing some clean-up analysis on it right now.

We have a contingency for Simone's data, so there are additional disks in the WMB for you to use. There's a 5TB disk that can be added to the dock; it will mount under /media.
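Once the disk is docked, a quick sanity check might look like this (the exact device name and mount point will depend on the disk):

    # list the block devices the kernel sees
    lsblk

    # show mounted filesystems and free space; the new disk should show up under /media
    df -h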




simonero commented 8 years ago

@kmcgregor123456

Additional disks to add to rama or pano? I don't think I need that much space for the pano test, and I can delete most of the intermediary files afterwards and then move the rest back to rama via "backup".

And where is the disk, physically? I can add it next time I am at the WMB.

Also, is this confirmation that rama is fixed?

rrmm commented 8 years ago

I closed a bunch down, but it appears that there are more users/logins

In the FROM column, :0 and :1 are logins from the console, i.e., from in front of the computer. The ones from :50 are probably an x2go session. Many of those have been idle for quite some time.

I would also monitor the RAM usage of the program during the run and ensure the program is acting as you expect.

gnome-system-monitor, htop, free -h seem like the way to go.
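For the set-interval recording asked about above, a minimal sketch (the log file name and the 30-second interval are arbitrary):

    # append a timestamped memory snapshot every 30 seconds; stop with Ctrl-C
    while true; do
        date >> memlog.txt
        free -h >> memlog.txt
        sleep 30
    done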

I would also check that the input files are valid, since the program uses information in the files to allocate memory.

simonero commented 8 years ago

@rrmm

In the FROM column, :0 and :1 are logins from the console, i.e., from in front of the computer. The ones from :50 are probably an x2go session. Many of those have been idle for quite some time.

I would also check that the input files are valid, since the program uses information in the files to allocate memory.

I figured out how to log out of things remotely with kill, and that already seems to help a bit. Woo!! But this time I got the error in a different way (a dialogue box), and there is still an error where the program says it can't find some files that do exist. The next step after memory is to create a script that will let me check the quality of the files (we're talking ~1,000,000+ files) in case something broke down elsewhere.
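As a rough starting point for that check (the data directory is a placeholder):

    # list zero-byte files under the (placeholder) data directory
    find /path/to/data -type f -size 0 > empty_files.txt

    # list files the current user cannot read
    find /path/to/data -type f ! -readable > unreadable_files.txt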

gnome-system-monitor, htop, free -h seem like the way to go.

I'd never heard of gnome-system-monitor before; glorious tool, thank you! I'm getting some errors when I run it, though, and I don't know their significance. I ran with 3&> to make a log of these errors, attached.

gnome.log.txt
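For reference, a typical way to capture those messages to a file (the warnings generally go to stderr; the file name is just an example):

    # send stderr to a log file
    gnome-system-monitor 2> gnome.log.txt

    # or capture both stdout and stderr
    gnome-system-monitor &> gnome.log.txt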

Also, gnome-system-monitor is not installed on pano. Can you please add this so I can use it during troubleshooting?

rrmm commented 8 years ago

gnome.log.txt

Those errors shouldn't matter.

Also, gnome-system-monitor is not installed on pano. Can you please add this so I can use it during troubleshooting?

It should be installed now.

simonero commented 8 years ago

@rrmm I can't seem to get on pano from X2GO. SSH from my OSX terminal is fine.

I've played with the settings, but here are the defaults (what I use for rama) and the error:

How do I get past this?

[Screenshots of the x2go settings and the error.]

rrmm commented 8 years ago

@rrmm I can't seem to get on pano from X2GO. SSH from my OSX terminal is fine.

I've played with the settings, but here are the defaults (what I use for rama) and the error:

How do I get past this?

You can probably just delete ~/.ssh/known_hosts (or remove the offending host key from the file) on the computer you are running the x2go client on.
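A sketch of the targeted fix, assuming the host is reached by the name pano (use whatever hostname or IP the x2go session points at):

    # remove the stale key for that host from ~/.ssh/known_hosts
    ssh-keygen -R pano

    # or open ~/.ssh/known_hosts in an editor and delete the offending line by hand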

simonero commented 8 years ago

Woohooo!! Thank you so much. I had a feeling this would be something incredibly simple that I would never figure out. Worked like a charm. =)

simonero commented 8 years ago

@rrmm @kmcgregor123456 I can't get on rama with x2go. The error I get is similar to one of the errors in the crash log for the program I can't get to work right:

The only thing different is that I shut down the x2go session to decrease the allotted memory to the "recommended" level, because I had it at max and it was causing my CPU/memory to be maxed out all the time. But, I can't imagine why that would be a problem for simply logging into rama.

Terminal SSH works. It gives the same notice about updates being needed, most of them security updates.

[Screenshot of the error.]