CVNRneuroimaging / infrastructure

Issue tracking, system documentation and configs for operations side of the neuroimaging core @ Atlanta VA CVNR / Emory University

Update on Rama+Software Issues #170

Closed simonero closed 7 years ago

simonero commented 8 years ago

@rrmm

I am still in the process of troubleshooting but it felt like a good time for an update because I've done a good bit of processing and things have changed a bit.

(1) I have not yet been able to replicate the memory errors I was getting in the terminal prior. I have run my program on Rama 7 times (would be 9 but Rama crashed during 2 of them on Fri) and have done this both with nothing else running and with 2-3 instances running at once.

(2) I am reliably getting an error processing 6 specific files (out of 20K+) with my problem program (track_network). It appears the output failures for every iteration are identical and occur at the exact same time in processing. Absolutely no idea why this is happening. I created a script to process one of these files with track_network (with the same input/output files used when the failure occurs) 20000 times serially (to replicate what my program does and how many times it does it) and ran it several times, in both for and while loops. Worked perfectly every time. Very odd...

(3) I wrote a few changes to troubleshoot this on a code level now that the error is different/reliable. I haven't had a chance to implement it yet, but if the error persists I'll let you know and give you the information you need to troubleshoot yourself like we discussed.

(4) Pano is taking FOREVER. My program is not even close to getting to the c++ program that crashes. Have made some efficiency edits that shouldn't affect processing (unless I forgot a syntax character in which case it will just break). However.... prior to this, with the version I've been running on Rama, at least twice processing randomly halted for no reason and with no error. Not replicable on re-run of the exact same command, happened at different times...

(5) Rama is crashing programs excessively. Almost every time I am logged in it happens at least once. Specifically, track_network and fsl programs (e.g. applywarp, flirt, fnirt, fslmaths). I thought all of these would save crash logs so I didn't record/prntscrn, but I only see a few in /var/crash. The errors have SIGSEG and int_free() terminology in the initial report or SIGABRT(sp?). However, my output files are always produced/fine as far as I can tell, and typically I can observe that the program had already finished running. E.g., 2min after running a script that calls applywarp I'll get a crash while I'm in a different tab doing something unrelated. I've had these pop up immediately upon log-in as well.

Sorry to write a novel. Thoughts on #s (4-5)?

rrmm commented 8 years ago

> (5) Rama is crashing programs excessively. Almost every time I am logged in it happens at least once. Specifically, track_network and fsl programs (e.g. applywarp, flirt, fnirt, fslmaths). I thought all of these would save crash logs so I didn't record/prntscrn, but I only see a few in /var/crash. The errors have SIGSEG and int_free() terminology in the initial report or SIGABRT(sp?). However, my output files are always produced/fine as far as I can tell, and typically I can observe that the program had already finished running. E.g., 2min after running a script that calls applywarp I'll get a crash while I'm in a different tab doing something unrelated. I've had these pop up immediately upon log-in as well.

fsl seems to be dying in read_volumeROI and read_volume4DROI in libnewimage. applywarp is a slightly different problem.

One thing I would try is adding 'sleep 2' or something similar in the script to slow down successive calls to the program. I've noticed this helped at times when dealing with NFS issues.

rob
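The 'sleep 2' suggestion above could look roughly like the sketch below. It assumes the script loops over input files; `run_step`, `process_all`, and `PAUSE_SECONDS` are hypothetical names used only for illustration, and `run_step` stands in for whatever FSL or track_network call the real script makes.

```shell
# Hypothetical stand-in for one processing call (e.g. applywarp or track_network).
run_step() {
    echo "processing $1"
}

# Assumed knob: seconds to pause between calls; 2 matches the 'sleep 2' suggestion.
PAUSE_SECONDS="${PAUSE_SECONDS:-2}"

# Run the step on each argument, pausing after every call so that
# successive invocations do not hit the filesystem back to back.
process_all() {
    for f in "$@"; do
        run_step "$f" || return 1
        sleep "$PAUSE_SECONDS"
    done
}
```

Calling `process_all a.nii.gz b.nii.gz` would then run the step on each file with a pause in between. Whether a pause actually helps here is untested; it is only a cheap thing to try.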

simonero commented 8 years ago

What would the cause of those fsl errors be? Is there any potential corruption in my files? I was having some ITKSNAP issues that I could imagine would be associated if so. And these volumes are also used for the processing I'm troubleshooting, though they are "fslmaths -roi"-ed intensely prior.

simonero commented 8 years ago

@rrmm Next update installment: processing on pano, although laboriously long, went off without a hitch as far as I can tell. Need to double check that I don't get the same results on rama, as I made some necessary changes for efficiency. I'm pretty sure that I've run this version already, or else I would stall on the update, but I ran a lot of things on rama while waiting on this and never had a fully clean and expected result. Will update again once this has run clean on rama. But this is pointing towards rama as possibly playing a role in the problem. Will hopefully be able to confirm tomorrow - I'm running this from start to finish so it will take a while.

rrmm commented 8 years ago

> What would the cause of those fsl errors be? Is there any potential corruption in my files? I was having some ITKSNAP issues that I could imagine would be associated if so. And these volumes are also used for the processing I'm troubleshooting, though they are "fslmaths -roi"-ed intensely prior.

It definitely points towards possible issues with the files. Especially if you are subdividing them recursively/automatically, you run the risk of getting ROIs that are degenerate in some way (e.g., do not enclose an actual volume, are too close to machine precision, have negative distances, etc.).

simonero commented 8 years ago

@rrmm Ok, so this is what has and hasn't been working.

On rama, I ran the exact same version from pano and it produced identical results, seemingly with no error. On rama, I ran the previous version (before I made edits for pano so it could speed up and finish) from start to finish and the exact same files corrupted as the many runs prior.

Next, on rama, I ran the pano version and removed "wait" from before the c++ scripts in the two steps that precede the one that crashes. No error.

Just now, I ran the exact same scripts but only removed "wait" from one portion. An error occurred, well before this became relevant, during an fsl command. I have never seen this before.

I got a nifti read error from fslstats. In this part of the script, I run these commands and echo each to the terminal on verbose:

   xminVox1=`fslstats "${roi1}" -w | cut -f 1 -d ' '`
   xmaxVox1=`echo "${xminVox1}+${xsizeVox1}-1" | bc`
   yminVox1=`fslstats "${roi1}" -w | cut -f 3 -d ' '`
   ymaxVox1=`echo "${yminVox1}+${ysizeVox1}-1" | bc`
   zminVox1=`fslstats "${roi1}" -w | cut -f 5 -d ' '`
   zmaxVox1=`echo "${zminVox1}+${zsizeVox1}-1" | bc`

After the first two, for just one file, it stopped working. Fine for the next ROI file this ran on. Manual testing does not create an error. Here is a screenshot of my output with errors for ROI1 and no errors for ROI2:

[screenshot: terminal output with errors for ROI1 and no errors for ROI2]

3dinfo, fslinfo, and fslhd also work fine on this file during manual check.

Is this still a file problem and not FSL?

If looking at the file would help, here is the path to the exact file this occurred with:

/data/localdatarama1/05.PanTrack_Diagnostic/Take12_fromstartwithROIs_panoProgramFiles_reverting2/ROIs/Lthalamus/Lthalamus.nii.gz

And here is the file that did NOT cause an error:

/data/localdatarama1/05.PanTrack_Diagnostic/Take12_fromstartwithROIs_panoProgramFiles_reverting2/ROIs/LPTr/LPTr.nii.gz
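The snippet earlier in this comment calls `fslstats ... -w` three times on the same ROI file. One way to cut down on repeated reads is to call it once and split the result. The sketch below assumes that `-w` prints whitespace-separated integers in the order xmin xsize ymin ysize zmin zsize tmin tsize; `bbox_from_stats` is a hypothetical helper, not part of FSL, and this would reduce file reads rather than explain the read error.

```shell
# Turn one line of `fslstats <roi> -w` output into "xmin xmax ymin ymax zmin zmax".
# Assumed input order: xmin xsize ymin ysize zmin zsize tmin tsize (integers).
bbox_from_stats() {
    # Split the single string argument on whitespace into $1..$N.
    set -- $1
    if [ "$#" -lt 6 ]; then
        echo "unexpected fslstats output" >&2
        return 1
    fi
    # max = min + size - 1, the same arithmetic as the bc calls in the script.
    echo "$1 $(($1 + $2 - 1)) $3 $(($3 + $4 - 1)) $5 $(($5 + $6 - 1))"
}
```

In the script this might be used as `bbox_from_stats "$(fslstats "${roi1}" -w)"`, which also gives a single place to detect an empty or short result instead of three.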

simonero commented 8 years ago

Still having issues with FSL. While working on something other than troubleshooting my program I noticed that flirt & applywarp crashed several minutes after I accidentally ran a script that calls FSL several times prior to creating the intermediary files I needed (i.e. lots of expected errors). This script finished on its own without me interrupting it. When the FSL programs crashed Rama became suddenly and noticeably slower. Still having problems with slowness despite more than adequate free memory (4-5G). The 2 crashes only account for 2/9 of the FSL script usage in the script I tried to run. Other processes not listed on pstree.

edit: it's been several hours and the lag is still really bad. I'm doing a lot but I've done these processes before without similar issues. It took me an hour to move a few nifti files onto my local machine. I get a lag just from typing. I can barely work.

simonero commented 8 years ago

I recently had a version of my program working on Rama+Pano. It spontaneously stopped working on Rama, and once again is failing on specific files. When this occurs, I get a window telling me that the c++ program "track_network" has crashed. The only change made to my code that affects processing up to this step is purely aesthetic (I replaced a string that occurs in all of my verbose messages).

@rrmm, should I send you the path to the last failed & successful attempts, and the info you need to run it yourself? Or, should we first meet briefly on Thursday to discuss the troubleshooting I've already done since we last spoke?

rrmm commented 8 years ago

> I recently had a version of my program working on Rama+Pano. It spontaneously stopped working on Rama, and once again is failing on specific files. When this occurs, I get a window telling me that the c++ program "track_network" has crashed. The only change made to my code that affects processing up to this step is purely aesthetic (I replaced a string that occurs in all of my verbose messages).
>
>   • To troubleshoot, I moved the latest version that is reliably working on Pano thus far to Rama (functionally identical through the step that fails). The error is still occurring, even though it does not occur on Pano.

There can be bugs in the program that may not cause fatal errors. Changing unrelated parts of the program can cause them to become fatal (due to changing memory layouts, changes in stack layout, access patterns, race conditions, etc).

Also, things you might assume to be inconsequential changes may not be. Either directly or because they change memory layout. Changing hosts may also trigger changes in memory layouts (different library versions, address randomization, etc).

I would look at the commonalities of the files which trigger crashes. I might also go as far as to programmatically verify the files are complete and valid before using them in the next step.
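A minimal sketch of such a pre-check for `.nii.gz` inputs, assuming it is enough to confirm that the file exists, is non-empty, and passes gzip's own integrity test. `check_nifti_gz` is a hypothetical helper name; it validates only the gzip container, not the NIfTI header or the ROI geometry.

```shell
# Return 0 only if the file exists, is non-empty, and is a complete gzip stream.
check_nifti_gz() {
    f="$1"
    if [ ! -s "$f" ]; then
        echo "missing or empty: $f" >&2
        return 1
    fi
    # gzip -t decompresses to nowhere and verifies the stream's checksum.
    if ! gzip -t "$f" 2>/dev/null; then
        echo "gzip integrity check failed: $f" >&2
        return 1
    fi
    return 0
}
```

A pipeline step could then be guarded with something like `check_nifti_gz "$roi" || exit 1`, so a truncated or half-written file stops the run before the next program reads it.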

simonero commented 8 years ago

@rrmm

Sorry for the delay. Had to leave town for a family emergency.

It seems that the program currently works fine on pano & an OSX computer at GSU but not rama. Which is interesting given pano once mimicked the error, before I continued to try to address it. And given this version did briefly work on rama before it suddenly stopped.

The OSX version I used was identical at the step that usually fails, but in an earlier step I changed "-ne" to "=" in 2 if statements, and removed xargs -P 0 from a couple statements that create a text file with a list (for cross-platform compatibility), but otherwise it was identical.
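For anyone comparing the two versions: in a `[ ]` test, `-ne` is numeric inequality and `=` is string equality, so swapping one for the other changes the condition unless the branches are changed too. The actual if statements are not shown in this thread, so the sketch below uses a hypothetical variable `count` purely to illustrate the difference.

```shell
# Hypothetical value; the real variables from the script are not shown in the thread.
count=3

# -ne: numeric "not equal". True here because 3 differs from 0.
if [ "$count" -ne 0 ]; then ne_result="true"; else ne_result="false"; fi

# =: string "equal". False here because "3" is not the string "0".
if [ "$count" = 0 ]; then eq_result="true"; else eq_result="false"; fi

echo "-ne: $ne_result, =: $eq_result"
```

If the edited if statements kept the same branches, this kind of change would be a functional difference between the OSX version and the rama version rather than a cosmetic one.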

I have already written scripts that vet the track files systematically prior to the step that crashes by reading the headers from each. I have also looked at the integrity of the ROI files used as input in the steps that fail and they seem to be fine, as far as I am capable of troubleshooting. I've also manually run the step that fails on a set of files that is always involved in the crash, and it works no problem.

Based on what you've said I think the next step is to re-install the c++ program on rama and see what happens. I'll update after this is complete. Also, Bruce and I discussed eventually having you look at the c++ source code if my troubleshooting continues to be fruitless, as you are much more qualified than I am to analyze these.

simonero commented 8 years ago

I have encountered a very frustrating and confusing problem. I am unable to re-install the c++ software suite. I have no idea why. User error cannot be ruled out, but I don't understand what I might be doing differently than the times I've installed this program before.

I have tried using the last set of files (for the suite TrackTools) that I downloaded on my computer and also the first. I have downloaded the TrackTools files (prior to making/building) on both my local Macbook and directly onto Rama itself. I am unable to install this on both my own computer and on Rama. I have also tried deviating from the installation instructions, on Rama. I am told to "make realclean" then "make". If I simply "make" I end up with OSX compatible files, instead of Linux compatible.

@rrmm, could I please point you towards this software and ask you to try installing it, so that if this is user error we can figure out what I'm doing wrong, or if it is not so you can play a role in troubleshooting it moving forward? I think this is beyond what I am qualified/able to figure out. It doesn't matter where you install the files as long as they are somewhere on Rama - if it works we can always move them into the folder I keep them in. The installation instructions are in a file named "README" once you extract the folders.

The files can be downloaded here: http://marecilab.mbi.ufl.edu/software/TrackTools/

A zip file of the first set of files I downloaded locally (on Jan 8, 2015) is here: @rama: /opt/PanTrack/supporting_software/TrackTools/TrackTools.zip

A log of what I tried to do in my last attempts, which includes the errors, is here: 20160624_log_making.txt

Please let me know if this is something you can work on and if you need any more information from me to do so. Thank you Rob!!!

rrmm commented 8 years ago


> I have tried using the last set of files (for the suite TrackTools) that I downloaded on my computer and also the first. I have downloaded the TrackTools files (prior to making/building) on both my local Macbook and directly onto Rama itself. I am unable to install this on both my own computer and on Rama. I have also tried deviating from the installation instructions, on Rama. I am told to "make realclean" then "make". If I simply "make" I end up with OSX compatible files, instead of Linux compatible.

The OSX files are included in the zip. They are not rebuilt when you just run make.

In your log, the linker is not finding the functions it needs to create the executable.

Try going into TrackTools/track_tools/src. In that directory, open the Makefile and change the following two lines so they look like this:

   INCFLAGS=-I. `pkg-config --cflags zlib`
   LIBS=-L../lib -lniftiio -lznz -lm `pkg-config --libs zlib`

(Note the use of the backticks, if you don't copy and paste.) Then try make realclean and make again as the README describes.

simonero commented 8 years ago

@rrmm

Still doesn't work.