CMRR-C2P / MB

Support for CMRR multi-band pulse sequences
http://www.cmrr.umn.edu/multiband/
MIT License

CPU spike after multiecho scan #341

Open jeffreyluci opened 11 months ago

jeffreyluci commented 11 months ago

We recently started acquiring 3-echo ME EPI. The sequence always completes, and the images are always reconstructed in full. However, sometimes (maybe 75% of the time), the CPU on the host computer spikes, the View&Go tab freezes, and eventually, the whole scanner crashes.

[screenshot: CPU load spike on the host]

I have tried many times, without success, to find a way to make this happen reproducibly. However, if I run the scan once and it doesn't happen, re-running the exact same scan (using Append) triggers it every time. I have reduced the protocol to just this one scan, run it alone, and still see the problem. It has never happened without this scan having been run, and if we run the protocol containing this scan but never reach it in the scan list, everything works without issue.

This is a Prisma running XA30A. Here is the version screen:

[screenshot: software version screen]

I have attached an exar1 file with the offending scan, and here is a link to download a savelog acquired just before a complete freeze-up:

https://rutgers.box.com/s/ofzvkalgnf1z9oivf7g1swstm19xzn7w

If you have any suggestions or advice, or if you need any other information that would help chase this down, I'm more than happy to oblige.

Thanks!

mharms commented 11 months ago

Hi, you have run into what appears to be a fundamental limitation in how the imaging database is set up for XA, which is especially problematic for multi-echo. To my knowledge, there is unfortunately no simple solution currently. You could try the following:

If those don't provide a viable workaround, then I think the only path forward currently might be something similar to what the MR physicist at WashU did to get multi-echo working here: a custom ICE module that, I believe, lets only the first echo get sent to the database (for online, real-time review), with all the other echoes written out to a binary file (while still reconstructing on the scanner).

jeffreyluci commented 11 months ago

@mharms, Thanks for the note.

I tried several of your suggestions. When I close the View&Go tab and suppress 16-bit DICOMs, the CPU load remains high but manageable: it never reaches the point where the host becomes unresponsive or unstable, and it certainly hasn't crashed the way it did before.

While I'd really like to keep the 16-bit dynamic range for functionals, especially multiecho runs that use more of the dynamic range than their single-echo versions, this will be how we proceed until and unless another solution becomes available.

With regard to your custom ICE package, do you retrospectively reconstruct the later echoes? Or are they reconstructed in real time but not inserted into the database until later? (I suspect it uses retro recon, but thought I'd check in case there is an ICE method I'm unfamiliar with!)

For the record, we have the Advance Host and the MaRS with the High Performance Computing option. I know earlier Prismas exist with lesser options, and I just wanted to let people here know ours isn't one of them.

Thanks again, Jeff

mharms commented 11 months ago

I believe what our physicist set up is that the recon itself proceeds normally, but only the first echo gets sent to the imaging database. The other reconstructed echoes are somehow written directly to a binary file.

jeffreyluci commented 11 months ago

Hmmm. I didn't know that was possible.

What is the reason for the binary file? If the images are reconstructed, what does the binary file provide that the RAID .dat file doesn't?

mharms commented 11 months ago

The binary file is the reconstructed image-domain data. The dat file would be the raw k-space data, right?

mharms commented 11 months ago

So, this solution avoids the need to recon the raw k-space data off the scanner.

jeffreyluci commented 11 months ago

Yes, that's right. I think I understand now: only the first echoes are sent to the database, and the later echoes are reconstructed and stored in a separate file. Excellent.

Is there a way for the scanner to construct DICOMs from the extra data file? If there is, I might ask your physicist if they would mind sharing with us. That would be an ideal solution.

cihateldeniz commented 11 months ago

Thanks Dr. Harms.

Hi Jeff,

I am the MR physicist at WashU. Dr. Tim Laumann and I arrived at the solution Dr. Harms is referring to above. Here are a few more details:

The reconstruction block is placed all the way at the end, just before the last Siemens block. It allocates two large matrices of size #Columns x #Rows x #Slices x #Echoes (one for magnitude and one for phase) and saves all echoes as they arrive. It then sends only the first-echo images on for DICOMization. At the very last echo (i.e., when the large matrices are completely filled), the block writes the binary files out to a folder located at C:\ProgramData\Siemens\Numaris\SimMeasData. Each file is named with the following convention for, say, Measurement 5:

PatientID_SeriesInstanceUID_Mag_0005.dat
PatientID_SeriesInstanceUID_Phs_0005.dat

PatientID is the patient ID used during patient registration. SeriesInstanceUID matches the DICOM tag (0020,000E) exactly, except that the DICOM header contains six additional characters, ".0.0.0". This makes it possible to connect the first-echo images saved on disk with the first-echo DICOMs. SBRef images are saved as well, with an additional "_SBRef" in the name just after the SeriesInstanceUID. The fact that you always have both versions of Echo 1 serves as a sanity check: you can compare the two images to see whether they are identical. As one might expect, all header information will need to come from Echo 1 during the DICOM-to-NIfTI conversion; Andrew Van, working with Tim, wrote a wrapper for this.

Andrew also wrote a PowerShell script that lets the scan operator copy these files to a network drive on the scanner (into a folder with the patient ID in the name), which the DICOM database scans daily for new arrivals. Worst case, the scan operator can also copy the files manually onto a USB drive.
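To make the matching concrete, here is a rough Python sketch (this is not part of the package; the helper name is made up, it assumes pydicom is installed, and the export folder is the one mentioned above) showing how a first-echo DICOM can be paired with its exported .dat files:

```python
# Illustrative only - not the WashU package. Assumes pydicom is available.
from pathlib import Path
import pydicom

EXPORT_DIR = Path(r"C:\ProgramData\Siemens\Numaris\SimMeasData")

def find_dat_files(first_echo_dicom, meas_number, sbref=False):
    """Pair a first-echo DICOM with its exported Mag/Phs .dat files."""
    ds = pydicom.dcmread(first_echo_dicom, stop_before_pixels=True)
    # The DICOM SeriesInstanceUID (0020,000E) carries six extra characters
    # ".0.0.0" that are absent from the exported file names.
    uid = ds.SeriesInstanceUID
    if uid.endswith(".0.0.0"):
        uid = uid[:-6]
    stem = f"{ds.PatientID}_{uid}"
    if sbref:
        stem += "_SBRef"          # SBRef exports add this right after the UID
    mag = EXPORT_DIR / f"{stem}_Mag_{meas_number:04d}.dat"
    phs = EXPORT_DIR / f"{stem}_Phs_{meas_number:04d}.dat"
    return mag, phs

# e.g., for Measurement 5:
# mag_file, phs_file = find_dat_files("echo1.dcm", 5)
```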

I hope this helps. Please let me know if you have any questions.

I can upload the solution and the instructions to Siemens' C2P TeamPlay platform if you would like to give it a try.

Thanks.

jeffreyluci commented 11 months ago

Hi @cihateldeniz!

That sounds great! Yes, we would like to give it a try.

If I can do anything to help facilitate this, please let me know. I'm happy to do legwork or make resources available that might help.

Thanks a ton, Jeff

cihateldeniz commented 11 months ago

Great! I will prepare the package (which will also include a Matlab script to read the binary file correctly) and upload it to TeamPlay by midnight on Friday (Central Time). I will post a confirmation here when it is online.

Thanks.

cihateldeniz commented 11 months ago

Hi Jeff,

The package is now on TeamPlay - please see the screenshot attached.

Please give it a shot. I hope it goes well.

Thanks.

[screenshot: TeamPlay listing]

jeffreyluci commented 11 months ago

Thanks a ton! I'll try it first thing Monday morning.

Jeff

rhancockn commented 10 months ago

To add my experience, I had the same issue on pre6 with 4-echo sequences (with DICOM physio logging and 16-bit DICOMs disabled): they consistently exhausted the 64 GB of host RAM and crashed the system after about 12 minutes, even with reconstruction turned off on the program card. Updating from pre6 to pre7 was a big improvement! On pre7, we can usually get through an 18-minute run, but it's critical to keep an eye on the system load and let the load drop down to ~25% between scans, especially before longer scans.

I'm looking into filling the open memory slots on the system and bringing the RAM up to 128GB—has anyone tried this as a solution?

rwmair commented 9 months ago

> On pre7, we can usually get through an 18 minute run, but it's critical to keep an eye on the system load and let the load drop down to ~25% between scans, especially before longer scans.

> I'm looking into filling the open memory slots on the system and bringing the RAM up to 128GB—has anyone tried this as a solution?

I was re-reading this thread as our own XA30 upgrade approaches, and we have a few groups doing multi-echo BOLD.

Your 18-min scans using R017pre7: are they still run with 16-bit DICOM and inline reconstruction turned off, or can you now run conventionally? Has anything changed/improved in the six weeks since?

I didn't know the PCs had open memory slots. Did you try filling them? Our CSE is very attuned to the research environment, so they would likely support that if it's possible/beneficial.

jeffreyluci commented 8 months ago

With enormous help from @cihateldeniz, and from others in BIDS-land on documenting data provenance, I was able to implement his workaround for CMRR multiecho data. Unfortunately, the Python packages that build NIfTIs from the raw pixel data are no longer supported, and several are deprecated. Rather than fiddle with maintaining multiple Python distros, I decided to write my own code to fill that need. I'm a MATLAB guy, so it's in MATLAB. If your data do not reside on an SSD, the processing is slow (~3 minutes for a BOLD run with 775 timepoints and 3 echoes), but it works. If your data do reside on a fast SSD, the speed improves significantly (<20 seconds for the same dataset).

If you care to use this workaround, see Cihat's links above to download and install the recon functor, and my code to build the NIfTIs is available here:

https://github.com/jeffreyluci/Siemens-Tools/tree/main/dat2niix

Good luck, Jeff

jeffreyluci commented 7 months ago

@rhancockn: For what it's worth, I used resource logging to see just how much memory our 3-echo runs were taking. During the scan, the memory usage was minimal (7-13%). After the scan, it would creep up a bit to a maximum of about 15%; it never went higher than that. So I'm not sure maxing out the memory is going to get you where you want to go. (I do have the high-performance workstation, though, so if you don't, that might be a difference.) But hold on, because there may be good news here (vide infra).

@cihateldeniz and @mharms: We started to see this kind of behavior not just with multiecho CMRR runs, but during single-echo product EPI runs, too. As a matter of course, I made a ticket with UpTime, and they dispatched our FSE to check it out. That was just pro forma, of course, but he did show up with SP03, which we were scheduled to have installed next week. We figured we were going to have to do it anyway, so why not now? That was Friday. On Saturday, I reinstalled all our C2P sequences and tested them. Even without NORDIC, I had no problems running. We ran 6 hours of 3-echo EPI testing this morning without NORDIC; this was just the CMRR sequence running without help. The CPU would occasionally spike to around 80%, and the memory would climb to around 40-50%, but those events lasted only a few seconds, and even then the host never became unresponsive.

I have more to do before I'm comfortable being confident in this. In fact, we have a research scan in ten minutes, and we're going to stick with NORDIC for that. But at least initially, it looks like Siemens might have fixed whatever was causing this in XA30A-SP03. I'll report back with results as I get them. Fingers crossed.

eauerbach commented 7 months ago

It's good if things are looking better in SP03, but I think I can say everyone acknowledges that this is a real problem (not one just affecting the C2P), and it is not fixed in any current release.

jeffreyluci commented 7 months ago

@eauerbach Does that mean you think SP03 is not a real fix? Like I said, I'm not at all confident that it is, but if you KNOW that it isn't, I guess we'll soldier on using NORDIC as a workaround.

Do you have any info on what the source of the problem is? I'm not asking you to break an NDA or anything. But if someone has tracked down the root, it would make me feel better that at least someone is working on a fix.

eauerbach commented 7 months ago

No, SP03 does not contain a real fix. Yes, the problem is basically understood and we are working on it, but the real fix is weeks, if not months, out. If SP03 is working better for you, great; hopefully the UI is more stable and more consistently responsive in SP03. But I would not expect meaningful improvements to the overall image reconstruction speed... that is a deeper problem. If the WashU workaround is working for you, also great (but that will never be "officially" supported).

jeffreyluci commented 7 months ago

Excellent. Thank you. That is enormously helpful.

I wouldn't expect the WashU workaround to be supported here. I'm just using this forum to help others get around the issue as best I can.

rwmair commented 7 months ago

@jeffreyluci,

Thanks for your testing and reporting back. I'd been meaning to do the same. We upgraded our Prisma from VE11C to XA30 three weeks ago, and by that time we had received SP03 on the PC from Germany. While multi-echo BOLD hasn't been a prime focus of our efforts, a colleague and I have tested a 4-echo, TR = 1.3 s protocol for up to an hour without the PC UI becoming unresponsive or crashing. During the first scan (~7-8 minutes), the server load stayed low. After that it ramped up into the 80-90% range as the DICOMs started arriving on the host PC and being placed in the patient browser. However, it sat at those levels for lengthy periods without an impact on the host PC performance. (Admittedly, we weren't asking it to do much else, and we kept the View&Go window closed.)

Maybe the real root problem is not fixed, but SP03 appears to be enough of an improvement to let the system function at the most basic level without appearing to slow down or threatening to crash. I'll look forward to further improvements if there's more in the coming weeks/months. To that end, I don't find the reconstruction time a problem, as the inline display shows the images in near real time. Rather, the delay in getting the images into the Patient Browser and then into View&Go is a real problem (now even for basic single-echo BOLD scans), especially for our groups that scan small children and want early feedback on how bad the motion is. We'd trained them to inspect the SBRef image within the first 15-20 seconds of the scan starting, but now it usually isn't visible until near the end of a ~5-minute scan.

jeffreyluci commented 7 months ago

@rwmair: Yes, I forgot to mention that we keep View&Go closed for these scans. My intuition is that this contributes little, if anything, to the stability, but I have no hard evidence to support a conclusion either way. Since we really don't need View&Go during those scans, it doesn't hurt to close it, so that has been our SOP. The same goes for the inline viewer: I don't know if it hurts, but in the absence of a need, we just don't open it.

Keeping an eye on the realtime results for motion detection is a bit of an issue, though. To solve that problem, I have been developing an open-source viewer and motion tracker that will function like a formerly free solution you might be familiar with that rhymes with "BERMM." We can't justify paying what they charge now, and they are unwilling to give us any kind of break even though I did a TON of work to help them expand the functionality to support XA. So I'm making my own. It isn't ready yet: the realtime coregistration is a bit too slow, but I think I'm close to getting that working. The realtime viewer does work. It runs on a realtime computer that we have set up on the scanner CAN. Currently, it looks for DICOMs written out via the BOLD Add-in to a network-mapped drive, but it would be relatively simple to have it look instead for just the raw pixel data written by the NORDIC pipeline. Is that something you could use? Or do you find the inline display robust enough?
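Just to give a flavor of what I mean by "looks for DICOMs," here is a bare-bones Python sketch (not my actual viewer; the watch path is a placeholder and the callback is whatever processing you want per new image):

```python
# Illustration only: poll a network-mapped folder for newly written DICOM
# files and hand each new file to a processing callback.
import time
from pathlib import Path

WATCH_DIR = Path(r"Z:\incoming_dicoms")  # placeholder for the mapped drive
POLL_INTERVAL_S = 1.0

def watch_for_dicoms(process_file, stop_after_s=None):
    """Poll WATCH_DIR and call process_file(path) once per new .dcm/.ima file."""
    seen = set()
    start = time.monotonic()
    while stop_after_s is None or time.monotonic() - start < stop_after_s:
        for path in sorted(WATCH_DIR.glob("*")):
            if path.suffix.lower() in (".dcm", ".ima") and path not in seen:
                seen.add(path)
                process_file(path)
        time.sleep(POLL_INTERVAL_S)

# Example: just print arrivals for 60 seconds.
# watch_for_dicoms(lambda p: print("new image:", p.name), stop_after_s=60)
```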

rhancockn commented 7 months ago

@jeffreyluci, Thanks for sharing your experience! It's good to know SP03 might help. I did get an additional 32GB RAM installed in the host, which was a huge help for my case. Interestingly, the memory usage hangs around 30% now (not ~60% as one might expect). The MGH vNav sequences also start more reliably after the upgrade.

jeffreyluci commented 7 months ago

@rhancockn: I'm confused - why would you expect the host to hang at ~60% usage? I would think that, barring address overruns, the system should be able to handle that with ease, or really even much more than that, up into the high 90s. Am I missing something?

Did you match the same memory that was already installed in your system? Did Siemens help with any of that process?

Thanks, Jeff

rwmair commented 7 months ago

@jeffreyluci, thanks for all the extra input. I also have no personal experience that implicates View&Go, just the early XA30 experiences of @eauerbach, where it was apparently appearing constantly in his error logs. We'll monitor others' experience with SP03 in this area. I keep "pushing the envelope" a little by keeping View&Go open and loading some scans up to view while I'm running. Loading a BOLD scan certainly drives the server load well into the yellow, although I haven't experienced any downside from that.

I'm familiar with the software you mentioned, and we also can't justify the cost of the product version. We were already developing a little software tool to catch simple errors (like the top of the head coil not being plugged in), so we added a basic motion-correction calculation for BOLD scans, though at this point it runs only after the scan is complete. And with the delays in data arriving on the XA30 PC, this process is slowed down further. I have it on my to-do list to look into the BOLD add-in in XA30 to stream the data sooner rather than waiting for the scan to end, but I will be interested to hear how your effort progresses.
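For what it's worth, even a crude post-scan check can flag the worst runs. A sketch along those lines (not our actual tool, which does a proper motion-correction calculation; this one just tracks volume-to-volume center-of-mass shifts as a rough proxy, and assumes the run is already a 4D NIfTI with nibabel/numpy/scipy available):

```python
# Rough proxy for gross motion: volume-to-volume center-of-mass shift in mm.
import numpy as np
import nibabel as nib
from scipy.ndimage import center_of_mass

def gross_motion_estimate(nifti_path):
    img = nib.load(nifti_path)
    data = np.asarray(img.dataobj, dtype=np.float32)    # x, y, z, t
    voxel_size = np.array(img.header.get_zooms()[:3])   # mm per voxel
    coms = np.array([center_of_mass(data[..., t]) for t in range(data.shape[-1])])
    # Euclidean shift between consecutive volumes, converted to mm.
    return np.linalg.norm(np.diff(coms, axis=0) * voxel_size, axis=1)

# Example: flag transitions with more than 0.5 mm apparent shift.
# bad = np.where(gross_motion_estimate("bold_run.nii.gz") > 0.5)[0]
```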

cihateldeniz commented 7 months ago

Dear All,

For a number of reasons, we have just updated the keyword that activates the custom solution that DICOMizes only the first echo. So, rather than "BOLD_NORDIC", the solution will now look for "BOLD_MBME_DcmOnlyE1" in the protocol name. Please see the screenshot below for the updated announcement on TeamPlay:

[screenshot: updated TeamPlay announcement]

Thanks.