ChristofferFlensburg / superFreq

Analysis pipeline for cancer sequencing data
MIT License

error: Flagging SNPs that are in noisy regions. New flags by sample:Killed #124

Closed. Teezi closed this issue 2 months ago.

Teezi commented 3 months ago

Hi Chris,

I was analysing whole-genome sequencing data using superFreq with 10 CPUs and 200 GB of memory. And I used:

Some of my samples run fine, but others keep crashing with this error:

Flagging SNPs that are in noisy regions. New flags by sample:Killed

I suspected it was a memory-related issue. So I did the following:

What are your thoughts on this error? Could it be a memory issue? Do you think using the filtered outputs from MuTect2 and HaplotypeCaller would be beneficial?

Any insights would be much appreciated!

ChristofferFlensburg commented 3 months ago

Hi!

Yep, looks like a memory-related issue. Whole genomes in general push superFreq (or R, really) in terms of resources. It's possible to get runs through, but it requires more attention than exomes or RNA-Seq. It seems you made it pretty far in your run, so just a few tweaks should get it through.

SuperFreq's memory usage scales with the number of CPUs (due to a flaw in R), so the easiest way to decrease memory usage is to decrease the cpus input parameter in superFreq(). If you go down to 6 or 4 CPUs (or, worst case, 2) and give it 440G, it is likely to go through, although a little slower.
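To make that concrete, here is a minimal sketch of the resubmitted call. Only the cpus argument is the point; the other argument names follow the superFreq README, and all paths are placeholders for your own setup:

```R
library(superFreq)

# Placeholder paths: substitute your own metadata file, reference
# normals, and reference genome. The key change is cpus: each parallel
# R worker holds its own copy of large objects, so fewer cpus means
# lower peak memory.
superFreq('metaData.tsv',
          normalDirectory = 'referenceNormals/bam',
          Rdirectory      = 'R',       # save points are written here
          plotDirectory   = 'plots',
          reference       = 'hg38.fa',
          genome          = 'hg38',
          cpus            = 4)         # down from 10; slower, but lighter on RAM
```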

There are regular save points in superFreq, and less stuff gets stuck in RAM when results are loaded from a save point, so sometimes it's enough to just resubmit a few times; it'll get a couple of save points further each time...

Reducing the input variants from GATK is another way to go, but that should be a last resort, as you risk removing important variants. In your case, seeing that you already made it pretty far, I think just reducing CPUs and rerunning (a couple of times) should be enough.
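If it does come to that, one common way to thin the input (an assumption about your setup, not something superFreq requires) is to keep only PASS calls from the GATK VCFs, for example with bcftools:

```bash
# Keep only variants whose FILTER column is PASS; sample.vcf.gz is a
# placeholder for a filtered MuTect2 or HaplotypeCaller output.
bcftools view -f PASS -O z -o sample.pass.vcf.gz sample.vcf.gz
bcftools index -t sample.pass.vcf.gz
```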

Teezi commented 3 months ago

Thanks for your rapid response!!

Understood... I will use 6 CPUs and have a go.

Regarding memory, can I specify a parameter within the superFreq() function, or can it only be set with #SBATCH --mem=440GB (I am running on an HPC)?

Regarding the regular save points, does this mean I don't need to delete the outputs from the R/ and plots/ directories for the specific failed sample when re-running the analysis?

ChristofferFlensburg commented 3 months ago

The SBATCH setting controls how many CPUs are allocated to the machine you run on.

The superFreq() setting controls how many CPUs superFreq will actually use, and it also affects memory use.

So changing the sbatch CPU count isn't going to affect how much memory superFreq is using. You want to match the two to optimise the use of your HPC resources.
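As an illustration of matching the two, a hypothetical SLURM header for a 4-CPU run; runSuperFreq.R is a placeholder for a script containing the superFreq() call with cpus=4:

```bash
#!/bin/bash
#SBATCH --cpus-per-task=4   # match the cpus argument passed to superFreq()
#SBATCH --mem=440G          # memory sized for this cpu count
#SBATCH --time=48:00:00     # placeholder walltime

Rscript runSuperFreq.R      # hypothetical wrapper around superFreq(..., cpus=4)
```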

ChristofferFlensburg commented 3 months ago

Oh, sorry, memory... No, there is no memory setting in superFreq. It just goes and assumes there is enough...

ChristofferFlensburg commented 3 months ago

And yes, the save points are in the R directory, so keep those to avoid rerunning from the start.

Teezi commented 3 months ago

I see, and I will give it a shot. Thanks a lot!!

ChristofferFlensburg commented 3 months ago

Good luck, let me know how it goes so I can close the issue.

Teezi commented 3 months ago

> Good luck, let me know how it goes so I can close the issue.

Sure!