Illumina / manta

Structural variant and indel caller for mapped sequencing data
GNU General Public License v3.0

Understanding excessive memory usage #154

Open ernfrid opened 6 years ago

ernfrid commented 6 years ago

I've been running a large number of ~ 30X human CRAMs (aligned to GRCh38DH) through Manta for germline SV detection. I've typically been providing 16GB of RAM and 8 cores. For the majority of samples, this is sufficient, but for a small subset I need additional RAM and I've recently encountered a sample that utilized ~70GB of RAM to complete.

Unfortunately, I will be unable to share the CRAM in question. Is there any part of the output or log files that might shed light on the source of this excess usage? In particular, I'm curious if there is a particular region or contig that I might exclude from my analysis to reduce the usage (or use to predict excess usage in advance).

ctsa commented 5 years ago

Thanks @ernfrid - Sorry for the delay. In general, we haven't had problems with high memory usage. The main place we've had to deal with it is certain FFPE samples going into the tumor-normal somatic analysis with chimeric read-pair rates above 10%, but even in those cases it was a question of whether the graph merge (which has always been the point of peak memory demand in our in-house observations) would exceed 2 GB -- so if you're observing a problem approaching 70 GB, this is likely to be a different type of issue.

A few questions which might help us make suggestions:

I would also note that the statistics files output by manta, summarized here:

https://github.com/Illumina/manta/tree/master/docs/userGuide#statistics

...could be helpful. These stats are primarily focused on runtime rather than memory debugging, but if you could share these files for a problematic case, we might be able to spot something helpful.

ernfrid commented 5 years ago

No trouble on the delay.

If I limit the memory to ~12 GB and a single core, one of the subjobs running GenerateSVCandidates crashes. When running the same sample through with only a single thread and excess RAM, I saw peak usage of ~14 GB, so it doesn't seem to be a particular subjob using huge amounts of RAM by itself. I've attached the stats directory (with paths redacted) for the 8-core run that consumed 70 GB.

stats.tar.gz

The manta version is v1.4.0

I haven't attempted the CRAM -> BAM conversion and run, but I will give that a try.
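
As a concrete example of what I mean by the conversion, something along these lines should work (the file names and reference path here are only placeholders, not the actual pipeline inputs):

# decode the CRAM against its reference and re-encode as BAM, then index
samtools view -b -T GRCh38DH.fa -@4 -o sample.bam sample.cram
samtools index sample.bam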

ernfrid commented 5 years ago

Changing the input to a BAM instead of a CRAM dropped the memory usage precipitously (to the point that I almost don't believe it) to 1.1 GB.

ctsa commented 5 years ago

Either a leak in htslib or in our API usage... Possibly this one?

https://github.com/samtools/htslib/pull/679

This fix appears in htslib 1.8. I will go ahead and see if the update to 1.9 is easy, and send you a branch link to see if we can confirm the issue.

ernfrid commented 5 years ago

That looks promising. I'd be happy to test it if you're able to point me to a branch.

ctsa commented 5 years ago

If you are set up to build manta from source, the branch is here:

https://github.com/Illumina/manta/tree/feature-MANTA-1483

ernfrid commented 5 years ago

Compiled and ran this branch. The samtools in libexec does report as v1.9.

Memory usage was ~72 GB as reported by LSF, so it seems the htslib update didn't change the memory usage.

ctsa commented 5 years ago

Okay, perhaps we can reproduce the problem with an in-house TN sample? I would like to recreate your CRAM file conditions -- can you describe which command line, tool versions, and reference are being used to create these files?

ernfrid commented 5 years ago

Sorry about the delayed response @ctsa

This CRAM was created with a pipeline conforming to the guidelines here: https://github.com/CCDG/Pipeline-Standardization/blob/master/PipelineStandard.md

See also the pre-print here: https://www.biorxiv.org/content/early/2018/04/10/269316

This is not my pipeline, but it looks like Picard MarkDuplicates (v2.6.0) is being used. From the header I see:

ASSUME_SORT_ORDER=queryname TMP_DIR=[/tmp/] QUIET=true COMPRESSION_LEVEL=0    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO VALIDATION_STRINGENCY=STRICT MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.jso

I also see SAMBLASTER entries:

ID:SAMBLASTER   CL:samblaster -i stdin -o stdout --acceptDupMarks --addMateTags VN:0.1.24

...and a GATK PrintReads entry:

@PG ID:GATK PrintReads  VN:3.6-0-g89b7209   CL:readGroup=null platform=null number=-1 sample_file=[] sample_name=[] simplify=false no_pg_tag=false

There should also be base recalibration, but I see no @PG entries and can't comment on those.

ernfrid commented 5 years ago

I think I may have discovered the issue here. The CRAM(s) in question appear to be missing the @HD line of the header. At least in one case, adding this line (as @HD\tVN:1.5\tSO:coordinate) seems to resolve the issue. Does that make sense from a developer perspective?
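
For anyone hitting the same thing, the fix can be scripted roughly like this (file names are placeholders; as far as I know reheader's in-place mode only works on CRAM):

# pull the existing header, prepend the missing @HD line, and write it back in place
samtools view -H sample.cram > header.sam
printf '@HD\tVN:1.5\tSO:coordinate\n' | cat - header.sam > fixed_header.sam
samtools reheader -i fixed_header.sam sample.cram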

ernfrid commented 5 years ago

@ctsa - I'm going to resolve this since I'm fairly certain that the missing header line is the source of my issues and since converting to BAM or adding the header line remedies the problem.

I did go back and look at the logs and didn't see an indication that the missing line is triggering sorting (or some other issue). If there is a sort step in this case, as I am guessing, it would be wonderful if there were feedback to the person running Manta that sorting was occurring. My apologies if this is in there and I'm simply missing it, or if I'm wildly off track on what is going on here.

ctsa commented 5 years ago

Hi Dave,

Sorry to lose track of this one. The effect of the @HD line would have to be coming from somewhere in htslib itself, as Manta doesn't look at this detail in the header, nor does it communicate anything about the sorting status of the CRAM file -- it is assumed that the required index could not have been created if the file were not sorted to begin with, and an out-of-order read traversal should just trigger an assertion.

ctsa commented 5 years ago

Perhaps @jkbonfield could offer some htslib developer perspective on this? As described in the above comments, when Manta uses htslib to make a large set of indexed genome region requests all over the genome (similar to the calls that would be made by samtools view at the API level), the observed memory usage is normal (i.e. fairly low) when the indexed read requests are coming from a BAM file or a CRAM file with an @HD coordinate-sorting tag as described here:

https://github.com/Illumina/manta/issues/154#issuecomment-436755779

But for the CRAM file without that header line, the memory usage increases drastically. Do you have any notion of what could be triggering this?

jkbonfield commented 5 years ago

I don't understand why a coordinate-sorted file that is missing the coordinate-sorted tag would make a difference, at least not from an htslib perspective.

There is some code to check sort order in relation to the CRAM "AP_delta" field. This is set to true by the encoder when writing out sorted data (as 10, 12, 15, 16, 16, 18, ... is smaller when written as 10, 2, 3, 1, 0, 2) and to false when the data is unsorted.

The decoder appears to double-check this, in so far as AP_delta==0 combined with no @HD SO:coordinate tag will enter "unsorted" mode, meaning it caches the reference differently because it assumes we'll keep spamming through all references in a random order. (Incidentally, if writing aligned data that is not coordinate sorted, it's probably best to simply encode using referenceless mode: "samtools view -O cram,no_ref".)

I'm not familiar with manta, but is this high memory usage also seen when doing "samtools view -c -@8" to simply count the records in a CRAM file? What about when giving it many, many regions to iterate through? It's possible we have another memory leak somewhere related to sort order, so I'll have a look.
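
Concretely, something like the following is what I have in mind (file and region names are only illustrative):

# plain streaming count, with and without threads
samtools view -c -@8 sample.cram
samtools view -c sample.cram
# indexed queries over a couple of the (very many) small regions
samtools view -c sample.cram chr1:100000-100500 chr2:200000-200500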

You may find the heap profiler in tcmalloc a good place to start for diagnosing where most of your memory is being spent.

https://gperftools.github.io/gperftools/heapprofile.html
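
For example, something along these lines (the tcmalloc library path varies by system, the profile prefix is arbitrary, and pprof may be installed as google-pprof):

# dump heap profiles while streaming through the file
LD_PRELOAD=/usr/lib/libtcmalloc.so HEAPPROFILE=/tmp/cram_test samtools view -c -@8 sample.cram
# then inspect the largest allocation sites in one of the dumped profiles
pprof --text $(which samtools) /tmp/cram_test.0001.heap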

ernfrid commented 5 years ago

Thanks @ctsa and @jkbonfield. Unless you've independently replicated this, let me triple-check this behavior before proceeding to investigate some of James' suggestions.

ernfrid commented 5 years ago

Ok @ctsa and @jkbonfield. I've verified this again by reheadering to remove the @HD line. Currently I'm seeing 44 GiB of RAM being used by Manta when the CRAM doesn't have the @HD line and only 1.1 GiB when it does. I expect these will peak higher, and show usage close to what I'd reported above, once the processes finish. I'm feeling pretty confident I am not leading you astray.
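
For completeness, a reheadering along these lines reproduces the failing case (file names are placeholders):

# strip the @HD line from the header and write it back in place
samtools view -H sample.cram | grep -v '^@HD' > no_hd_header.sam
samtools reheader -i no_hd_header.sam sample.cram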

I also tried @jkbonfield's samtools view -c test and see nearly identical RAM utilization. I haven't yet tried iterating through many regions.

ctsa commented 5 years ago

@jkbonfield per your question about Manta's access pattern: The problematic memory usage seems to be coming up in manta's second 'SV generation' phase, wherein each process will be cycling through a large number of SV breakpoint locations and requesting access for a window of several hundred bases surrounding each candidate breakend. This access pattern is quite different from the small number of large regions typically requested in many other variant calling activities, and thus Manta's SV generation phase has triggered some CRAM issues before:

https://github.com/samtools/htslib/issues/654 https://github.com/samtools/samtools/issues/455
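
At the command line, a rough approximation of this access pattern would be a loop like the one below (the BED file of candidate windows is hypothetical, and unlike Manta this re-opens the file for every query rather than reusing one open handle with many iterators):

# issue a large number of small indexed region queries against the CRAM
while read chrom start end; do
    samtools view -c sample.cram "${chrom}:${start}-${end}"
done < candidate_windows.bed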

ernfrid commented 5 years ago

So without the header, I see peak usage of 70.14 GiB and with the header present I see 5.47 GiB.

jkbonfield commented 5 years ago

> I also tried @jkbonfield's samtools view -c test and see nearly identical RAM utilization. I haven't yet tried iterating through many regions.

Do you mean that samtools view -c is also using 40+GB of RAM? If so then it's definitely something in our code causing this high memory usage! Odd and a cause for concern. It does sound rather like a memory leak.

Is this tied to using multi-threaded mode? Ie with no threads (no -@ option) does it also gradually climb in memory?

ernfrid commented 5 years ago

Doh. That was poorly worded. I see only a slight difference in the rather modest memory usage of samtools view. It's about 0.8 GiB in both cases.

jkbonfield commented 5 years ago

Thanks for the clarification.

What htslib calls does Manta make? Is it a pure open and stream through the data set work-flow? Is it opening and closing many times? Opening once, but doing lots of seeks? Lots of iterator creation / destruction?

It's possible we still have a memory leak within htslib, but I'd suggest starting with a memory-leak checking tool (tcmalloc, valgrind, clang -fsanitize=address, etc.) first to see if it shows where the source is.
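
For instance (binary and file names here are just placeholders):

# valgrind leak check on a small reproduction of the access pattern
valgrind --leak-check=full samtools view -c sample.cram chr1:100000-100500
# or rebuild with AddressSanitizer (add -fsanitize=address -g to the compiler
# and linker flags) and run the same command normally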

jkbonfield commented 5 years ago

To double check, I just tried samtools view on a local CRAM file with and without the @HD header with 20 ranges, using valgrind, and it found no leaks. The total number of memory allocations and usage was identical too.

ernfrid commented 5 years ago

I'm afraid I'm not going to have time to figure out how to run Manta with valgrind to perform a similar analysis as @jkbonfield. @ctsa - I'm going to leave this open, but I've probably taken it as far as I am able to.

ctsa commented 5 years ago

Thanks for the updates and analysis @ernfrid & @jkbonfield. Unfortunately, no one on our team is able to prioritize this issue either, as we don't have a CRAM production support requirement.

I agree this is still an important issue that we'd like to address, and it should remain open. Some notes for whoever is able to pick this up next:

https://github.com/Illumina/manta/blob/a922b746a0fe603ae5ad3a284b00b8e735b72709/redist/CMakeLists.txt#L77

https://github.com/Illumina/manta/blob/a922b746a0fe603ae5ad3a284b00b8e735b72709/redist/CMakeLists.txt#L80

ASAN_OPTIONS=detect_leaks=1

More details on LeakSanitizer options can be found here:

https://github.com/google/sanitizers/wiki/AddressSanitizerLeakSanitizer
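
As a sketch of how these notes fit together: enable the AddressSanitizer flags at the CMakeLists.txt lines linked above, rebuild, and then run the workflow with leak detection switched on (the analysis path and job count below are placeholders):

# LeakSanitizer reports any leaks at process exit when detect_leaks=1
ASAN_OPTIONS=detect_leaks=1 ${MANTA_ANALYSIS_PATH}/runWorkflow.py -j 8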