EichlerLab / smrtsv2

Structural variant caller
MIT License

High false-negative rate #46

Closed AyushSaxena closed 4 years ago

AyushSaxena commented 4 years ago

Please let me know what seems to be wrong with my logic. In an attempt to quantify the false-negative rate of the pipeline, I modified our reference genome with 100 randomly placed 100-bp insertions (overall I've added 350 insertions of varying sizes to the reference genome, the other sizes being 1 kb, 10 kb, and 100 kb). I hope to recover these as deletions in my PacBio data.

For a quick-and-dirty search for 'recovered' mutations, I looked for an increase in the number of deletions called (size 100 +/- 5 bp), without worrying about their genomic location for the time being, and I only recovered 24 extra ~100-bp deletions compared to the call set I had against the original reference genome.
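
For a rough count of that kind, something like the following could work (a sketch only; the file name and the SVTYPE/SVLEN INFO tags are assumptions and may not match SMRT-SV's actual output format):

```
# Count deletion calls whose length is within 100 +/- 5 bp, ignoring position
# (calls.vcf.gz is a placeholder file name)
bcftools query \
    -i 'INFO/SVTYPE="DEL" && ABS(INFO/SVLEN)>=95 && ABS(INFO/SVLEN)<=105' \
    -f '%CHROM\t%POS\t%INFO/SVLEN\n' calls.vcf.gz | wc -l
```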

These are the parameters I've been using to call variants so far (I thought I was using a fairly 'liberal' variant-calling setup):

```
smrtsv --tempdir ${TMP} --verbose run --runjobs 10,10,5,5 --batches 16 --threads 64 --species Elegans \
    --asm-polish arrow --min-support 3 --min-hardstop-support 3 \
    --asm-cpu 12 --asm-mem 20G --min-length 30 --sample PB306.ST1.WS270 --asm-rt "01:10:00:00" --asm-group-rt "02:00:00:00" \
    ${reference} ${inputs}
```

(Giving it an unreasonably long time to assemble has reduced my assembly error rates across samples.) I believe the key parameters the user can control here are --min-support 3, --min-hardstop-support 3, and --min-length 30.

I was expecting that keeping the minimum support and minimum hardstop support low would decrease my false-negative rate at the expense of my false-positive rate, but my false-negative rate is quite high (76/100).

With larger variants (size 1kb/10kb/100kb), I'm hardly recovering any.

Can you please point out a possible mistake I'm making in using this tool? I am also not sure if I'm losing variants I ought to recover during detection or post-assembly. I can attach whichever files you'd like me to.

Thank you Ayush

paudano commented 4 years ago

Are you simulating heterozygous or homozygous calls? SMRT-SV has a high false-positive rate driven almost entirely by missed heterozygous SVs. This is because it relies on unphased assemblies, and an SV may or may not be represented in the assembled contig. I always back it up with another caller (PBSV is a good choice) or run a de novo phased assembly if I can.

SMRT-SV calls variants from the reference-guided assemblies in windows across the genome. The detection step adds additional windows over SVs, and those parameters probably won't have a large effect.

Reads with very large SVs may not map or assemble well, so it may miss some of the 100 kbp SVs. Most of the 100 bp, 1 kbp, and 10 kbp SVs should be attainable.

I would start with a false-negative call (choose a 100 bp or 1 kbp simulated SV). Then look at the read and contig alignments. Read alignments are in "align/bam", and contig alignments are in "assemble/local_assemblies.bam". If you see a drop in read or contig coverage, then something went wrong with mapping or assembly.
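
For example, something like this could be used to check coverage at a locus (a sketch; samtools is assumed, and the region and the exact read-BAM path under "align" are placeholders you'd replace with your own):

```
# Read coverage over the window containing the expected SV
samtools depth -a -r chrI:1000000-1060000 align/bam/reads.bam > read_depth.txt

# Contig coverage over the same window
samtools depth -a -r chrI:1000000-1060000 assemble/local_assemblies.bam > contig_depth.txt
```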

AyushSaxena commented 4 years ago

Hi, maybe there is some confusion. I'm not simulating SVs in my PacBio reads; I'm adding random mutations to the reference genome, which I assume will always be homozygous. I'm doing it on the C. elegans reference genome (WS270), and there is only one fasta sequence per chromosome, so to be honest I don't even know how I'd go about introducing a heterozygous mutation into the reference genome. I'm interested in calling mutations in highly inbred worms, so I'm only interested in near-perfect homozygous calls.

"SMRT-SV has a high false-positive rate driven almost entirely by missed heterozygous SVs" Do you mean false negative rates? I'm confused.

I've also tried whole-genome assembly followed by Assemblytics, which seems to recover the inserted mutations fairly well (the downside being that it doesn't call any inversions). The Assemblytics pipeline also doesn't handle variants in the 100-kb range, but as you pointed out, 100 bp, 1 kb, and 10 kb were feasible.

Almost all our data is roughly 70-100x, with median read lengths of 9-10 kb across samples. I'm dealing with a 100-Mb genome. What parameters would you recommend for --min-support and --min-hardstop-support? I'm only interested in homozygous calls.

I'll look at the aligned reads and local assemblies for a couple of variants and eyeball them. Thanks for recommending that. I've used the default BLASR alignment parameters so far (Arguments:

Thank you Ayush

AyushSaxena commented 4 years ago

For one of the insertions I put in the reference (so now I'm expecting a deletion in my data), I found roughly 113x coverage post-alignment from BLASR and going into the Canu assembly of the region. I'm only expecting a 100-bp deletion. The depth of assembled contigs in the region is 4 (I'm not really sure what that means; are there 4 different assembled contigs that map to the same region?).

paudano commented 4 years ago

I see, you are simulating homozygous variants. Yes, I meant it has a high false-negative rate from assemblies. Sorry for the confusion.

I expect the default parameters to work fine. I have never had to tune them, but I have only run on human. If you see something strange, like missing windows or bad assemblies, parameter tuning may help that.

Yes, you will have multiple contigs per locus. Here's the SMRT-SV process.

1) SMRT-SV tiles 60 kbp windows over the genome, sliding by 20 kbp (3 windows covering each locus; illustrated below).
2) The detection step finds SV signatures and adds additional windows (it looks like it added one here).
3) For each window, reads are pulled from the alignment, assembled, and aligned back to the reference.
4) SVs are called from the contigs.
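
To illustrate the window scheme in step 1 (this is only an illustration of the tiling, not how SMRT-SV implements it; genome.chrom.sizes is a placeholder chromosome-sizes file):

```
# 60 kbp windows stepped every 20 kbp, so each locus falls in ~3 windows
bedtools makewindows -g genome.chrom.sizes -w 60000 -s 20000 > windows.bed
```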

For one SV event, SMRT-SV tells you how many contigs were aligned to the location (CONTIG_DEPTH) and how many contigs had the SV (CONTIG_SUPPORT). I filter out anything with CONTIG_SUPPORT == 1 (high false-positive rate otherwise).
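
If those annotations are present as INFO tags in the output VCF (an assumption; they may instead appear as columns in a BED output), a filter along these lines would drop the single-contig calls:

```
# Keep only calls supported by more than one assembled contig
bcftools view -i 'INFO/CONTIG_SUPPORT > 1' variants.vcf.gz -o variants.filtered.vcf.gz
```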

Your aligned contig depth is what I would expect. Are they all contiguously mapped over the SV region? Can you view the alignments and make sure you see the deletion?

AyushSaxena commented 4 years ago

Hey,

In my previous comment, I gave you the wrong information (when I eyeballed the SV, I was looking at the wrong chromosome!). It turns out that the variants I put into the reference genome (I have eyeballed 5 or so) are being missed during the detection phase. The sites in the reference genome where I introduced mutations aren't present in any of the assembled contigs.

I looked at the "candidates.bed" file in the "detect" directory to find the assembly group for my locus, and I couldn't find one. Then I checked the "gaps/gaps_del.bed" file (for the deletion I expect after simulating an insertion in the reference genome), and I couldn't find it there either. I've eyeballed the 100-bp and 1-kb size categories so far.

AyushSaxena commented 4 years ago

Silly question: is the tool itself geared towards being 'conservative'? I recall a plot from the Cell manuscript showing diminishing returns in the number of 'non-redundant' SVs as the sample size increases. That observation is consistent with few or no false positives, but is agnostic with respect to false negatives.

I'm using this pipeline to call new/spontaneous mutations from a mutation-accumulation experiment, and our lines barely diverge from the ancestor (0.0001% different). So our results are very sensitive to both false-positive and false-negative rates.

paudano commented 4 years ago

The detect directory is not the first place to look. It only generates windows for the assembly step. Even if detect misses an SV, there will still be 3 assembly windows that cover it (see above). The first place to look is at contig coverage over the SV. Do you see contigs there? If so, do they have the deletion?

If there are no contigs, then I would look at the alignment coverage in the "align" directory. If the alignments drop out, then so will the assemblies. If the alignments and contigs both look good, but there is no SV, then something else is wrong, and I would check again to make sure the SV is where you think it is.

For your artificial deletions (the sequence you are inserting into the reference), are you inserting "N"s or real sequence? Stretches of N will break things.

The diminishing returns on non-redundant SVs is because subsequent samples add fewer SVs that were not already seen in previous samples. You would see this from any variant caller unless there was an extremely high false-positive and/or false-negative rate. The diminishing returns on novel SVs confirms that a large proportion of genomic variants in the population are being sampled.

AyushSaxena commented 4 years ago

For my novel insertions, I'm pulling the sequences from a different location in the genome (which simultaneously creates a deletion at the source and an insertion at the destination).

I see your point about diminishing returns on non-redundant SVs. It does account for both false positives and false negatives.

I'll follow the steps you outlined and report back.

Thanks! Ayush

AyushSaxena commented 4 years ago

I've finished the case study on one of the simulated insertions (in the reference) that I expect as a deletion in my data.

I didn't find the region in the file "call/assembled_contigs.depth.bed", and the contig depth flanking the region (but still quite far from the locus of interest) is 1. I couldn't find the region using either the old coordinates from the original genome or the new coordinates in the simulated genome, so I'm assuming it's not covered at all.

Looking in the align folder, I checked the depth in the region of interest. In the original genome with the old coordinates, the reads map well and the coverage is on par for the region (my data is 224x overall). Checking the align directory in the run against the simulated genome, I see a visible drop in coverage across all batches to roughly one-tenth or less of the flanking region. The drop is steep, and the breakpoints match the location of the expected deletion. Because I'm expecting a deletion, the coverage in the 100-bp region should ideally be zero, but there are always a few mapping errors, as I understand it. I can't recall whether soft clips are counted by samtools depth.

The takeaway is that the alignment seems to have worked as expected, but there is no contig coverage for the region whether I look at the original genome or the simulated one, which, I guess, explains the false-negative call.

What should I do to increase sensitivity during assembly? (Assuming the other case studies find something similar; I'll check a few more.)

Thanks! Ayush

paudano commented 4 years ago

Great, this is what I needed. It's clearly a dropout in the assemblies, and I think it's because of your read depth. The assembly step limits the amount of time it spends on each window (30 minutes by default). This keeps Canu from spending days trying to assemble large tandem repeats (mainly alpha satellite arrays in human). You'll need to relax this limit with "--asm-rt". The value of the argument to "--asm-rt" is fed directly to the Linux "timeout" command, so you can use valid values for it (e.g. "6h" for a 6 hour limit).
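
For reference, GNU timeout takes a plain number with an optional s/m/h/d suffix rather than a colon-separated time, and per the above, the --asm-rt value is handed straight to it:

```
# GNU timeout duration syntax: NUMBER[s|m|h|d]
timeout 6h ./long_assembly_step.sh   # placeholder command, killed after 6 hours
```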

Assembly windows are batched into megabase groups, and the whole group is run as one job (one assembly after the other). There is also a group runtime limit, but it is only enforced by the grid system if you are using one (e.g. SGE or Slurm). It's tunable with "--asm-group-rt" and defaults to "72:00:00" (72 hours).

When you increase the runtime limit of the assemblies, those groups are going to take a very long time to run. You can change the batch size with "--candidate-group-size", which defaults to "1000000". I might try 200k ("200000"). I had left this parameter out of the option parser but just added it, so you'll need to pull the latest commit (f3d50f1c0874e488f1b00222a86ffbffe01a70dd, tag 2.0.2) to use this option.

For the detect step, you may want to tune the parameters for high coverage. The defaults ("--max-coverage" = "100", "--max-support" = "100", "--min-coverage" = "5", "--min-hardstop-support" = "11", "--min-support" = "5") may need to be increased. We have used these parameters up to about 100x, but never beyond that. Even if detect fails, you'll still have three assemblies to call variants from, so it should still find homozygous variants. In a real sample, you would see a decrease in sensitivity for heterozygous SVs.

You don't need to re-run the alignments, so remove/archive the rest (keeping the "reference" and "align" directories), then re-run the pipeline from there. It will pick up after align, then go through detect, assemble, and call.
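
A sketch of that re-run, using the directory names and options mentioned in this thread (the values are illustrative; adjust them, and add back whatever other options you used in the original run):

```
# Keep reference/ and align/, set the downstream results aside
mkdir -p archive_run1
mv detect assemble call archive_run1/

# Re-run with a longer per-window assembly timeout, smaller assembly groups,
# and detect thresholds raised for high coverage
smrtsv --tempdir ${TMP} --verbose run \
    --asm-rt "6h" --candidate-group-size 200000 \
    --max-coverage 250 --max-support 250 \
    ${reference} ${inputs}
```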

If you want to run a series of quick tests to tune your parameters, you can generate a BED file covering all but a few regions of interest (add at least an 80 kbp flank around them) and give this to SMRT-SV with "--exclude". That will make SMRT-SV ignore everything but your regions, and the assemblies will go much faster.
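
One way to build that exclude file, assuming bedtools, a chromosome-sizes file, and a regions.bed listing the loci you want to keep (all placeholder names):

```
# Pad each region of interest by 80 kbp on both sides, then take the
# complement: everything outside the padded regions gets excluded
bedtools slop -i regions.bed -g genome.chrom.sizes -b 80000 \
    | sort -k1,1 -k2,2n \
    | bedtools complement -i stdin -g genome.chrom.sizes > exclude.bed
```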

AyushSaxena commented 4 years ago

OK, I have a few more questions/comments so I can wrap my head around this (I promise I read the manual!):

  1. This was the script I was running:

```
smrtsv --tempdir ${TMP} --verbose run --runjobs 10,10,5,5 --batches 16 --threads 64 --species Elegans \
    --asm-polish arrow --min-support 3 --min-hardstop-support 3 \
    --asm-cpu 12 --asm-mem 20G --min-length 30 --sample PB306.ST1.WS270 \
    --asm-rt "01:10:00:00" --asm-group-rt "02:00:00:00" \
    ${reference} ${inputs}
```

I was using a fairly lenient value of --asm-rt to begin with (1d), and I had set --asm-group-rt to 2d (I didn't know the default was 3d). I was under the impression that if an assembly fails for lack of time, I'd get some sort of error message in "contig_group.log" under "assemble/group/gr-III-1000000-1000000/" (say). A common error I used to get was "slurmstepd: error: _is_a_lwp: open() /proc/4793/status failed: No such file or directory", which the local cluster support helped me resolve by explaining that my scripts were running out of time/memory; relaxing those constraints during assembly has solved that problem entirely (even though it's a bit more inefficient for the rest of the groups). I haven't observed any errors in "contig_group.log" that look like an assembly error caused by coverage.

  1. "Even if detect fails, you'll still have three assemblies to call variants from, so it should still find homozygous variants" I don't think I properly understand this statement. Are you saying that SMRTSV tiles the entire genome in 20kb windows, and adds extra windows if the detect step finds something? I was under the impression that local assembly is conditioned of finding something in the detect step. But you did make a comment a couple of days ago that that it does it over the entire genome, so II just want to confirm it.

In that case, if I were to look into a megabase-sized group, I should be able to find the same locus within the group in at least three assembly windows? By extension, for a 100-Mb genome, we'd expect roughly 100 1-Mb groups? [A possibly relevant observation: the first sample I ran, for which I've shared observations here, is 224x and only had 24 assembly groups, while my other samples, which are about 80-100x, are giving me 105-112 assembly groups.]

  1. "It's clearly a dropout in the assemblies, and I think it's because of your read depth." Just to idiot proof my understanding of this - I have too high a coverage, correct? If 224x is an overkill, I could always sub-sample and run a smaller batch (which does not offend my scientific conscience in this case, as long as the tool works fine).

  2. "For the detect step, you may want to tune the parameters for high coverage. The defaults ("--max-coverage" = "100", "--max-support" = "100", "--min-coverage" = "5", "--min-hardstop-support" = "11", "--min-support" = "5")" IF the max coverage is set to 100, does that mean the we are subsampling a 100x in the region, or is detection simply skipping any region past a 100x?

  5. I've also observed (viewing contigs.bam in IGV) that some regions with missing contigs were the result of failed assemblies that the program didn't stop for. One of the errors I saw is "Assembly error: Return code = 1". I had completely missed it because it only happened in one of the 60-kb regions of the group, and when I skimmed the log file, it looked OK.

Thank you for allowing me to pick your brain. This is an excellent tool and this conversation is quite illuminating. Ayush

paudano commented 4 years ago

1. In the human genome, there are very large tandem repeat arrays, especially in centromeres and pericentromeric regions. Trying to assemble these will cause Canu to take many times longer, and will result in failures and misassemblies anyway. So, SMRT-SV limits the amount of time it spends on one assembly, and if it times out, then it moves on. Otherwise, the whole pipeline would stall trying to assemble regions that provide no useful data.

When the timeouts happen, you should see "Assembly timeout" followed by the timeout limit in the group log assemble/group/GROUP-ID/contig_group.log.

Very high coverage may also cause the assemblies to take much longer, so they may time out, and you may need to adjust for it.

2. Yes, it tiles over the whole genome and adds windows where it finds signatures. It's good to adjust the parameters so signature detection is working. When signature detection (detect step) fails, the pipeline doesn't fail, but sensitivity suffers. Coverage shouldn't affect the number of groups.

3. Sure, you could subsample to 80-100x and try settings closer to the defaults (one way to do that is sketched after this list).

4. It's going to throw out detect events with ultra-high coverage because it will think they are paralogous artifacts (high-identity copies mis-mapped to higher than expected coverage).

5. You should see an error message from Canu in the group log for that event (e.g. "Failed with exit code 1. (rc=256)"). Canu should crash or fail with an error message.
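
For point 3, one common way to subsample reads if they are in BAM form (samtools is not part of SMRT-SV, and the seed/fraction below are only an example):

```
# -s SEED.FRACTION keeps a random fraction of reads:
# integer part = random seed, fractional part = fraction kept
# (0.40 of ~224x is roughly 90x)
samtools view -b -s 42.40 subreads.bam > subreads.90x.bam
```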

AyushSaxena commented 4 years ago

Thank you,

I've been able to assemble all 105 groups (100-Mb C. elegans genome) for my 224x sample by:

  1. subsampling, which worked fairly well, and
  2. using --max-coverage 250 --max-support 250 with my full 224x data.

When I didn't use --max-coverage 250 --max-support 250 (I believe the default is 100), only about 25% of the genome was covered by at least one contig. I also did find assembly errors specific to individual 60-kb windows when I took a deep dive into the assembly logs, which could be down to high coverage or lack of time.

Thank you for helping me out. I still don't think I quite get every little quirk of the program, but I've been able to recover the mutations I put into the reference genome and confirm them by eye in IGV.

Ayush