adamewing / bamsurgeon

tools for adding mutations to existing .bam files, used for testing mutation callers
MIT License
231 stars 86 forks source link

addsv.py Spike-Ins #115

Closed charliecurnin closed 6 years ago

charliecurnin commented 6 years ago

Hi Adam,

I tried to spike-in one indel per size [149, 1000] bp in my 151 bp read WGS BAM.

python $addsv -v $svs -f $bam -p 8 -r $fasta --aligner mem -o $o

According to the VCF I genereated with makevcf_sv.py, 544 out of 852 mutations were made. A few things to note:

  1. The sizes reported in the VCF in theSVLEN tag (for deletions) are not what I would expect. Of the 270 deletions that were spiked-in successfully, 56 have an SVLEN=1800, 53 have an SVLEN=1799. Are the deletions that are being spiked-in actually this size? Or are they being misreported. I'll look more into this too.

  2. 264 of the 852 mutations in my $svs file encountered the error best contig too short to make mutation!. It looks like in this issue you recommend lowering the -l value, so I'll try that?

  3. Of the other 44 that weren't spiked in: 8 have generated no contigs, skipping site; 32 have contig dropped due to ambiguous base (N), aborting mutation... and the rest are a bunch of different errors. I'll look at these sites to see if they're just bad places to spike-in.

  4. Of the mutations that were spiked-in and reported in the VCF, most look great. At the very least, something clearly happened at each of them. Many of the insertions, though, look kind of like less drastic versions of the insertion we discussed in this issue: a sharp drop or increase in coverage, and asymmetric soft-clipping. A few examples:

chr22_31 114 309_31 115 309

chr22_27 489 550_27 490 550

chr22_27 158 039_27 159 039

chr22_26 346 972_26 347 972

adamewing commented 6 years ago
  1. I think SVLEN is referring to the size of the window passed in the variant file or the size of the contig and not the insertion itself.

  2. Before decreasing -l, try increasing the size of the windows in the input variant file.

  3. Problems with contig generation is most often due to repetitive sequences (recent transposons, satellites, homopolymers, etc) where you're trying to mutate.

  4. I think these look pretty good, there's a bit of a coverage 'shelf' in some of them but I see this in real insertions sometimes as well. I do notice that a lot of the altered reads are mapping as improper pairs - you might have a look at the insert size of your BAM file and adjust --ismean and --issd to match.

charliecurnin commented 6 years ago

Re. 4: I tried to find the mean and sd in insert size with awk '{ if ($9 > 0) { N+=1; S+=$9; S2+=$9*$9 }} END { M=S/N; print "n="N", mean="M", stdev="sqrt ((S2-M*M*N)/(N-1))}'.

Calculating across the entire $bam gave mean=2618.68, stdev=173786. Running addsv.py with this --ismean and --issd resulted in no successful mutations. Most of the spike-ins seem to results in [wgsim_core] skip sequence 'target' as it is shorter than 523976!. But interestingly, makevcf_sv.py doesn't appear to be sensitive to these errors as it output a vcf with ~543 variants.

Calculating after head -n 10000 gave mean=552.171, stdev=8074.31. Running addsv.py with these settings, too, resultsed in no succesful mutations and the same [wgsim_core] errors.

Is this because the insert size and sd I'm specifying are wrong?

adamewing commented 6 years ago

Hi, sorry, it's best to calculate insert size with a tool like CollectInsertSizeMetrics from the picard suite to avoid the metrics being driven by outliers. My guess is it's not going to be too far off the defaults.

charliecurnin commented 6 years ago

OK, so Picard reported --ismean 356 --issd 116. Here are the results from the same insertion (the last one appears to have been shifted a bit). I think they look improved? There's fewer blue reads—those are the improper pairs?

chr22_31 114 309_31 115 309

chr22_27 489 550_27 490 550

chr22_27 158 039_27 159 039

chr22_26 347 070_26 348 070

adamewing commented 6 years ago

Yes hard to say, was just a thought. How did you go with increasing the window size in the variant file?

charliecurnin commented 6 years ago

Tried increasing the window by 500bp.

chr22_31 114 309_31 115 309

chr22_27 489 550_27 490 550

chr22_27 158 039_27 159 039

The last mutation was now not made: best contig too short to make mutation!

adamewing commented 6 years ago

Ah. Drat. Usually, the bigger the window size, the better the chance it has to assemble a contig that's long enough to spike-in a mutation. As with indels, the expedient thing to do if you don't need specific sites is to request more mutations than you need and the "truth" VCF will only reflect mutations that were actually made.

charliecurnin commented 6 years ago

Got it! Thanks.