JaneliaSciComp / msg

Multiplexed Shotgun Genotyping
http://genomics.princeton.edu/AndolfattoLab/MSG.html
11 stars 12 forks source link

Compilation of simple fixes and first attempt at BWA-MEM support #42

Closed YourePrettyGood closed 7 years ago

YourePrettyGood commented 8 years ago

Hi David, Greg, Here's a compilation of a few simple bug fixes as well as the (conflict-adjusted) commits for incorporating BWA-MEM into both parental genome updating and read alignment for individuals from the cross.

The bug fixes include:

Added features: MSG now supports modern versions of Perl, Python, Pysam, and R. Previous versions of MSG also supported modern versions of BWA, since modern BWA still has 'aln' and the syntax for indexing hasn't changed.

I'm currently testing the BWA-MEM implementation, but the testing also incorporates several other larger changes (mostly increased logging, and generalization of cluster submission so that we can use it with SLURM, and others might be able to use it with PBS or LSF, etc.). So we can obviously wait on merging the pull request until that's done. I'll try an equivalent run using this branch and cluster=0 (since cluster=1 won't work on our SLURM clusters) to validate the changes afterwards.

Best regards, Patrick Reilly

P.S. We may be able to create a more comprehensive Makefile to simplify installation of MSG and dependencies, since (pending testing results) these commits should make MSG compatible with all modern versions of its dependencies, with the exception of samtools (and samtools is easy to install locally).

P.P.S. This pull request does not include some significant memory usage reductions in summaryPlots.R. It'll take some time to rejigger those changes into this branch and then create a pull request for them. Still figuring this whole Git thing out...

dstern commented 8 years ago

Patrick, This sounds great. Nice work. Let me know once you confirm that bwa-mem works and I will commit these changes. Cheers, David

On Aug 28, 2016, at 8:20 PM, Patrick Reilly notifications@github.com wrote:

Hi David, Greg, Here's a compilation of a few simple bug fixes as well as the (conflict-adjusted) commits for incorporating BWA-MEM into both parental genome updating and read alignment for individuals from the cross. The bug fixes include:

A couple of typos in extract_ref_alleles.py (Elyssa Garza encountered this error indirectly) Fixing deprecated Perl syntax in parse_BCdata2BWA.pl (Jason Bosch encountered this error, due to a syntax change as of Perl 5.14.0) Fixing deprecated R syntax in summaryPlots.R (Elyssa Garza encountered this error, due to a syntax change from R 2 to R 3) Fixing buggy grep syntax in write-hmm-data.R (We encountered this error at Princeton, due to an argument interpretation "bugfix" in bash or grep, forget which version) Very minor bug in reading FASTAs with headers that can potentially be interpreted as "False" values (e.g. 0) found in Utils.pm

Added features: MSG now supports modern versions of Perl, Python, Pysam, and R. Previous versions of MSG also supported modern versions of BWA, since modern BWA still has 'aln' and the syntax for indexing hasn't changed. I'm currently testing the BWA-MEM implementation, but the testing also incorporates several other larger changes (mostly increased logging, and generalization of cluster submission so that we can use it with SLURM, and others might be able to use it with PBS or LSF, etc.). So we can obviously wait on merging the pull request until that's done. I'll try an equivalent run using this branch and cluster=0 (since cluster=1 won't work on our SLURM clusters) to validate the changes afterwards. Best regards, Patrick Reilly P.S. We may be able to create a more comprehensive Makefile to simplify installation of MSG and dependencies, since (pending testing results) these commits should make MSG compatible with all modern versions of its dependencies, with the exception of samtools (and samtools is easy to install locally). P.P.S. This pull request does not include some significant memory usage reductions in summaryPlots.R. It'll take some time to rejigger those changes into this branch and then create a pull request for them. Still figuring this whole Git thing out...

You can view, comment on, or merge this pull request online at:   https://github.com/JaneliaSciComp/msg/pull/42 Commit Summary

Merge pull request #1 from JaneliaSciComp/master Bug fix for structurally-varying parentals Bug fix for FASTA files with a sequence labeled "0" Minor grep bug fix Fix for new versions of rainbow Fix for Perl >= 5.14 First attempt at adding support for BWA-MEM Python line wrapping issue with BWA-MEM incorporation File Changes

M Utils.pm (20) M extract-ref-alleles.py (319) M filter-sam.py (62) M hmmlib.R (6) M msg.pl (8) M parse_BCdata2BWA.pl (4) M parse_and_map.py (21) M write-hmm-data.R (2) Patch Links:

https://github.com/JaneliaSciComp/msg/pull/42.patch https://github.com/JaneliaSciComp/msg/pull/42.diff

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

{"api_version":"1.0","publisher":{"api_key":"05dde50f1d1a384dd78767c55493e4bb","name":"GitHub"},"entity":{"external_key":"github/JaneliaSciComp/msg","title":"JaneliaSciComp/msg","subtitle":"GitHub repository","main_image_url":"https://cloud.githubusercontent.com/assets/143418/17495839/a5054eac-5d88-11e6-95fc-7290892c7bb5.png","avatar_image_url":"https://cloud.githubusercontent.com/assets/143418/15842166/7c72db34-2c0b-11e6-9aed-b52498112777.png","action":{"name":"Open in GitHub","url":"https://github.com/JaneliaSciComp/msg"}},"updates":{"snippets":[{"icon":"DESCRIPTION","message":"Compilation of simple fixes and first attempt at BWA-MEM support (#42)"}],"action":{"name":"View Pull Request","url":"https://github.com/JaneliaSciComp/msg/pull/42"}}}

David Stern

Group Leader

Janelia Research Campus

19700 Helix Drive

Ashburn, VA 20147

www.janelia.org/lab/stern-lab

HowToGiveATalk.com

@David_L_Stern

YourePrettyGood commented 8 years ago

Hi David, I need to add in a couple of commits to this pull request, as there were a few bugs I missed in review. One was particularly egregious (the sign of the comparison used for AS-XS was flipped, so before that commit, MSG was keeping the repetitive reads and filtering the unique ones), the others were simple oversights.

I've tested these commits on my master branch, and got the following results for the Toy example dataset: aln_indivA12_AATAAG-hmmprob.pdf aln_indivE2_CAGCCG-hmmprob.pdf aln_indivE4_TAGGAG-hmmprob.pdf aln_indivE8_CTAACG-hmmprob.pdf aln_indivE12_GTATCG-hmmprob.pdf aln_indivF11_GTTACG-hmmprob.pdf aln_indivG7_CTTGCG-hmmprob.pdf aln_indivH6_GTCCGG-hmmprob.pdf aln_indivH9_GTCGCG-hmmprob.pdf aln_indivH12_TTGACG-hmmprob.pdf mem_indivA12_AATAAG-hmmprob.pdf mem_indivE2_CAGCCG-hmmprob.pdf mem_indivE4_TAGGAG-hmmprob.pdf mem_indivE8_CTAACG-hmmprob.pdf mem_indivE12_GTATCG-hmmprob.pdf mem_indivF11_GTTACG-hmmprob.pdf mem_indivG7_CTTGCG-hmmprob.pdf mem_indivH6_GTCCGG-hmmprob.pdf mem_indivH9_GTCGCG-hmmprob.pdf mem_indivH12_TTGACG-hmmprob.pdf

One of the 10 individuals has a different tract pattern between aln and mem, and updating of the parental genomes was slightly different: Updating results for MEM run: 85,247 sites updated in par1 84,285 sites updated in par2 Updating results for aln run: 85,216 sites updated in par1 83,923 sites updated in par2

As a note, I didn't use Stampy for updating, but instead used bwa alone, so the results for both cases differ slightly in terms of markers from the results found here: https://github.com/JaneliaSciComp/msg/tree/master/example_MSG_toy/example_results

Peter mentioned we don't truly know the ancestral states of these toy example individuals, and recommended three other tests of mem vs. aln. Let me know what your take is on them, and any other tests to do. 1) Test different values for the AS-XS threshold on the toy example to see if different read filtering makes mem look more like aln 2) Run both MSG on real parental data (we have some for Dsim w501 and Dsim MD221), since these have known ancestral states 3) Simulate some data using simMSG, and see how well the inferences from mem vs. aln fit the simulation inputs

I'll try the simple-fixes branch with cluster=0 on the same cluster, just to make sure these changes work without the other changes I made for logging and SLURM-compatibility.

Version notes: Perl 5.10.1 Python 2.7.3 Pysam 0.9.0 six 1.9.0 Biopython 1.59 R 3.2.3 R.methodsS3 v1.7.1 R.oo v1.20.0 zoo v1.7-13 BWA 0.7.12-r1044 Samtools 0.1.9 (r783)

Best regards, Patrick Reilly

dstern commented 8 years ago

Here's an example of failure to plot for bwa aln indivbcsec-LP-2_AACGCCAACTACGAGC-hmmprob.pdf You can see that chr 2, 3, X failed. From the output file, you can see that we got a core dump at "Making pileup for par1 contig 2 samtools: bam_pileup.c:112: resolve_cigar2: Assertion `s->k < c->n_cigar' failed. msg/make-pileups.sh: line 54: 20592 Aborted (core dumped) samtools pileup -Bcf $parent1 $file-$ref-sorted.bam > $file-$ref-sorted.pileup" etc. See full file here

msgRun2.8974.o59506430.133.txt

The fastq and sam files are below. indivbcsec-LP-2_AACGCCAACTACGAGC.gz aln_indivbcsec-LP-2_AACGCCAACTACGAGC_par1.sam.gz

dstern commented 8 years ago

the bwa mem error is different from bwa aln error. Both failed with this error Command 'samtools view -Sh -q 20 -o ./reads.fq_sam_files/aln_indivbcsim-LP-52_GGAGGTTTTCGAAGCT_par2.sam.mapq_filtered.sam ./reads.fq_sam_files/aln_indivbcsim-LP-52_GGAGGTTTTCGAAGCT_par2.sam' returned non-zero exit status -6

Failure was apparently in generating the "_noncommon.reads.removed.sam" file or perhaps in completing production of "_sam.mapq_filtered.sam" file, though I checked the "*sam.mapq_filtered.sam" files and they are identical to the original sam file.

Here's an example sam file that failed

aln_indivbcsim-LP-52_GGAGGTTTTCGAAGCT_par1.sam.gz

YourePrettyGood commented 8 years ago

Hmm, this is peculiar. I can't find any cases of CIGAR strings with anything other than the M, I, and D operations in the aln SAM file (I would have expected S operations, but maybe that's not on by default like it is for mem). I also can't find anything odd in the mem SAM file with regards to CIGAR strings, and ran the command specified using samtools 0.1.9 without error (exit status was 0 based on echo $? after running the command samtools view -Sh -q 20 -o test.sam aln_indivbcsim-LP-52_GGAGGTTTTCGAAGCT_par1.sam.gz). And, like you said, the MAPQ filtering didn't do anything, as confirmed by diff -q <(zcat aln_indivbcsim-LP-52_GGAGGTTTTCGAAGCT_par1.sam.gz) <(cat test.sam).

So for the first error, would you mind sending the par1 reference FASTA? I should be able to recreate the sorted BAM used in the pileup call using that reference FASTA, and re-running that pileup command should help narrow down the problem.

And for the second error, would you mind sending the par2 SAM? It looks like the par1 SAM got uploaded there, but the error message is for par2.

dstern commented 8 years ago

On Sep 6, 2016, at 12:45 PM, Patrick Reilly notifications@github.com wrote:

Hmm, this is peculiar. I can't find any cases of CIGAR strings with anything other than the M, I, and D operations in the aln SAM file (I would have expected S operations, but maybe that's not on by default like it is for mem). I also can't find anything odd in the mem SAM file with regards to CIGAR strings, and ran the command specified using samtools 0.1.9 without error (exit status was 0 based on echo $? after running the command samtools view -Sh -q 20 -o test.sam aln_indivbcsim-LP-52_GGAGGTTTTCGAAGCT_par1.sam.gz). And, like you said, the MAPQ filtering didn't do anything, as confirmed by diff -q <(zcat aln_indivbcsim-LP-52_GGAGGTTTTCGAAGCT_par1.sam.gz) <(cat test.sam). So for the first error, would you mind sending the par1 reference FASTA? I should be able to recreate the sorted BAM used in the pileup call using that reference FASTA, and re-running that pileup command should help narrow down the problem.

Here’s the link

https://www.dropbox.com/s/qrmz1jkwmpmx0ua/parent1_ref.fa.gz?dl=0

And for the second error, would you mind sending the par2 SAM? It looks like the par1 SAM got uploaded there, but the error message is for par2.

https://www.dropbox.com/s/38zd35tgxjyge4o/aln_indivbcsim-LP-52_GGAGGTTTTCGAAGCT_par2.sam.gz?dl=0

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub, or mute the thread.

{"api_version":"1.0","publisher":{"api_key":"05dde50f1d1a384dd78767c55493e4bb","name":"GitHub"},"entity":{"external_key":"github/JaneliaSciComp/msg","title":"JaneliaSciComp/msg","subtitle":"GitHub repository","main_image_url":"https://cloud.githubusercontent.com/assets/143418/17495839/a5054eac-5d88-11e6-95fc-7290892c7bb5.png","avatar_image_url":"https://cloud.githubusercontent.com/assets/143418/15842166/7c72db34-2c0b-11e6-9aed-b52498112777.png","action":{"name":"Open in GitHub","url":"https://github.com/JaneliaSciComp/msg"}},"updates":{"snippets":[{"icon":"PERSON","message":"@YourePrettyGood in #42: Hmm, this is peculiar. I can't find any cases of CIGAR strings with anything other than the M, I, and D operations in the aln SAM file (I would have expected S operations, but maybe that's not on by default like it is for mem). I also can't find anything odd in the mem SAM file with regards to CIGAR strings, and ran the command specified using samtools 0.1.9 without error (exit status was 0 based on echo $? after running the command samtools view -Sh -q 20 -o test.sam aln_indivbcsim-LP-52_GGAGGTTTTCGAAGCT_par1.sam.gz). And, like you said, the MAPQ filtering didn't do anything, as confirmed by diff -q \u003c(zcat aln_indivbcsim-LP-52_GGAGGTTTTCGAAGCT_par1.sam.gz) \u003c(cat test.sam).\r\n\r\nSo for the first error, would you mind sending the par1 reference FASTA? I should be able to recreate the sorted BAM used in the pileup call using that reference FASTA, and re-running that pileup command should help narrow down the problem.\r\n\r\nAnd for the second error, would you mind sending the par2 SAM? It looks like the par1 SAM got uploaded there, but the error message is for par2."}],"action":{"name":"View Pull Request","url":"https://github.com/JaneliaSciComp/msg/pull/42#issuecomment-245012855"}}}

David Stern

Group Leader

Janelia Research Campus

19700 Helix Drive

Ashburn, VA 20147

www.janelia.org/lab/stern-lab

HowToGiveATalk.com

@David_L_Stern

YourePrettyGood commented 8 years ago

Okay, for the mem error, there's something really really flukey going on there. I reran the samtools view command for the par2 sam, and it gave this: samtools view -Sh -q 20 -o test_par2.sam aln_indivbcsim-LP-52_GGAGGTTTTCGAAGCT_par2.sam.gz [samopen] SAM header is present: 19601 sequences. Parse error at line 256499: CIGAR and sequence length are inconsistent Aborted (core dumped)

Then taking a look at that line of the SAM: zcat aln_indivbcsim-LP-52_GGAGGTTTTCGAAGCT_par2.sam.gz | awk 'NR==256499' HWI-ST651:525:HKM7YBCXX:2:2109:9137:10777 0 2 0 2 268435455S101S * 0 0 AATTAAAGAAAGCAAATGTACAAATTAAAGTAAGAAATTTTGAAAAGTGTTTTCCCGTCGGGAAAACCAAATTTTTGAAAGTGCCAAAATTTGAAACGAG GGGGGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIG NM:i:8388607 AS:i:50 XS:i:47

That alignment is so incredibly wrong, I have no idea how to explain it. How can you have 8 million mismatches in an alignment? And how do you have 268 Mb of sequence soft-clipped, and then another 101 bases soft-clipped, let alone having no aligned sequence there...

That's a curious issue, but is most likely a BWA error. Have you re-run that one from scratch already? Maybe it was just a transient issue.

I'll keep working on the aln issue though.

YourePrettyGood commented 8 years ago

So after some detective work, it looks like the aln and mem issue may be a bug in samtools 0.1.9 that was fixed in samtools 0.1.10. Or at least that's what I'm reading into line 525 of this Changelog: https://github.com/lh3/samtools/blob/master/ChangeLog

The error-causing alignment here is: HWI-ST651:525:HKM7YBCXX:2:2201:11841:75841 16 2 18740186 37 2D4M3I93M * 0 0 GAGACAGTGGATGGCCAGTTGAGGCTTTCCACTTGACATGGCAGCGGGAAATTTATTAGTTGTGTCAGCCGAGTGTCAAACGAAATCAGCATTAAAACCC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGGGGG XT:A:U NM:i:6 X0:i:1 X1:i:0 XM:i:4 XO:i:1 XG:i:1 MD:Z:0^AT1G95

The CIGAR string starts with a D, which is what causes samtools pileup to fail.

There are 2 potential workarounds: 1) Update to samtools 0.1.10 (r841), which should still have pileup, but has this samtools pileup bug fixed 2) Remove the causal reads/alignments from consideration

Option 2 seems much more computationally intensive, but might have simple implementation. A rough sketch of the implementation might be: awk '/^@/{print;}!/^@/{if ($6 !~ /^[0-9]+D/) {print;}; }' [SAM file here] > [revised SAM file here]

That might be used as a pre-filter between mapping and extract-ref-alleles.py.

However, option 1 seems simpler to do for your case.

Just another reason it'd be good to switch MSG over to samtools mpileup (and modern versions of samtools).