brentp / bwa-meth

fast and accurate alignment of BS-Seq reads using bwa-mem and a 3-letter genome
https://arxiv.org/abs/1401.1129
MIT License
143 stars 54 forks source link

paired reads not recoqnized as such #27

Closed cb4github closed 8 years ago

cb4github commented 8 years ago

Dear brentp,

Here's my SLURM script:

!/bin/bash

SBATCH --job-name=bwa-meth ### Job Name

SBATCH --output=bwa-meth24hr.out ### File in which to store job output

SBATCH --error=bwa-meth24hr.err ### File in which to store job error messages

SBATCH --qos=normal ### Quality of Service (like a queue in PBS)

SBATCH --time=0-24:00:00 ### Wall clock time limit in Days-HH:MM:SS

SBATCH --nodes=1 ### Node count required for the job

SBATCH --ntasks-per-node=20 ### Nuber of tasks to be launched per Node

echo pwd=pwd module load python/2.7.11 module load samtools REFERENCE=/lustre/project/hpcstaff/user/OryziasLatipes/GCF_000313675.1_ASM31367v1_genomic.fna.bgz echo Generating index for $REFERENCE time bwameth.py index $REFERENCE echo Analyzing methylation.. time bwameth.py --reference $REFERENCE ../sra-toolkit/SRR2050320_1.fastq.gz ../sra-toolkit/SRR2050320_2.fastq.gz -t 20 --read-group $'@RG\tID:SRR2050320\tSM:SRR2050320' echo Analysis succeeded.

...and here's the result of my alignment check... 37067603 + 3256037 in total (QC-passed reads + QC-failed reads) 63598 + 967992 secondary 0 + 0 supplimentary 0 + 0 duplicates 35373511 + 3256037 mapped (95.43%:100.00%) 0 + 0 paired in sequencing 0 + 0 read1 0 + 0 read2 0 + 0 properly paired (-nan%:-nan%) 0 + 0 with itself and mate mapped 0 + 0 singletons (-nan%:-nan%) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5)

Q: This says zero reads paired, true? Is this correct?

Please let me know if you need more info, thanks.

Best, CB

brentp commented 8 years ago

yes, it says 0 reads paired. What does the bwa mem output look like? it should have lots of indicators of how it was mapping, what's the insert size and pair orientation.

cb4github commented 8 years ago

By "bwa mem output" are you referring to the log output - or the bam file?

If log, then I see no indication of pairing. Here is sample log output...

[M::process] 505112 single-end sequences; 0 paired-end sequences [M::mem_process_seqs] Processed 505112 reads in 1528.182 CPU sec, 80.100 real sec

If bam, then I'm not sure what more to include other than the alignment check above.

FYI, I'm not that familiar/facile with samtools, but for now the following presumably shows a count of the paired reads from my CLI...

$ samtools view -c -F 0x4 bwa-meth.bam 38629548

Again, please let me know if I can provide more info - perhaps more specific CLI output - and thanks for all you do!

Best, CB

brentp commented 8 years ago

So you have no paired reads. I would guess that something is wrong with your 2nd fastq file. You can try just aligning that with bwa-meth and see what happens.

It could also be that your reads are not from the directional (stranded) protocol and so you might have to use an aligner that supports that. It's most common that they would be directional, but if these are from an old SRA archive, then they might not be.

cb4github commented 8 years ago

Thanks, but the data is from a fairly(?) recent study (http://www.nature.com/articles/srep21317), and thus I would expect stranded protocol, true? Is there anything else I can do to get bwa mem to recognize paired end reads?

FWIW, here are sample records from the read files...

$ zcat SRR20503201.fastq.gz | head @SRR2050320.1.1 1 length=305 TTTATGAAATGTAAAGAGAGGGAGNAAGGGGGGTANAATATTNTGGANNTGAAGAGTGNNAGATTATTANANACGANNGGATNTANATNNNNGGTNTTAAATTNTTTTATTAGGGTTANTTATANTAAAANNNTAGTTATTTTAGTTGAAGTGTTATAATTAAAGTTGTATGTGGGGTGGGGGGTATTTGTTTTTTTTGTTTTTTTTTATTTTTGTTTTTGGTTGAAAGAAATGTTTATGAGATAGNTNNGNNNTTTTTGTANTTTGTATTTTATAATTAAANGGCGTNTTAAGAAAATTATCTT +SRR2050320.1.1 1 length=305 CCCCCGGGGGGGGGGGGGGGGGGG#=CFGGGGGGG#:CFGGF#:BFG##::=FGGGGG##:BFGGGGFG#:#:BFF##::FF#:B#:B####::B#:BFFFGF#:@FFFGGGGGGGGG#8@FFC#8@FFG###66>=CEGGGGAFFFGFFGFEGGGGGGGDGFGGGGGGCEGGGGGC=BEGDGGGCFFGGFFCFCEEEGGGGGEGG=EGGFGGGGGFFGGGDDFDF769<>FFFGFFFFF4@FFF#)##(###0--:<9:2#--8?:AFFFFFFFFA<)64#((-,-#-4::?941,48<:><< @SRR2050320.2.1 2 length=307 TGGGGATTGGGTTTTAATGAGATGNAAGAAATTATNGNTTTGNAAGTNNATATATGGANNATATTAGATNTNTTAGNNTAGGNTANTTNNNNTAANATTTAAGNGTTTTGAACGTATGNAGTTTNTTTTGNNNTAATAGGTGTTTACAATTTTTTTTATATAATTTTGTTTATAAAAATAATAGTAAAAAAAAAAAAGAGATTAAGTTTTAAAAATAGATTGTAAGTTATAAGTTAAAAGTTAATANANNANNNTTTAATGTNTTGAGTGTATTTTTTAATANATAANNAAATANGAGTTGATATTA +SRR2050320.2.1 2 length=307 CCCCCGG@E@F@FFGGGFAF9FFE#=C8CFFFFF<#6#::D,#9:<C##::CFFCFCD##::CCFDFCE#:#::C<##:96,#6,#::####6::#96CDDD9#,:B==,:AFEFBD,#49BFF#:AFF+###:44A+,,CEFFEEFFAFFFDFFFGGFCFEFDFGF;DDFCCDF;@9ECDGFCDEAAC9:CC:C=:__,,491,,2+<@979C4;7+111_277:+1177:+32+++++*+37++#0##(###.-(,-)..#((-((.-5))4)-.(.3))#(,(-##(((-,#((-,())..-). @SRR2050320.3.1 3 length=306 ATAAGATTTAGAGGGAAATAAAAANAAGAAAAAGGNTTAAATNATGGNNTGAAAATATNNAAAGAAATGNANAATANNAAGGNAANAANNNNTGTNTAGAAGGNGAGAGGAAAGAAAGNGTTAGNGTATTNNNAATGAGGTATAGAAAATAGTAAGTGGGAAATAAAAAAAATTGTTTAAAAAGTGGGTAGTTGAAGGTTGAATAAATAGTTATAGAAGTATATTATAGAGAGTGTGGATGAGCGTNTNNTNNNTAAAATTANAGATAGAAAGATAAGAAATNTATANNAAGTTNTGTAGGGATTA

... and the 2nd file...

$ zcat SRR2050320_2.fastq.gz | head @SRR2050320.1.2 1 length=133 AATNCACTAAACTACTACCACCCTTCACATACATAAAAAATACAATTAAATATTTTATCCATATAAAATATNCTATTATTAATTTTAATACTACATATCNNNNNNNNNNNNNNNNNNNNNNNNNNNTANNNTT +SRR2050320.1.2 1 length=133 CCC#8:DDFCFGGGGGGGGGG@FF@EFDFEGGGGGGFCF=FFGGGAEE=D,DD,44D>CEFEGCEGAGGF8#91==1EF8,=@,3CC?FGG8F,=D,++###########################05###)) @SRR2050320.2.2 2 length=107 TACNTCAATATTTCCAACAAACCAATATTTCAAAATATCCCCTAAAACCACTTTTAAAAATCGTTTTGTTANATAAAATACTAAATAAAAAATAAACTTNNNNNNNN +SRR2050320.2.2 2 length=107 CCC#8CFGFGFGGGGGGFGGGGGGGGFDEFEFGEGGGGGGGGGGFDGGGFFE@9=DDDEDGGG@=DCFDGF#99A68BFFGGGGGFDFG8D>?DDFFFF######## @SRR2050320.3.2 3 length=183 AAANCTTCGACCAGATCGAAAAATACACCGTCCTTCGAAAACCTAATAAAACCAATAAAAATCAAAACCCANACATTTTCCACCTATTCTCCCTCTTAANNNNNNNNNNNNNNNNNNNNNNNNNNNTANNNTTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCNNTNNNNNNNNNNNN

Also, per zcat both files have the same number of records - 78584100.

Best, CB

brentp commented 8 years ago

oh, well that makes it easy. bwa mem will require that both reads have the same name. So you'll have to do some mods to the files to make sure. In your case, you have e.g.: @SRR2050320.1.1 1 length=305 in the first file paired with: @SRR2050320.1.2 1 length=133 in the 2nd file

those will have to be changed to, e.g. just : @SRR2050320.1

cb4github commented 8 years ago

Thanks, but just to confirm - the bwa-meth pipeline will feed to bwa mem the converted reads prospectively all identically named @SRR2050320 (vice pair-wise e.g. @SRR2050320.1, @SRR2050320.2, etc.), true?

Also, as such will the paired end reads be interleaved presumably so that bwa mem will consider the pairs of adjacent reads as being paired?

Best, CB

brentp commented 8 years ago

I made an edit to the comment to use the ".$n" suffix where $n is the pair number. So both ends of the read will be: @SRR2050320.1 or @SRR2050320.12322, for the 1st pair and the 12322 pair. And yes, those will be interleaved and sent to bwa mem

cb4github commented 8 years ago

@brentp I've renamed the reads per the following bash function...

function rename-reads () {
    perl -wpe 's/(SRR2050320\.\d+)\.\d+/$1/' <(zcat $1) | gzip -c > $2
}

...where $1 is the input file name and $2 is the output file name, but the resulting read files resulted in...

Exception: bad or empty fastqs

or perhaps same as error 21 again. Any idea why?

For now, I am currently downloading the read files yet again via the sra-toolkit - this time without the read id appended to the read name, i.e. @SRR2050320.12322 vice @SRR2050320.12322.1.

I'll then retry bwa-meth with that latest download and report back, thanks.

cb4github commented 8 years ago

@brentp Ok, I'm now getting Exception: bad or empty fastqs with said latest read files - with read names e.g. @SRR2050320.12322 vice @SRR2050320.12322.1.

Also, just looking at the first few records, I see no difference between said latest read files and the ones that I generated from said rename-reads function.

Any suggestion re debug would be appreciated, thanks.

brentp commented 8 years ago

can you paste the first 20 lines in each fastq?

cb4github commented 8 years ago

Per request. I've numbered the lines - hope it helps - if not I can repost without numbering. $ zcat SRR2050320_1.fastq.gz | head -20 | tail -n +1 | nl

     1  @SRR2050320.1 1 length=305
     2  TTTATGAAATGTAAAGAGAGGGAGNAAGGGGGGTANAATATTNTGGANNTGAAGAGTGNNAGATTATTANANACGANNGGATNTANATNNNNGGTNTTAAATTNTTTTATTAGGGTTANTTATANTAAAANNNTAGTTATTTTAGTTGAAGTGTTATAATTAAAGTTGTATGTGGGGTGGGGGGTATTTGTTTTTTTTGTTTTTTTTTATTTTTGTTTTTGGTTGAAAGAAATGTTTATGAGATAGNTNNGNNNTTTTTGTANTTTGTATTTTATAATTAAANGGCGTNTTAAGAAAATTATCTT
     3  +SRR2050320.1 1 length=305
     4  CCCCCGGGGGGGGGGGGGGGGGGG#=CFGGGGGGG#:CFGGF#:BFG##::=FGGGGG##:BFGGGGFG#:#:BFF##::FF#:B#:B####::B#:BFFFGF#:@FFFGGGGGGGGG#8@FFC#8@FFG###66>=CEGGGGAFFFGFFGFEGGGGGGGDGFGGGGGGCEGGGGGC=BEGDGGGCFFGGFFCFCEEEGGGGGEGG=EGGFGGGGGFFGGGDDFDF769<>FFFGFFFFF4@FFF*#)##(###0--:<9:2#--8?:AFFFFFFFFA<)64#((-,-#-4::?941,48<:><<
     5  @SRR2050320.2 2 length=307
     6  TGGGGATTGGGTTTTAATGAGATGNAAGAAATTATNGNTTTGNAAGTNNATATATGGANNATATTAGATNTNTTAGNNTAGGNTANTTNNNNTAANATTTAAGNGTTTTGAACGTATGNAGTTTNTTTTGNNNTAATAGGTGTTTACAATTTTTTTTATATAATTTTGTTTATAAAAATAATAGTAAAAAAAAAAAAGAGATTAAGTTTTAAAAATAGATTGTAAGTTATAAGTTAAAAGTTAATANANNANNNTTTAATGTNTTGAGTGTATTTTTTAATANATAANNAAATANGAGTTGATATTA
     7  +SRR2050320.2 2 length=307
     8  CCCCCGG@E@F@FFGGGFAF9FFE#=C8CFFFFF<#6#::D,#9:<C##::CFFCFCD##::CCFDFCE#:#::C<##:96,#6,#::####6::#96CDDD9#,:B==,:AFEFBD,#49BFF#:AFF+###:44A+,,CEFFEEFFAFFFDFFFGGFCFEFDFGF;DDFCCDF;@9ECDGFCDEAAC9:CC:C=:**,,491,,2+<@979C4;7+111*277:+1177:+32+++++*+37++#0##(###.-(,-)..#((-((.-5))4)-.(.3))#(,(-##(((-,#((-,())..-).
     9  @SRR2050320.3 3 length=306
    10  ATAAGATTTAGAGGGAAATAAAAANAAGAAAAAGGNTTAAATNATGGNNTGAAAATATNNAAAGAAATGNANAATANNAAGGNAANAANNNNTGTNTAGAAGGNGAGAGGAAAGAAAGNGTTAGNGTATTNNNAATGAGGTATAGAAAATAGTAAGTGGGAAATAAAAAAAATTGTTTAAAAAGTGGGTAGTTGAAGGTTGAATAAATAGTTATAGAAGTATATTATAGAGAGTGTGGATGAGCGTNTNNTNNNTAAAATTANAGATAGAAAGATAAGAAATNTATANNAAGTTNTGTAGGGATTA
    11  +SRR2050320.3 3 length=306
    12  CCCCCGGGGGGGGGGGGGGGGGGG#=DFGGGGGGG#:CFGGF#:CFF##::@FFGGGG##:CFGGGGGG#:#:CFG##:CDF#:C#:C####9::#9CFFGGG#:BFGGGGGGGGGGD#:BFFF#4AFGG###::DFGGGGGGFGGGFGGGGGGFGFGGGGGGGGGGGGGGGGGFGBFGGGGGGCFFFGGGFFFGFGFGGGFEGGFFGGGGGGGGGGGGGGGGGGFGFFGF>DFGFFFFAFFFFF5#.##(###.//67?=8#/6,38AA9>:<AFAFAA?)#(-81##(-442#(.446)8::?F
    13  @SRR2050320.4 4 length=310
    14  ATTTTATTTTCAATGATAAGGTAGNATATAGGATTNGATTTANATGTNNAGTTTAAGANNGTTTTAATTNTNGGATNNAAGANAAGAGNTNNTTTNGATAGGTNATAGTATATGATAAATAGGTNTAGGATANAGGGAAATATAGGTTATAAATGTATGAGGAGGGGAAGTATAGGTGGGAATAATTAGGGATGATTTTTTTTTGTATATTAAATAATTTATTAGAATAAATGAATATATACAACANANNAANNAATTATAGTATTATATAATATATATTTAATAAAANATATAAATAATTTTAAGTAGT
    15  +SRR2050320.4 4 length=310
    16  BCCCCGGCGG<<FGGGGFGCFGGG#:CC@DF9CCC#:CCCFG#:C96##::C,CFGAC##9:C,6CFFE#:#96,C##::6,#,:CD,#:##::=#:69B=8F#:BFDEDGGG<FFFFFG<F8F#::D+AF<#:++4:=FFGGGGFDFGGGGGGGGGGFGCFF+B>@CE8EDCFD9>3@+@8BBFF;E8,@ECC87F:3@EEGG*;,>,B:7<,:2:+=@++++++11=++11+4++=+11186)/#)##(0##.(/23)))15)7)./.93??CF?E).1<)5))))#,(43(.6<)4)).)),).)))
    17  @SRR2050320.5 5 length=310
    18  TTTGTTTTGATTTTGGGTAATAGANTAAAGTTGTTNAGTAAANGAAANNTTTTGTGGANNGAAGTTTGTNTNAGTTNNTTAANTTTTANTNNGAANACGAAGTNTATTTTGTTTTTGTAGAAAANGGAGTTTNGTTGATTTTGTGATCGGTGAAGATTCGGATATAGTTTTTCGTTAGGATATTTTAGAGTCGGATGGTGCGGTTAGACGAGTTTGTAATTACGTAGTTGGAGTTGGGGTGGAAACNGNNGTNNGTGATATTAGTTAAGTGATCGGTAAATATTCGTANGGGTTGGTAGTGATTCGTCGG
    19  +SRR2050320.5 5 length=310
    20  CCCCCGGGGGGGGGGGGGGGGGGG#:CFGGFGGGG#:DFGGG#:CFG##:CFGGGGGG##:CFFFGGGF#:#:CFG##:CFF#:CFGG#:##::D#:CFFGGG#:BFGGGGGGGGGGGGGGGGG#:?DFFFG#:ADFGGGGGGGGGGGGGGGGGGGGGG:FGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGCDFFGFDGGGGFFGDCFFGFAFFAEFEFFFFF)#0##/)##.-4:?F2<AAABBF<:>B>B:<<61>B<AA>?F0#--44:9>BAAFFBA<<63:>,

$ zcat SRR2050320_2.fastq.gz | head -20 | tail -n +1 | nl
     1  @SRR2050320.1 1 length=133
     2  AATNCACTAAACTACTACCACCCTTCACATACATAAAAAATACAATTAAATATTTTATCCATATAAAATATNCTATTATTAATTTTAATACTACATATCNNNNNNNNNNNNNNNNNNNNNNNNNNNTANNNTT
     3  +SRR2050320.1 1 length=133
     4  CCC#8:DDFCFGGGGGGGGGG@FF@EFDFEGGGGGGFCF=FFGGGAEE=D,DD,44D>CEFEGCEGAGGF8#91==1EF8,=@,3CC?FGG8F,=D,++###########################05###))
     5  @SRR2050320.2 2 length=107
     6  TACNTCAATATTTCCAACAAACCAATATTTCAAAATATCCCCTAAAACCACTTTTAAAAATCGTTTTGTTANATAAAATACTAAATAAAAAATAAACTTNNNNNNNN
     7  +SRR2050320.2 2 length=107
     8  CCC#8CFGFGFGGGGGGFGGGGGGGGFDEFEFGEGGGGGGGGGGFDGGGFFE@9=DDDEDGGG@=DCFDGF#99A68BFFGGGGGFDFG8D>?DDFFFF########
     9  @SRR2050320.3 3 length=183
    10  AAANCTTCGACCAGATCGAAAAATACACCGTCCTTCGAAAACCTAATAAAACCAATAAAAATCAAAACCCANACATTTTCCACCTATTCTCCCTCTTAANNNNNNNNNNNNNNNNNNNNNNNNNNNTANNNTTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCNNTNNNNNNNNNNNN
    11  +SRR2050320.3 3 length=183
    12  CCC#8=CF<DFCG8EFDFGGCGGCGGGGDCFFGGGGDGEGEC@FGGGGBGGGGDD>D9>>FEFFE=DEEG6#99D?8;9,@EGFF,6C9CF?D8DGDFF###########################0)###33##################################)##)############
    13  @SRR2050320.4 4 length=289
    14  CAAAATTCTACATACATACATATAAATTTACTAATAATACACCATACTTCCCACAATTCATTAAAAATATANCTCTTTACATAAAATCAATATCCTTTTNNNNNNNNNNNNNNNNTCTNNANACTATTNNNATTNNNNNNNNNNNNNNNNNNNNNNNNANNNNNNNCCNNCNNNNNNNNNNANNNNNNNNNNNNNNNANNNNNANNNNNAANNNNNNACTNAATNNNTAANNCNNNNNNTNNTCTNNNNNNNNTTATTTNNNNNNNNNNTTNNNNNNNNNNNNAAAANN
    15  +SRR2050320.4 4 length=289
    16  CCCC@,<C@@CFEFDG9F,A@@BE,A@<<FCGE,?A;,,E:FFG:EGGCGF=E>C=9DDGF;@8,@,@8EA#,911,,==D=DGC@CD,2=FG8D3=+0################)3*##1#)51=)*###0))########################3#######)0##)##########)###############1#####)#####))######-/)#)/)###-((##(######0##,((########//63:)##########,(############--(-##
    17  @SRR2050320.5 5 length=289
    18  AATATAAAACACTCAATTTTCCCCATATAAATACTACTTTATCTCTAAAAATCATAACCGAATAACTCAATNAACTAACGCACAAAAATAACGAAACGCNNNNNNNNNNNNNNNNTTCNNANCCCCGANNNTCCNNNNNNNNNNNNNNNNNNNNNNNNANNNNNNNCCNNCNNNNNNNNNNTNNNNNNNNNNNNNNNCNNNNNANNNNNACNNNNNNTCCNACCNNNACNNNCNNNNNNCNNTTTNNNNNNNNCCCAANNNNNNNNNNNCANNNNNNNNNNNNCTAANN
    19  +SRR2050320.5 5 length=289
    20  CCCCCGGGGGGGGGGAFEEFAEFFGGFGGFFGGGGGGGFCFGGGGC,EAFCEFGGG9DGGGED,=EFGA8;#9@DFGGGGGGGGDDFGDFGGDGDDG53################)*5##5#)3)8:)###)3)########################5#######5)##5##########)###############/#####)#####((######((-#(0/###/(###(######/##(((########,-(-,###########,(############((--##
brentp commented 8 years ago

So, I would take those 2 sets of 5 records and put them in r_1.fastq and r_2.fastq and get them to work. The don't have matching "@name" lines so that might be the problem. you might need to remove the length.

cb4github commented 8 years ago

Not sure what you mean by "matching...". As I understand fastq, there are normally(?) four lines per sequence.

Does bwameth.py expect every such line to start with "@name"?

brentp commented 8 years ago

I mean the first line of each quartet must match in file1 and file2.

what genome are you aligning to?

If you can paste without line-numbers then I can use them and try it out.

cb4github commented 8 years ago

Organism: Medaka fish Genome: Refseq (sorrry, evidently markdown fails for external ftp links) ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_other/Oryzias_latipes/latest_assembly_versions/GCF_000313675.1_ASM31367v1/GCF_000313675.1_ASM31367v1_genomic.fna.gz

1st 20 reads without line numbers: $ zcat SRR20503201.fastq.gz | head -20 @SRR2050320.1 1 length=305 TTTATGAAATGTAAAGAGAGGGAGNAAGGGGGGTANAATATTNTGGANNTGAAGAGTGNNAGATTATTANANACGANNGGATNTANATNNNNGGTNTTAAATTNTTTTATTAGGGTTANTTATANTAAAANNNTAGTTATTTTAGTTGAAGTGTTATAATTAAAGTTGTATGTGGGGTGGGGGGTATTTGTTTTTTTTGTTTTTTTTTATTTTTGTTTTTGGTTGAAAGAAATGTTTATGAGATAGNTNNGNNNTTTTTGTANTTTGTATTTTATAATTAAANGGCGTNTTAAGAAAATTATCTT +SRR2050320.1 1 length=305 CCCCCGGGGGGGGGGGGGGGGGGG#=CFGGGGGGG#:CFGGF#:BFG##::=FGGGGG##:BFGGGGFG#:#:BFF##::FF#:B#:B####::B#:BFFFGF#:@FFFGGGGGGGGG#8@FFC#8@FFG###66>=CEGGGGAFFFGFFGFEGGGGGGGDGFGGGGGGCEGGGGGC=BEGDGGGCFFGGFFCFCEEEGGGGGEGG=EGGFGGGGGFFGGGDDFDF769<>FFFGFFFFF4@FFF#)##(###0--:<9:2#--8?:AFFFFFFFFA<)64#((-,-#-4::?941,48<:><< @SRR2050320.2 2 length=307 TGGGGATTGGGTTTTAATGAGATGNAAGAAATTATNGNTTTGNAAGTNNATATATGGANNATATTAGATNTNTTAGNNTAGGNTANTTNNNNTAANATTTAAGNGTTTTGAACGTATGNAGTTTNTTTTGNNNTAATAGGTGTTTACAATTTTTTTTATATAATTTTGTTTATAAAAATAATAGTAAAAAAAAAAAAGAGATTAAGTTTTAAAAATAGATTGTAAGTTATAAGTTAAAAGTTAATANANNANNNTTTAATGTNTTGAGTGTATTTTTTAATANATAANNAAATANGAGTTGATATTA +SRR2050320.2 2 length=307 CCCCCGG@E@F@FFGGGFAF9FFE#=C8CFFFFF<#6#::D,#9:<C##::CFFCFCD##::CCFDFCE#:#::C<##:96,#6,#::####6::#96CDDD9#,:B==,:AFEFBD,#49BFF#:AFF+###:44A+,,CEFFEEFFAFFFDFFFGGFCFEFDFGF;DDFCCDF;@9ECDGFCDEAAC9:CC:C=:__,,491,,2+<@979C4;7+111277:+1177:+32++++++37++#0##(###.-(,-)..#((-((.-5))4)-.(.3))#(,(-##(((-,#((-,())..-). @SRR2050320.3 3 length=306 ATAAGATTTAGAGGGAAATAAAAANAAGAAAAAGGNTTAAATNATGGNNTGAAAATATNNAAAGAAATGNANAATANNAAGGNAANAANNNNTGTNTAGAAGGNGAGAGGAAAGAAAGNGTTAGNGTATTNNNAATGAGGTATAGAAAATAGTAAGTGGGAAATAAAAAAAATTGTTTAAAAAGTGGGTAGTTGAAGGTTGAATAAATAGTTATAGAAGTATATTATAGAGAGTGTGGATGAGCGTNTNNTNNNTAAAATTANAGATAGAAAGATAAGAAATNTATANNAAGTTNTGTAGGGATTA +SRR2050320.3 3 length=306 CCCCCGGGGGGGGGGGGGGGGGGG#=DFGGGGGGG#:CFGGF#:CFF##::@FFGGGG##:CFGGGGGG#:#:CFG##:CDF#:C#:C####9::#9CFFGGG#:BFGGGGGGGGGGD#:BFFF#4AFGG###::DFGGGGGGFGGGFGGGGGGFGFGGGGGGGGGGGGGGGGGFGBFGGGGGGCFFFGGGFFFGFGFGGGFEGGFFGGGGGGGGGGGGGGGGGGFGFFGF>DFGFFFFAFFFFF5#.##(###.//67?=8#/6,38AA9>:<AFAFAA?)#(-81##(-442#(.446)8::?F @SRR2050320.4 4 length=310 ATTTTATTTTCAATGATAAGGTAGNATATAGGATTNGATTTANATGTNNAGTTTAAGANNGTTTTAATTNTNGGATNNAAGANAAGAGNTNNTTTNGATAGGTNATAGTATATGATAAATAGGTNTAGGATANAGGGAAATATAGGTTATAAATGTATGAGGAGGGGAAGTATAGGTGGGAATAATTAGGGATGATTTTTTTTTGTATATTAAATAATTTATTAGAATAAATGAATATATACAACANANNAANNAATTATAGTATTATATAATATATATTTAATAAAANATATAAATAATTTTAAGTAGT +SRR2050320.4 4 length=310 BCCCCGGCGG<FGGGGFGCFGGG#:CC@DF9CCC#:CCCFG#:C96##::C,CFGAC##9:C,6CFFE#:#96,C##::6,#,:CD,#:##::=#:69B=8F#:BFDEDGGG<FFFFFG<F8F#::D+AF<#:++4:=FFGGGGFDFGGGGGGGGGGFGCFF+B@CE8EDCFD9>3@+@8BBFF;E8,@ECC87F:3@EEGG_;,>,B:7<,:2:+=@++++++11=++11+4++=+11186)/#)##(0##.(/23)))15)7)./.93??CF?E).1<)5))))#,(43(.6<)4)).)),).))) @SRR2050320.5 5 length=310 TTTGTTTTGATTTTGGGTAATAGANTAAAGTTGTTNAGTAAANGAAANNTTTTGTGGANNGAAGTTTGTNTNAGTTNNTTAANTTTTANTNNGAANACGAAGTNTATTTTGTTTTTGTAGAAAANGGAGTTTNGTTGATTTTGTGATCGGTGAAGATTCGGATATAGTTTTTCGTTAGGATATTTTAGAGTCGGATGGTGCGGTTAGACGAGTTTGTAATTACGTAGTTGGAGTTGGGGTGGAAACNGNNGTNNGTGATATTAGTTAAGTGATCGGTAAATATTCGTANGGGTTGGTAGTGATTCGTCGG +SRR2050320.5 5 length=310 CCCCCGGGGGGGGGGGGGGGGGGG#:CFGGFGGGG#:DFGGG#:CFG##:CFGGGGGG##:CFFFGGGF#:#:CFG##:CFF#:CFGG#:##::D#:CFFGGG#:BFGGGGGGGGGGGGGGGGG#:?DFFFG#:ADFGGGGGGGGGGGGGGGGGGGGGG:FGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGCDFFGFDGGGGFFGDCFFGFAFFAEFEFFFFF)#0##/)##.-4:?F2<AAABBF<:>B>B:<<61>B?F0#--44:9>BAAFFBA<<63:>,

$ zcat SRR20503202.fastq.gz | head -20 @SRR2050320.1 1 length=133 AATNCACTAAACTACTACCACCCTTCACATACATAAAAAATACAATTAAATATTTTATCCATATAAAATATNCTATTATTAATTTTAATACTACATATCNNNNNNNNNNNNNNNNNNNNNNNNNNNTANNNTT +SRR2050320.1 1 length=133 CCC#8:DDFCFGGGGGGGGGG@FF@EFDFEGGGGGGFCF=FFGGGAEE=D,DD,44D>CEFEGCEGAGGF8#91==1EF8,=@,3CC?FGG8F,=D,++###########################05###)) @SRR2050320.2 2 length=107 TACNTCAATATTTCCAACAAACCAATATTTCAAAATATCCCCTAAAACCACTTTTAAAAATCGTTTTGTTANATAAAATACTAAATAAAAAATAAACTTNNNNNNNN +SRR2050320.2 2 length=107 CCC#8CFGFGFGGGGGGFGGGGGGGGFDEFEFGEGGGGGGGGGGFDGGGFFE@9=DDDEDGGG@=DCFDGF#99A68BFFGGGGGFDFG8D>?DDFFFF######## @SRR2050320.3 3 length=183 AAANCTTCGACCAGATCGAAAAATACACCGTCCTTCGAAAACCTAATAAAACCAATAAAAATCAAAACCCANACATTTTCCACCTATTCTCCCTCTTAANNNNNNNNNNNNNNNNNNNNNNNNNNNTANNNTTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCNNTNNNNNNNNNNNN +SRR2050320.3 3 length=183 CCC#8=CFDFCG8EFDFGGCGGCGGGGDCFFGGGGDGEGEC@FGGGGBGGGGDDD9>>FEFFE=DEEG6#99D?8;9,@EGFF,6C9CF?D8DGDFF###########################0)###33##################################)##)############ @SRR2050320.4 4 length=289 CAAAATTCTACATACATACATATAAATTTACTAATAATACACCATACTTCCCACAATTCATTAAAAATATANCTCTTTACATAAAATCAATATCCTTTTNNNNNNNNNNNNNNNNTCTNNANACTATTNNNATTNNNNNNNNNNNNNNNNNNNNNNNNANNNNNNNCCNNCNNNNNNNNNNANNNNNNNNNNNNNNNANNNNNANNNNNAANNNNNNACTNAATNNNTAANNCNNNNNNTNNTCTNNNNNNNNTTATTTNNNNNNNNNNTTNNNNNNNNNNNNAAAANN +SRR2050320.4 4 length=289 CCCC@,C@@CFEFDG9F,A@@BE,A@<<FCGE,?A;,,E:FFG:EGGCGF=EC=9DDGF;@8,@,@8EA#,911,,==D=DGC@CD,2=FG8D3=+0################)3##1#)51=)_###0))########################3#######)0##)##########)###############1#####)#####))######-/)#)/)###-((##(######0##,((########//63:)##########,(############--(-## @SRR2050320.5 5 length=289 AATATAAAACACTCAATTTTCCCCATATAAATACTACTTTATCTCTAAAAATCATAACCGAATAACTCAATNAACTAACGCACAAAAATAACGAAACGCNNNNNNNNNNNNNNNNTTCNNANCCCCGANNNTCCNNNNNNNNNNNNNNNNNNNNNNNNANNNNNNNCCNNCNNNNNNNNNNTNNNNNNNNNNNNNNNCNNNNNANNNNNACNNNNNNTCCNACCNNNACNNNCNNNNNNCNNTTTNNNNNNNNCCCAANNNNNNNNNNNCANNNNNNNNNNNNCTAANN +SRR2050320.5 5 length=289 CCCCCGGGGGGGGGGAFEEFAEFFGGFGGFFGGGGGGGFCFGGGGC,EAFCEFGGG9DGGGED,=EFGA8;#9@DFGGGGGGGGDDFGDFGGDGDDG53################)*5##5#)3)8:)###)3)########################5#######5)##5##########)###############/#####)#####((######((-#(0/###/(###(######/##(((########,-(-,###########,(############((--##

(Thanks again.)

cb4github commented 8 years ago

Also, in my experience the indexing failed unless I convert the genome to blocked gzip, e.g.

zcat <genome>.gz | bgzip -c > <genome.bgz>
brentp commented 8 years ago

I opened an issue on bwa for this here

cb4github commented 8 years ago

fyi, please ignore my previous restriction re bgz as I am unable to reproduce it.

Also, I'm currently running bwa-meth (see CLI below) using gzip'd reference and unzipped read files (13GB each)and it seems to be working fine, though not completed yet. I'll report back with result.

time bwameth.py \
    --reference $REFERENCE \
    $READS/SRR2050320_1.fastq \
    $READS/SRR2050320_2.fastq \
    -t $SLURM_NTASKS_PER_NODE
cb4github commented 8 years ago

fyi, re bias-plot.py (and while the above bwa-meth attempt with unzipped reads is still running as I type) in a separate run I've also tried bwa mem separately using the reads preconverted via c2t...

time bwa mem -T 40 -B 2 -L 10 -CM -U 100 -p -t 20 \
        $PROJECT/OryziasLatipes/GCF_000313675.1_ASM31367v1_genomic.fna.gz.bwameth.c2t \
        $PROJECT/sra-toolkit/SRR2050320.c2t \
        | samtools view -bS - | samtools sort -m 2415919104 - -T bwa-meth -o bwa-meth.bam

...and that ran successfully. Per the subsequent alignment check, the reads were recognized as paired...

40360984 + 0 in total (QC-passed reads + QC-failed reads)
1068934 + 0 secondary
0 + 0 supplimentary
0 + 0 duplicates
39993672 + 0 mapped (99.09%:-nan%)
39292050 + 0 paired in sequencing
19646025 + 0 read1
19646025 + 0 read2
37242588 + 0 properly paired (94.78%:-nan%)
38657828 + 0 with itself and mate mapped
266910 + 0 singletons (0.68%:-nan%)
1147380 + 0 with mate mapped to a different chr
567463 + 0 with mate mapped to a different chr (mapQ>=5)

...but the bias plot...

REFERENCE=../OryziasLatipes/GCF_000313675.1_ASM31367v1_genomic.fna.bgz
python ~/bwa-meth/bias-plot.py bwa-meth.bam $REFERENCE

...fails with...

[fai_load] build FASTA index.
[fai_fetch] Warning - Reference rNC_019859.1 not found in FASTA file, returning empty sequence
Failed to fetch sequence in rNC_019859.1
[fai_fetch] Warning - Reference fNC_019859.1 not found in FASTA file, returning empty sequence
Failed to fetch sequence in fNC_019859.1
[fai_fetch] Warning - Reference rNC_019860.1 not found in FASTA file, returning empty sequence
Failed to fetch sequence in rNC_019860.1
[fai_fetch] Warning - Reference fNC_019860.1 not found in FASTA file, returning empty sequence
Failed to fetch sequence in fNC_019860.1
[fai_fetch] Warning - Reference rNC_019861.1 not found in FASTA file, returning empty sequence
Failed to fetch sequence in rNC_019861.1
Traceback (most recent call last):
  File "/home/user/bwa-meth/bias-plot.py", line 112, in <module>
    check_bias(bam, ref)
  File "/home/user/bwa-meth/bias-plot.py", line 59, in check_bias
    read_length = max(meth_counts[0].keys())
ValueError: max() arg is an empty sequence

So then I tried the bias plot with the .gz (vice .bgz) reference...

REFERENCE=../OryziasLatipes/GCF_000313675.1_ASM31367v1_genomic.fna.gz
time python ~/bwa-meth/bias-plot.py bwa-meth.bam $REFERENCE

...and failed with...

[fai_load] build FASTA index.
Cannot index files compressed with gzip, please use bgzip
[fai_load] fail to open FASTA index.
Could not load fai index of ../OryziasLatipes/GCF_000313675.1_ASM31367v1_genomic.fna.gz
[fai_load] build FASTA index.
Cannot index files compressed with gzip, please use bgzip
[fai_load] fail to open FASTA index.
Could not load fai index of ../OryziasLatipes/GCF_000313675.1_ASM31367v1_genomic.fna.gz
[fai_load] build FASTA index.
Cannot index files compressed with gzip, please use bgzip
[fai_load] fail to open FASTA index.
Could not load fai index of ../OryziasLatipes/GCF_000313675.1_ASM31367v1_genomic.fna.gz
[fai_load] build FASTA index.
Cannot index files compressed with gzip, please use bgzip
[fai_load] fail to open FASTA index.
Could not load fai index of ../OryziasLatipes/GCF_000313675.1_ASM31367v1_genomic.fna.gz
[fai_load] build FASTA index.
Cannot index files compressed with gzip, please use bgzip
[fai_load] fail to open FASTA index.
Could not load fai index of ../OryziasLatipes/GCF_000313675.1_ASM31367v1_genomic.fna.gz
Traceback (most recent call last):
  File "/home/user/bwa-meth/bias-plot.py", line 112, in <module>
    check_bias(bam, ref)
  File "/home/user/bwa-meth/bias-plot.py", line 59, in check_bias
    read_length = max(meth_counts[0].keys())
ValueError: max() arg is an empty sequence
cb4github commented 8 years ago

Ok, using bwameth.py with unzipped reads...

time bwameth.py \
    --reference $REFERENCE \
    $READS/SRR2050320_1.fastq \
    $READS/SRR2050320_2.fastq \
    -t $SLURM_NTASKS_PER_NODE

...here's the alignment check results...

34571004 + 5789978 in total (QC-passed reads + QC-failed reads)
49440 + 1019492 secondary
0 + 0 supplimentary
0 + 0 duplicates
34262741 + 5730929 mapped (99.11%:98.98%)
34521564 + 4770486 paired in sequencing
17260782 + 2385243 read1
17260782 + 2385243 read2
33834832 + 0 properly paired (98.01%:0.00%)
34005416 + 4652388 with itself and mate mapped
207885 + 59049 singletons (0.60%:1.24%)
98538 + 1030988 with mate mapped to a different chr
60682 + 363776 with mate mapped to a different chr (mapQ>=5)

Also, (using the unzipped reads) I now get the bias plot via bias-plot.py (presumably deprecated)...

bwa-meth bias

Going forward it would be nice to be able to run (bwa mem per your opened issue above) with zipped reads, thanks.

brentp commented 8 years ago

@cb4github I didn't notice this but someone else did, your seqs and quality scores were not the same length . That was what was causing the error from bwa. If you still see another error in current version of bwa-meth, please let me know.