BenLangmead / bowtie

An ultrafast memory-efficient short read aligner

misbehavior of --un parameter when more than 1 core is used #53

Open germaximus opened 7 years ago

germaximus commented 7 years ago

According to the manual "As always, the --un, --max and --al parameters print reads exactly as they appeared in the input file."

It turns out this is not exactly true for --un. If more than one processor is used, bowtie apparently sorts (or randomizes?) reads by name before writing them to the file. Moreover, it does so in chunks whose size is limited by the number of active processors. If the original fastq file was unsorted or only partially sorted, the output fastq will have a different order of reads. This is not an issue with a single core, because then each chunk consists of only one read and no sorting is done. It feels like unintended behavior.

Example with bowtie 1.1.2.

Read names extracted from the original fastq (as you can tell, partially sorted):

@HISEQ:942:HFKY5BCXY:1:1101:1247:2108
@HISEQ:942:HFKY5BCXY:1:1101:1089:2111
@HISEQ:942:HFKY5BCXY:1:1101:1142:2118
@HISEQ:942:HFKY5BCXY:1:1101:1237:2121
@HISEQ:942:HFKY5BCXY:1:1101:1162:2124
@HISEQ:942:HFKY5BCXY:1:1101:1118:2124

bowtie -p 1 --un gives exactly the same order (if all of them are unmapped, of course):

@HISEQ:942:HFKY5BCXY:1:1101:1247:2108
@HISEQ:942:HFKY5BCXY:1:1101:1089:2111
@HISEQ:942:HFKY5BCXY:1:1101:1142:2118
@HISEQ:942:HFKY5BCXY:1:1101:1237:2121
@HISEQ:942:HFKY5BCXY:1:1101:1162:2124
@HISEQ:942:HFKY5BCXY:1:1101:1118:2124

bowtie -p 50 --un gives a scrambled order, different from the original:

@HISEQ:942:HFKY5BCXY:1:1101:1089:2111
@HISEQ:942:HFKY5BCXY:1:1101:1103:2154
@HISEQ:942:HFKY5BCXY:1:1101:1083:2150
@HISEQ:942:HFKY5BCXY:1:1101:1118:2124
@HISEQ:942:HFKY5BCXY:1:1101:1120:2216
@HISEQ:942:HFKY5BCXY:1:1101:1162:2124
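For anyone who wants to check their own files, here is a rough Python sketch of the comparison made above. It is not part of bowtie; the function names `read_names` and `first_out_of_order` are made up for illustration. It assumes plain four-line FASTQ records with unique read names, and it skips output names that are missing from the input list (as happens with the truncated excerpts in this report).

```python
def read_names(fastq_lines):
    """Return the name line of each record, assuming 4-line FASTQ records."""
    lines = [line.rstrip("\n") for line in fastq_lines]
    return lines[0::4]


def first_out_of_order(input_names, output_names):
    """Return the first output name that appears earlier in the input than
    a name already seen in the output, or None if relative order is kept.

    Assumes read names are unique. Output names absent from input_names
    are skipped rather than treated as errors.
    """
    position = {name: i for i, name in enumerate(input_names)}
    last = -1
    for name in output_names:
        idx = position.get(name)
        if idx is None:
            continue  # name not in the input list; nothing to compare
        if idx < last:
            return name  # this read was emitted after a later input read
        last = idx
    return None


# Read names copied from the examples in this report.
original = [
    "@HISEQ:942:HFKY5BCXY:1:1101:1247:2108",
    "@HISEQ:942:HFKY5BCXY:1:1101:1089:2111",
    "@HISEQ:942:HFKY5BCXY:1:1101:1142:2118",
    "@HISEQ:942:HFKY5BCXY:1:1101:1237:2121",
    "@HISEQ:942:HFKY5BCXY:1:1101:1162:2124",
    "@HISEQ:942:HFKY5BCXY:1:1101:1118:2124",
]
with_p50 = [
    "@HISEQ:942:HFKY5BCXY:1:1101:1089:2111",
    "@HISEQ:942:HFKY5BCXY:1:1101:1103:2154",
    "@HISEQ:942:HFKY5BCXY:1:1101:1083:2150",
    "@HISEQ:942:HFKY5BCXY:1:1101:1118:2124",
    "@HISEQ:942:HFKY5BCXY:1:1101:1120:2216",
    "@HISEQ:942:HFKY5BCXY:1:1101:1162:2124",
]

print(first_out_of_order(original, original))
print(first_out_of_order(original, with_p50))
```

On real data, the two name lists would come from passing the opened input fastq and the opened --un output file to `read_names`.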

PS. I just noticed that a newer version of bowtie is available (1.2.0); I haven't tested it yet.