Closed mcmero closed 5 years ago
I'll look into why this is occuring. Since it is occuring in the beginning I should be able to fix it quickly. Can you send me the first 20 or so lines of your fasta files? I also may need a copy of your cosmic_transcripts.fasta
file.
Unrelated but that is a crazy amount of hash functions (-g
).
Thanks Justin. Here's what my fastq files look like: R1:
@SRR949078.1 HWI-ST942:86:D094KACXX:1:1101:1108:2117 length=100
CTTGNACCACAGCGATGAAGGTGGACGCGATCTGGGTGTGATGGTGCCGGGTCTCCAGGGCTGCAGTCACTGCCTGGTCAATGTTTCTCTCCACTGTCGT
+SRR949078.1 HWI-ST942:86:D094KACXX:1:1101:1108:2117 length=100
@@@D#2ABFHF<?EHFGEGIIGHEIIGGGHBHCHIIBAAAGHIGDE@DEFE5>CCCCCCBB?;@CC>>CC:ACC::?:AACACC@CCCCCC94<>ACCB2
@SRR949078.2 HWI-ST942:86:D094KACXX:1:1101:1121:2136 length=100
TTTGAAGATGCCGCATTTGGATTGGATGAATTCCAAATTCTGCTTGCTTGCTTTTTAATATTGATATGCTTATACACTTACACTTTATGCACAAAATGTA
+SRR949078.2 HWI-ST942:86:D094KACXX:1:1101:1121:2136 length=100
????DBDB:22+0)@8<AFE9CEDDEDEEEIIDDEEDIIIECEEIIDCDD>EBDDCCCBCEI<AAEIIEEIEIIDD;ADDA>;?ADABAADAA???AA>B
@SRR949078.3 HWI-ST942:86:D094KACXX:1:1101:1108:2161 length=100
GAAATTTGTGTGGTTCGCTTCACACCAGTAACTGAAGAAGATCAAATTTCTTATACTTTGCTCTTTGCATACTTCAGTAGCAGAAAGCGCTATGGAGTAN
+SRR949078.3 HWI-ST942:86:D094KACXX:1:1101:1108:2161 length=100
CCCFFFFFHHHHHIJJIJJJJJJJJJJJIJJJIJJJJJJJJJJJJJJJJIJJIJIJJJJIIJJJJJJJJJJHHHHHFHFFFFFFEEEEDDDDDDDDDCCC
@SRR949078.4 HWI-ST942:86:D094KACXX:1:1101:1185:2168 length=100
CTGACCTGACCTGCTCTGGGATCCGCTGTGTGCGTCACGCTTGCCTGAAACATGAACAGAATCGGCAAGACCTGGTGAAAGCTGGCGTGCTGCCTCTGCT
+SRR949078.4 HWI-ST942:86:D094KACXX:1:1101:1185:2168 length=100
BBCFDDDFDHHHFJJIJJJJIJIJIJIIEGDHDHHJJIJJIJJGGIJGIGHIGG>ECEEHA3=ADFDDDDDDDCD:5@@C@?CAA8>09B?@ABCA>>C>
@SRR949078.5 HWI-ST942:86:D094KACXX:1:1101:1138:2172 length=100
CAATTATATTACTACCACTGACATGACTTTCCAAAAAAAACACATAATTTGAATCAACACAACCACCCACAGCCTAATTATTAGCATCATCCCTCTACTA
+SRR949078.5 HWI-ST942:86:D094KACXX:1:1101:1138:2172 length=100
CCCFFFFFHHHGGIJJJJJJJIJJIJJJJJJJJJJJJJJJJJJJJJJJJJJJJJIIJJJIHHHHFFFDDDDDDDDDDDEEEEDDDDEDDDDDDDDDDDEA
R2
@SRR949078.1 HWI-ST942:86:D094KACXX:1:1101:1108:2117 length=100
CTTGNACCACAGCGATGAAGGTGGACGCGATCTGGGTGTGATGGTGCCGGGTCTCCAGGGCTGCAGTCACTGCCTGGTCAATGTTTCTCTCCACTGTCGT
+SRR949078.1 HWI-ST942:86:D094KACXX:1:1101:1108:2117 length=100
@@@D#2ABFHF<?EHFGEGIIGHEIIGGGHBHCHIIBAAAGHIGDE@DEFE5>CCCCCCBB?;@CC>>CC:ACC::?:AACACC@CCCCCC94<>ACCB2
@SRR949078.2 HWI-ST942:86:D094KACXX:1:1101:1121:2136 length=100
TTTGAAGATGCCGCATTTGGATTGGATGAATTCCAAATTCTGCTTGCTTGCTTTTTAATATTGATATGCTTATACACTTACACTTTATGCACAAAATGTA
+SRR949078.2 HWI-ST942:86:D094KACXX:1:1101:1121:2136 length=100
????DBDB:22+0)@8<AFE9CEDDEDEEEIIDDEEDIIIECEEIIDCDD>EBDDCCCBCEI<AAEIIEEIEIIDD;ADDA>;?ADABAADAA???AA>B
@SRR949078.3 HWI-ST942:86:D094KACXX:1:1101:1108:2161 length=100
GAAATTTGTGTGGTTCGCTTCACACCAGTAACTGAAGAAGATCAAATTTCTTATACTTTGCTCTTTGCATACTTCAGTAGCAGAAAGCGCTATGGAGTAN
+SRR949078.3 HWI-ST942:86:D094KACXX:1:1101:1108:2161 length=100
CCCFFFFFHHHHHIJJIJJJJJJJJJJJIJJJIJJJJJJJJJJJJJJJJIJJIJIJJJJIIJJJJJJJJJJHHHHHFHFFFFFFEEEEDDDDDDDDDCCC
@SRR949078.4 HWI-ST942:86:D094KACXX:1:1101:1185:2168 length=100
CTGACCTGACCTGCTCTGGGATCCGCTGTGTGCGTCACGCTTGCCTGAAACATGAACAGAATCGGCAAGACCTGGTGAAAGCTGGCGTGCTGCCTCTGCT
+SRR949078.4 HWI-ST942:86:D094KACXX:1:1101:1185:2168 length=100
BBCFDDDFDHHHFJJIJJJJIJIJIJIIEGDHDHHJJIJJIJJGGIJGIGHIGG>ECEEHA3=ADFDDDDDDDCD:5@@C@?CAA8>09B?@ABCA>>C>
@SRR949078.5 HWI-ST942:86:D094KACXX:1:1101:1138:2172 length=100
CAATTATATTACTACCACTGACATGACTTTCCAAAAAAAACACATAATTTGAATCAACACAACCACCCACAGCCTAATTATTAGCATCATCCCTCTACTA
+SRR949078.5 HWI-ST942:86:D094KACXX:1:1101:1138:2172 length=100
CCCFFFFFHHHGGIJJJJJJJIJJIJJJJJJJJJJJJJJJJJJJJJJJJJJJJJIIJJJIHHHHFFFDDDDDDDDDDDEEEEDDDDEDDDDDDDDDDDEA
Here's the reference: cosmic_transcripts.fasta.gz
Is there a recommended range for the hash function?
(edit -- posted wrong read files)
Typically no more than 8, 3-6 range would be standard. In the TAP pipeline I thought it used 4 spaced seeds rather than k-mers.
@readmanchiu Does the TAP pipeline automatically make the filter with spaced seeds or is it manual?
The problem does seem to be due to read length issues. For a quick fix for now you could try filtering out all read pairs with one read shorter than the k-mer size (in your example 25).
Regarding the targets file: If you want reads per gene (rather than per transcript), it may make sense to group the transcripts by gene first in seperate files and use the -F
option (group by file).
TAP runs biobloommicategorizer (or biobloomcategorizer in previous versions). It's hard-coded and assumes space seed filtered, and the command is not configurable. The onus is on the user to make sure the bloomfilter is compatible with categorizer command
Hmmm... It may make sense to add at least one default/example set of seeds somewhere in the documentation.
Is this linked to users somewhere in the TAP documentation?
I think I've fixed it and pushed a fix to the master branch. @mcmero if possible can you send me the read that it fails at? It seems that the reads you posted don't actually have any problems with them. I started testing with some variable length reads and fixed some issues but I want to make sure everything is resolved. Thanks.
I tried running it again with the latest fix, as well as a regenerated bloomfilter using group by file and space seed filtering. It's now working without issues. Thanks.
I'm running into a core dump error while running biobloommicategorizer (v2.3.2). I'm creating a mask like so:
Then running (as part of the tap pipeline):
I then run into this error (I've cut some of the output):
My fastq files have different read lengths, so I am wondering whether this might be related.