barricklab / breseq

breseq is a computational pipeline for finding mutations relative to a reference sequence in short-read DNA resequencing data. It is intended for haploid microbial genomes (<20 Mb). breseq is a command line tool implemented in C++ and R.
http://barricklab.org/breseq
GNU General Public License v2.0

Segmentation fault in samtools sort #309

Closed ellieharrisn closed 1 year ago

ellieharrisn commented 2 years ago

Hello

I am attempting to run breseq using a de novo assembled draft reference (.gff) and am hitting an error at the sorting stage (output below). The reference is in ~200 contigs, some of them very small (128 bp). Could this be the issue, and can I get around it? I have run the analysis using both -r and -c and hit the same issue (I removed the previous output each time). I have included the details of the files I am using below.

Any help appreciated - thanks!

breseq output:

+++ NOW PROCESSING Preliminary analysis of coverage distribution
[samtools] import /shared/harrison_lab1/User/bo1eah/NERC_tripartite_evolution1/breseq/N1_T0/data/reference.fasta.fai /shared/harrison_lab1/User/bo1eah/NERC_tripartite_evolution1/breseq/N1_T0/03_candidate_junctions/best.sam /shared/harrison_lab1/User/bo1eah/NERC_tripartite_evolution1/breseq/N1_T0/03_candidate_junctions/best.unsorted.bam
[samtools] sort --threads 8 -o /shared/harrison_lab1/User/bo1eah/NERC_tripartite_evolution1/breseq/N1_T0/03_candidate_junctions/best.bam -T /shared/harrison_lab1/User/bo1eah/NERC_tripartite_evolution1/breseq/N1_T0/03_candidate_junctions/best.bam /shared/harrison_lab1/User/bo1eah/NERC_tripartite_evolution1/breseq/N1_T0/03_candidate_junctions/best.unsorted.bam
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!> FATAL ERROR <!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Segmentation Fault
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!> STACK TRACE <!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Backtrace with 0 stack frames.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Snippets of .gff reference file ($ref_chr)

##gff-version 3
##sequence-region NODE_1 1 625525
##sequence-region NODE_2 1 545641
##sequence-region NODE_3 1 489161
##sequence-region NODE_4 1 420774
##sequence-region NODE_5 1 414920
...
##sequence-region NODE_165 1 128
##sequence-region NODE_166 1 128
##sequence-region NODE_167 1 128
##sequence-region NODE_168 1 128
##sequence-region NODE_169 1 128
.....
NODE_1 Prodigal:2.6 CDS 216 1097 . - 0 ID=34044_TRX19vTRX321_00001;inference=ab initio prediction:Prodigal:2.6;locus_tag=34044_TRX19vTRX321_00001;product=hypothetical protein
NODE_1 Prodigal:2.6 CDS 1130 1582 . - 0 ID=34044_TRX19vTRX321_00002;inference=ab initio prediction:Prodigal:2.6;locus_tag=34044_TRX19vTRX321_00002;product=hypothetical protein
NODE_1 Prodigal:2.6 CDS 1790 5014 . + 0 ID=34044_TRX19vTRX321_00003;eC_number=2.4.99.16;Name=glgE_1;gene=glgE_1;inference=ab initio prediction:Prodigal:2.6,protein motif:HAMAP:MF_02124;locus_tag=34044_TRX19vTRX321_00003;product=Alpha-1%2C4-glucan:maltose-1-phosphate maltosyltransferase
....
NODE_110 Prodigal:2.6 CDS 12 242 . + 0 ID=34044_TRX19vTRX321_07582;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:ISfinder:ISRel26;locus_tag=34044_TRX19vTRX321_07582;product=ISNCY family transposase ISRel26
NODE_112 Prodigal:2.6 CDS 60 215 . + 0 ID=34044_TRX19vTRX321_07583;inference=ab initio prediction:Prodigal:2.6;locus_tag=34044_TRX19vTRX321_07583;product=hypothetical protein
NODE_119 Prodigal:2.6 CDS 84 272 . + 0 ID=34044_TRX19vTRX321_07584;inference=ab initio prediction:Prodigal:2.6;locus_tag=34044_TRX19vTRX321_07584;product=hypothetical protein
NODE_127 Prodigal:2.6 CDS 20 145 . + 0 ID=34044_TRX19vTRX321_07585;inference=ab initio prediction:Prodigal:2.6;locus_tag=34044_TRX19vTRX321_07585;product=hypothetical protein
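To see how many very short contigs a draft reference contains, the lengths can be read from the GFF3 `##sequence-region <seqid> <start> <end>` directives. The following is a minimal sketch, not part of breseq; the function names and the `min_length` threshold are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, List


def contig_lengths(gff_lines: Iterable[str]) -> Dict[str, int]:
    """Collect contig lengths from GFF3 '##sequence-region' directives."""
    lengths: Dict[str, int] = {}
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; stop scanning
        if not line.startswith("##sequence-region"):
            continue
        fields = line.split()
        if len(fields) != 4:
            continue  # malformed directive; skip rather than guess
        _, seqid, start, end = fields
        lengths[seqid] = int(end) - int(start) + 1
    return lengths


def short_contigs(lengths: Dict[str, int], min_length: int) -> List[str]:
    """Return the IDs of contigs shorter than min_length (hypothetical cutoff)."""
    return sorted(seqid for seqid, n in lengths.items() if n < min_length)


if __name__ == "__main__":
    example = [
        "##gff-version 3",
        "##sequence-region NODE_1 1 625525",
        "##sequence-region NODE_165 1 128",
    ]
    print(short_contigs(contig_lengths(example), min_length=500))
```

In practice the lines would come from the reference file itself, for example `contig_lengths(open(path))`.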

FASTA

>NODE_1
GTGTCATCACCCGCCCCCGCTCCGCCCTTTCAACCCGGCCCAGCCGGCCGGATCGGGGGC

example of .fastq file ($R1 and R2)

@A00892:159:HKNC2DRXY:2:2104:5095:1016 1:N:0:CGCACTTCGT+AATATGCCAG
ACCAACTAAAAATTACCATCAACCTCAAACACATCGGAAATAACGACGACACGATCAAAGCGCAACACATTAAACAAAACAACGTAAGTACGTAGCCGCTTATAAAATATATCTAAAGAACACAACTCTAAACAACAGACACCACAAAT
+
:FF:::F:,::,FFF,FFF:FF,F,FFF,:,F,F,F,,F,:FF:F:FF,FF:F,FF,::F:F,:F:F:,FF,::,FFF:,,:,,F:F,F,,F:,FFF,:FF:,F,,,:FFFF:,::FFF,:FFF,:::,,::F,,F:,:F:F,,F,,,F
@A00892:159:HKNC2DRXY:2:2104:22987:2033 1:N:0:CGCACTTCGT+AATATGCCAG
ATTCCGAACAAGGCAACCAGCAACCCCGATAACCACTACTCCAAAAAATGAATCAACAACACCAGAAAAACAAATATTAAATCCAAAATAATCAACACCACAAACAGCACTACCTAACTTTTGACCCGAAAAGAGATAGAATAAGCGGCTA
+
:F,FFF::F,FF,F,:FF,F,:FFFF:F:F,F:F,,:F,,F:,,F::,FF:FF:,FF,F::F::F,,FFFF,,:F,:F:::,FF,::,FFFFFF:,,:F,:F,:::FF:F:F:FF,,,F,F:,,,F,FF,FF,:F:FFF,,,,FFFFFFF,
@A00892:159:HKNC2DRXY:2:2104:18385:2206 1:N:0:CGCACTTCGT+AATATGCCAG
ATGATAGACCGGCGTGAAGCTGTGCATCGTCACGATGATGCTGTCTTGCCCCCTTGCCCGACGATCGCGGATCAGCCCGCGAATGGCGTCGTGGAAAGGCACATAGG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFF:FFFFFFFFFFF:FF:FFFFFFFF

command run: breseq -j 8 -o /path/$SAMPLE_ID -c $ref_chr $R1 $R2

jeffreybarrick commented 2 years ago

I wouldn't expect the short contigs to cause a problem, but I'm also not sure what this could be.

What version of breseq are you using?

If it is an older one, please try with a new one.

If it is a newer one (0.36.1 or 0.37.0), then please contact me at the email address in the header, and I can provide a place for you to upload your files for me to test so I can reproduce the bug.

ellieharrisn commented 2 years ago

Hi,

Thank you very much for investigating this. I am using 0.36.1. I have it working with my phage genome only (GenBank file, one contig), but it fails when I add the bacterial ancestor .gff (samples are lysogens from an EE experiment).

Incidentally, it may be worth noting that I originally had trouble using my fastq.gz files. Different samples would give me one of various errors saying they lacked headers, or that the quality score lengths didn't match the read lengths, and so on, and when I looked at the offending file it did have the fault. After redownloading, the files are fine, but (different) faults appear after attempting the analysis. I have gotten around this by decompressing them before running breseq.

Sorry to pile more on, but I am mentioning it in case it is useful when I send the files over.

Many thanks

Ellie


jeffreybarrick commented 2 years ago

Regarding the FASTQ errors: it sounds like you may have had problems transferring the files, leaving them truncated at random points. This could happen due to connection issues or because the downloads filled up all the free space on your hard disk.
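One way to check for this kind of truncation before starting a run is to scan each FASTQ file for incomplete records. The following is a minimal sketch, not part of breseq, assuming standard four-line FASTQ records; the function name `first_fastq_problem` is hypothetical. It also reports a damaged or truncated gzip stream, which Python signals with `EOFError`, `gzip.BadGzipFile`, or `zlib.error`.

```python
import gzip
import zlib
from typing import Optional


def first_fastq_problem(path: str) -> Optional[str]:
    """Describe the first problem found, or return None if the file looks intact."""
    opener = gzip.open if path.endswith(".gz") else open
    n = 0  # number of lines consumed so far
    try:
        with opener(path, "rt") as handle:
            while True:
                header = handle.readline()
                if not header:
                    return None  # clean end of file on a record boundary
                seq = handle.readline()
                plus = handle.readline()
                qual = handle.readline()
                n += 4
                if not qual:
                    return f"record starting at line {n - 3} is incomplete"
                if not header.startswith("@"):
                    return f"line {n - 3}: header does not start with '@'"
                if not plus.startswith("+"):
                    return f"line {n - 1}: separator does not start with '+'"
                if len(seq.rstrip("\r\n")) != len(qual.rstrip("\r\n")):
                    return f"line {n}: quality length differs from read length"
    except (EOFError, gzip.BadGzipFile, zlib.error) as err:
        return f"gzip stream is damaged or truncated: {err}"
```

Running this on a file straight after download, and again just before the analysis, would show whether a file changed or was cut short in between.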

It is also common to have problems during a breseq run if you are running out of hard disk space. Do you think this could be the issue in your case?
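Free space on the output filesystem can be checked before a run with Python's standard `shutil.disk_usage`. This is a minimal sketch; the helper name `has_free_space` is hypothetical, and the amount of space a given breseq run needs is not stated here, so `required_bytes` has to be supplied by the caller.

```python
import shutil


def has_free_space(path: str, required_bytes: int) -> bool:
    """True if the filesystem holding `path` has at least `required_bytes` free."""
    return shutil.disk_usage(path).free >= required_bytes


if __name__ == "__main__":
    usage = shutil.disk_usage(".")
    # Report free space on the filesystem of the current directory, in GiB.
    print(f"free: {usage.free / 2**30:.1f} GiB of {usage.total / 2**30:.1f} GiB")
```

The check only reflects the moment it runs; other jobs writing to the same filesystem, or a per-user quota, can still exhaust space later.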

(If not, you'll need to email me directly rather than hitting reply to this issue so I can share a folder with your email address to upload the input files for testing.)

ellieharrisn commented 2 years ago

Ok, that will be it. I was originally messing around with a small allowance but now I have 10TB to play with. I will rerun the phage only analysis with the .gz on the server to be sure but this sounds spot on. Thanks


jeffreybarrick commented 1 year ago

Looks like this was resolved, so closing issue.

ellieharrisn commented 1 year ago

Hi,

Thanks for your help. We are still trying to resolve it but it appears to be a memory issue with our servers rather than a BreSeq issue.

thanks

Ellie


ellieharrisn commented 1 year ago

Hi,

Just to let you know, the fastq.gz error was absolutely down to memory, and that part is now working fine. I still encounter the segmentation fault at the sort stage, though, and have tried a FASTA file instead of the .gff with the same result.
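To help tell a genuine crash from a process being killed by the system, the sort step can be rerun on its own and its exit status inspected. On POSIX systems, `subprocess` reports a negative return code when a child is terminated by a signal: a segmentation fault shows up as `SIGSEGV`, whereas a process stopped by the kernel's out-of-memory killer typically shows up as `SIGKILL`. The sketch below is only an illustration: it assumes a standalone `samtools` on the PATH (the log above comes from the copy invoked by breseq, which may behave differently), mirrors the `sort` options shown in that log, and uses hypothetical helper names.

```python
import signal
import subprocess
import sys
from typing import List


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a short human-readable description."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # Negative values mean the child was terminated by that signal number.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"terminated by {name}"
    return f"exited with status {returncode}"


def sort_command(unsorted_bam: str, sorted_bam: str, threads: int) -> List[str]:
    """Build a samtools sort command using the same options as the log above."""
    return ["samtools", "sort", "--threads", str(threads),
            "-o", sorted_bam, "-T", sorted_bam, unsorted_bam]


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (hypothetical script name): python check_sort.py best.unsorted.bam best.bam
    result = subprocess.run(sort_command(sys.argv[1], sys.argv[2], threads=8))
    print(describe_exit(result.returncode))
```

Repeating the run with `threads=1` is one way to see whether the failure depends on the thread count.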

Many thanks

Ellie


jeffreybarrick commented 1 year ago

If you can share a dataset with the problem (reads + reference + command line) so I can see if I can reproduce it, email me at the address in the breseq header, and I will create a shared folder for you to upload the large files to.