CMU-SAFARI / Apollo

Apollo is an assembly polishing algorithm that attempts to correct the errors in an assembly. It can take multiple sets of reads in a single run and polish the assemblies of genomes of any size. Described in the Bioinformatics journal paper (2020) by Firtina et al. at https://people.inf.ethz.ch/omutlu/pub/apollo-technology-independent-genome-assembly-polishing_bioinformatics20.pdf
GNU General Public License v3.0

CONSISTENT ERROR -FastaIndex: Record has inconsistent line lengths or line endings #8

Open desmodus1984 opened 2 years ago

desmodus1984 commented 2 years ago

Hi. I built an assembly and I am trying to polish it with Apollo. I installed it as instructed and followed all the steps. I converted the FASTQ files into one-line FASTA records:

head reads2.fasta

V300066187L4C001R0010000000/2
GTGTCAGATGTGTTATATAGCTTGATTTTAACCATTTAACCAATACATACATGAAGATATATACCCCAAATATATGCCATTTGTGTCAAGTATACCTGAA
V300066187L4C001R0010000014/2
ATCTGTATTTATACCAATTGATTTTAATCCTGTCAATTTCTATCGCAAAGGTTAGGGCGTTTCTTATCTCCATTCCAGGGAGTAAAGATTATGTAGCTTA
V300066187L4C001R0010000017/2
AAAGCTGCGCCCAAAACTCCCACCCGGCTAGACAGTTCAGTTCCTCTCCATATGTCACTGGATTTCCCCAAAGCCACTACCTGGTGCTGGAGCTCACCGG
V300066187L4C001R0010000029/2
GTTTCTGTTGAGAAATCGTTTGATAATCTGATGGGGGATCCTTTGTAGGTAACTCTCTGTTTCTCTCTTGCTGCCTTTAAGATTCTCTCTTTGTCTTGAA
V300066187L4C001R0010000038/2
TCTCACACTGATATTTTTTTCTCTCTCTCCCCTTCTCTCTCTCTCTAAAATCAATAAACATACCTTTGGGTGAGGATAAACAGAATAGTGCTTGTTTCTC

I did convert the SAM to BAM, and sorted and indexed it:

/users/PHS0338/jpac1984/appz/bwa-mem2-2.2.1_x64-linux/bwa-mem2 mem -t 48 Hapo -R '@RG\tID:PA113-1\tSM:bar\tPL:DNBSEQ' \
/fs/scratch/PHS0338/BGI-reads/reads_1.fq > PA113-1.sam
/fs/scratch/PHS0338/appz/samtools-1.14/samtools view -hb -@ 48 PA113-1.sam > PA113-1.bam
/fs/scratch/PHS0338/appz/samtools-1.14/samtools view -h -@ 48 -F4 PA113-1.bam | /fs/scratch/PHS0338/appz/samtools-1.14/samtools sort -@ 48 -m 3G -O bam -o PA113-1.sorted.bam
/fs/scratch/PHS0338/appz/samtools-1.14/samtools index -@ 12 PA113-1.sorted.bam

And I keep getting the same SeqAn error, which I do not know how to fix.

The log:

Assembly: /users/PHS0338/jpac1984/data/myse-hapog.fasta
Pair of a set of reads and their alignments:
/fs/scratch/PHS0338/BGI-reads/reads1.fasta, /fs/scratch/PHS0338/appz/sam-bams/PA113-1.sorted.bam
/fs/scratch/PHS0338/BGI-reads/reads2.fasta, /fs/scratch/PHS0338/appz/sam-bams/PA113-2.sorted.bam
Output file: myse-polished.fasta
Maximum consecutive insertions: 3
Maximum consecutive deletions: 10
Transition probability to match states: 0.85
Transition probability to insertion states: 0.1
Overall deletion transition probabilities from a state: 0.05
Deletion transition factor: 2.5
Emission probability of a matching character: 0.97
Emission probability of a substitution (i.e., mismatch) character: 0.01
Emission probability of an inserted character: 0.333333
Filter size: 100
Viterbi filter size: 5
Viterbi batch size: 5000
Read chunking size (0 for original length): 1000
Max thread: 48
terminate called after throwing an instance of 'seqan::ParseError'
what(): FastaIndex: Record has inconsistent line lengths or line endings
/var/spool/slurmd/job8594593/slurm_script: line 10: 45058 Aborted (core dumped) bin/apollo -a /users/PHS0338/jpac1984/data/myse-hapog.fasta -r /fs/scratch/PHS0338/BGI-reads/reads1.fasta -r /fs/scratch/PHS0338/BGI-reads/reads2.fasta -m /fs/scratch/PHS0338/appz/sam-bams/PA113-1.sorted.bam -m /fs/scratch/PHS0338/appz/sam-bams/PA113-2.sorted.bam -t 48 -o myse-polished.fasta

Any idea why it fails every time? I have all the required input files, and it still fails.
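The message in the log appears to come from building an index over a FASTA file, and it does not say which of the three FASTA inputs (the assembly or either read set) triggered it. Below is a minimal sketch of a check that can be run on each of them. It assumes the index follows the usual faidx-style rule that, within a record, every sequence line except the last has the same length and no line carries a different line ending; whether SeqAn applies exactly this rule is an assumption. `check_fasta_lines` is a hypothetical helper name, not part of Apollo or SeqAn.

```shell
# check_fasta_lines FILE
# Hypothetical helper: prints up to 10 problems and returns non-zero if any
# line would violate the assumed faidx-style layout rule.
check_fasta_lines() {
    awk '
        function report(msg) { bad++; if (bad <= 10) print "line " NR ": " msg }
        # A trailing carriage return means the file has CRLF line endings.
        /\r$/ { report("ends with a carriage return (CRLF line ending)") }
        # A header starts a new record and resets the per-record state.
        /^>/  { in_record = 1; seen = 0; short_seen = 0; next }
        {
            len = length($0)
            if (len == 0)   { report("blank line"); next }
            if (!in_record) { report("sequence data before any \">\" header"); next }
            # The first sequence line of a record fixes the expected width.
            if (!seen)      { width = len; seen = 1; next }
            # Only the last line of a record may be shorter than the width.
            if (short_seen || len > width) report("line length differs within the record")
            else if (len < width) short_seen = 1
        }
        END {
            if (bad > 10) print "(" bad " problems in total)"
            exit (bad > 0)
        }
    ' "$1"
}

# Example calls (paths taken from the log above):
#   check_fasta_lines /users/PHS0338/jpac1984/data/myse-hapog.fasta
#   check_fasta_lines /fs/scratch/PHS0338/BGI-reads/reads1.fasta
```

Running it on the assembly as well as on the reads is the point of the sketch: a clean read file does not rule out a wrapped or CRLF-terminated assembly.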

canfirtina commented 2 years ago

Hi @desmodus1984,

It seems the sequence identifiers (headers) do not start with '>', which is probably required for SeqAn to parse and index the FASTA file properly. Can you try adding '>' to the beginning of the sequence identifiers and rerunning Apollo? I would also check the encoding of your text file and look for unexpected hidden characters in your line endings, which may be corrupting your FASTA file.

You can potentially use seqtk seq to convert your FASTA file in a way that Apollo requires. It would hopefully resolve the issues that you may experience regarding formatting and line endings.

Best,

Can Firtina
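One way to rule out formatting problems, in the spirit of the seqtk suggestion above, is to rewrite each FASTA so that every record is one header line followed by one sequence line, with plain LF line endings. The following is a minimal awk sketch under that assumption. `normalize_fasta` is a hypothetical helper, the output file name in the usage comment is made up for illustration, and it has not been confirmed here that this is the exact layout Apollo requires.

```shell
# normalize_fasta FILE
# Hypothetical helper: writes a cleaned copy of FILE to standard output.
normalize_fasta() {
    awk '
        { sub(/\r$/, "") }          # drop a trailing carriage return (CRLF to LF)
        /^[ \t]*$/ { next }         # skip blank lines
        /^>/ {                      # header: flush the previous sequence first
            if (seq != "") print seq
            seq = ""
            print
            next
        }
        { seq = seq $0 }            # join wrapped sequence lines into one
        END { if (seq != "") print seq }
    ' "$1"
}

# Example call (output name is illustrative only):
#   normalize_fasta reads1.fasta > reads1.clean.fasta
```

With seqtk installed, `seqtk seq -l 0 in.fasta > out.fasta` should give a comparable single-line layout; the option is quoted from seqtk's usage text as remembered, so it is worth confirming against the installed version.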

desmodus1984 commented 2 years ago

Hi Firtina,

I checked my files again and they seem to be fine: sequence identifiers (headers) start with '>'.

head reads1.fasta
>V300066187L4C001R0010000000/1
AATGTAAATACATTTTTGTATCCTACTGTTTATTGTACTCTTATTACAGGCATTTTCCACTTTGTTCTGCAGTCTGTATTTTAAAAAATGCTATATTATC
>V300066187L4C001R0010000014/1
TGAGAAAGGTTGTTTCCCCAGGTAGGAATTTTCCCCTGAAGTTAGGGAGGGGATAAAGCCCCTTAACTAAGTGCCAGGTGGGTAGTTAATCACTTTAACT
>V300066187L4C001R0010000017/1
CCTAGCCCCACACCAGACCCCCAGCCCAGAGTCCAGAGCTGGGAAAATAAGTTACTGTAACTTCTGGCTATAAAAACCAGCGGGAACTGTGGCTGACTGA
>V300066187L4C001R0010000029/1
AGGGAGCTTCAGGACAACATGAAACGAAGTAACATACGCATAATAGGGCTGCAAGAAGGACAAGAAGAACAGCAAGGATTAGAAAATCTATTTGAAGAAA
>V300066187L4C001R0010000038/1
CACAGTATTTAACATGAGAATTTTTCACGTGTCAGGATAGAAAAGTTTAAATCAGCTCAAGGTTGATGACGATATAGAGAAACAAGCACTATTCTTTTTA

head reads2.fasta
>V300066187L4C001R0010000000/2
GTGTCAGATGTGTTATATAGCTTGATTTTAACCATTTAACCAATACATACATGAAGATATATACCCCAAATATATGCCATTTGTGTCAAGTATACCTGAA
>V300066187L4C001R0010000014/2
ATCTGTATTTATACCAATTGATTTTAATCCTGTCAATTTCTATCGCAAAGGTTAGGGCGTTTCTTATCTCCATTCCAGGGAGTAAAGATTATGTAGCTTA
>V300066187L4C001R0010000017/2
AAAGCTGCGCCCAAAACTCCCACCCGGCTAGACAGTTCAGTTCCTCTCCATATGTCACTGGATTTCCCCAAAGCCACTACCTGGTGCTGGAGCTCACCGG
>V300066187L4C001R0010000029/2
GTTTCTGTTGAGAAATCGTTTGATAATCTGATGGGGGATCCTTTGTAGGTAACTCTCTGTTTCTCTCTTGCTGCCTTTAAGATTCTCTCTTTGTCTTGAA
>V300066187L4C001R0010000038/2
TCTCACACTGATATTTTTTTCTCTCTCTCCCCTTCTCTCTCTCTCTAAAATCAATAAACATACCTTTGGGTGAGGATAAACAGAATAGTGCTTGTTTCTC

And I think the encoding is fine:

reads1.fasta: text/plain; charset=us-ascii
reads_1.fq: text/plain; charset=us-ascii
reads2.fasta: text/plain; charset=us-ascii
reads_2.fq: text/plain; charset=us-ascii

I converted my FASTQ files to FASTA using bioawk.
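The bioawk command used for the conversion is not shown in the thread. For comparison, here is a plain awk sketch of the same kind of conversion, assuming strictly four-line FASTQ records (header, sequence, `+`, quality) with no wrapped sequences. `fastq_to_fasta` is a hypothetical helper name.

```shell
# fastq_to_fasta FILE
# Hypothetical helper: writes one ">" header line and one sequence line
# per read to standard output. Assumes four lines per FASTQ record.
fastq_to_fasta() {
    awk '
        { sub(/\r$/, "") }                      # tolerate CRLF input
        NR % 4 == 1 {
            if (substr($0, 1, 1) != "@") {      # sanity check on the layout
                print "line " NR ": expected a header starting with @" | "cat 1>&2"
                exit 1
            }
            print ">" substr($0, 2)             # "@name" becomes ">name"
        }
        NR % 4 == 2 { print }                   # the sequence line
    ' "$1"
}

# Example call:
#   fastq_to_fasta reads_1.fq > reads1.fasta
```

The layout check matters because a FASTQ file with wrapped lines or a stray blank line would shift every later record and silently produce headers without a leading `>`.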

Is there any way to check for those "hidden characters"? I have no idea how to do that, and I would not expect bioawk to add any.

Best regards;

Juan Pablo Aguilar Cabezas

Ecology and Evolutionary Biology Ph.D. Candidate

Department of Biological Sciences

Ohio University, Athens OH
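On the question of how to look for hidden characters: a MIME-style report such as `charset=us-ascii` says nothing about line endings, whereas `file` without `-i` typically mentions "with CRLF line terminators" when they are present. A few standard-tool checks can make such characters visible. The sketch below assumes POSIX `awk`, `grep`, `head` and `sed`; the function names are hypothetical.

```shell
# count_cr_lines FILE: number of lines that end in a carriage return.
count_cr_lines() {
    awk '/\r$/ { n++ } END { print n + 0 }' "$1"
}

# count_odd_lines FILE: number of lines containing any byte outside
# printable ASCII (this includes carriage returns, tabs, other control
# characters, and non-ASCII bytes such as a UTF-8 byte order mark).
count_odd_lines() {
    # grep -c exits non-zero when the count is 0, hence the "|| true".
    LC_ALL=C grep -c '[^ -~]' "$1" || true
}

# show_line_ends FILE [N]: print the first N lines (default 4) with
# non-printing characters spelled out; each line end is marked with "$",
# so a CRLF ending shows up as "\r$". Long lines are folded with "\".
show_line_ends() {
    head -n "${2:-4}" "$1" | sed -n l
}

# Example calls:
#   count_cr_lines reads1.fasta
#   count_odd_lines reads1.fasta
#   show_line_ends reads1.fasta 4
```

Both counters printing `0` for the assembly and for each read file would make hidden characters an unlikely cause, which would leave the line lengths within each record as the next thing to check.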


From: Can Firtina
Sent: Monday, January 3, 2022 10:11 AM
To: CMU-SAFARI/Apollo
Cc: Aguilar Cabezas, Juan Pablo; Mention
Subject: Re: [CMU-SAFARI/Apollo] CONSISTENT ERROR -FastaIndex: Record has inconsistent line lengths or line endings (Issue #8)

