Closed juancresc closed 3 years ago
Yes, I've seen this before. It results from an improperly FASTA format in the genome file. The samtools faidx command that ShortStack is calling fails. You can verify by simply testing the command
samtools faidx files/data/Triticum_aestivum.TGACv1.dna.toplevel.fa
If I'm right you'll get the same error and a failure to make the .fai genome index.
When I've seen it before it has been one of two things:
One or more completely empty lines in the FASTA file (i.e. that match the regex ^$).
One or more sequence data lines of unequal length. For instance a line with 60 nts, followed by a line with 70nts, will cause samtools faidx to fail.
I also noticed that when I googled using just your genome file name 'Triticum_aestivum.TGACv1.dna.toplevel.fa' among the top hits were complaints from other people who've tried to work with that particular file. That is consistent with my hypothesis of an incorrectly formatted (or corrupted) FASTA file for this genome build.
Hope this helps and let me know.
Just to notice, the same run, twice one after another. I've also run samtools and works fine. I don't know why the first run throws the error.
Anyway working fine now
Run Progress and Messages:
Mon Dec 4 11:58:39 EST 2017
Genome has more than 50 segments and more than two of them are < 1Mb. Stitching short references to improve performance ...
Failed to make fai of stitched genome. Aborting Run
Run Progress and Messages:
Use of uninitialized value $fai_fields[1] in numeric lt (<) at ./files/libs/ShortStack/ShortStack line 1009, <FAI> line 136610.
Mon Dec 4 12:00:20 EST 2017
Genome has more than 50 segments and more than two of them are < 1Mb. Stitching short references to improve performance ...
Done
Weird. But I'm glad you are up and running.
On Mon, Dec 4, 2017 at 12:02 PM, juanmas07 notifications@github.com wrote:
Just to notice, the same run, twice one after another. I've also run samtools and works fine. I don't know why the first run throws the error.
Anyway working fine now
Run Progress and Messages:
Mon Dec 4 11:58:39 EST 2017 Genome has more than 50 segments and more than two of them are < 1Mb. Stitching short references to improve performance ... Failed to make fai of stitched genome. Aborting Run
Run Progress and Messages: Use of uninitialized value $fai_fields[1] in numeric lt (<) at ./files/libs/ShortStack/ShortStack line 1009,
line 136610. Mon Dec 4 12:00:20 EST 2017 Genome has more than 50 segments and more than two of them are < 1Mb. Stitching short references to improve performance ... Done
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/MikeAxtell/ShortStack/issues/67#issuecomment-349028284, or mute the thread https://github.com/notifications/unsubscribe-auth/AGiXieXfsKN3o2cunf0oyLFazvHLfnoYks5s9CWTgaJpZM4QyNnp .
-- Michael J. Axtell, Ph.D. Professor of Biology Penn State University http://sites.psu.edu/axtell
Seems like the issue is in the stiched file:
samtools faidx files/output/sstack_toplevelformatted_taes21/wheat_stitched.fasta
[fai_build_core] different line length in sequence 'stitch_54'.
I'm able to run samtools in the genomefile, but not in the stiched
Weird. Can you send me the url where I can find the exact genome file you are working with, so I replicate the error here? I'll mark this as a bug. Once I get the exact genome file I will be able to begin testing.
Thanks, Mike
On Tue, Dec 12, 2017 at 7:12 AM, juanmas07 notifications@github.com wrote:
Seems like the issue is in the stiched file:
samtools faidx files/output/sstack_toplevelformatted_taes21/wheat_stitched.fasta [fai_build_core] different line length in sequence 'stitch_54'.
I'm able to run samtools in the genomefile, but not in the stiched
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/MikeAxtell/ShortStack/issues/67#issuecomment-351034117, or mute the thread https://github.com/notifications/unsubscribe-auth/AGiXifD0rtlZrumk5zdhJkcZbMCl3ho2ks5s_m3DgaJpZM4QyNnp .
-- Michael J. Axtell, Ph.D. Professor of Biology Penn State University http://sites.psu.edu/axtell
I'm using this genome:
ftp://ftp.ensemblgenomes.org/pub/plants/release-37/fasta/triticum_aestivum/dna/Triticum_aestivum.TGACv1.dna.toplevel.fa.gz
I'm also using a formatted version, with the same results. For formatting I've used a biopython script:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
def FAfixer(input_fasta, output_fasta):
"""
"""
buffer_seqs = []
for seq_record in SeqIO.parse(input_fasta, "fasta"):
record = SeqRecord(seq_record.seq, id=seq_record.id, description=seq_record.description)
buffer_seqs.append(record)
SeqIO.write(buffer_seqs, output_fasta, "fasta")
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser()#pylint: disable=invalid-name
parser.add_argument("-i", "--input_fasta", help="(.fasta)", required=True)
parser.add_argument("-o", "--output_fasta", help="(.fasta)", required=True)
args = parser.parse_args()#pylint: disable=invalid-name
FAfixer(args.input_fasta, args.output_fasta)
Thanks I will take a look at this as soon as I can and get back to you
On Tue, Dec 12, 2017 at 7:38 AM, juanmas07 notifications@github.com wrote:
I'm using this genome:
ftp://ftp.ensemblgenomes.org/pub/plants/release-37/fasta/ triticum_aestivum/dna/Triticum_aestivum.TGACv1.dna.toplevel.fa.gz
I'm also using a formatted version, with the same results. For formatting I've used a biopython script:
!/usr/bin/env python
-- coding: utf-8 --
from Bio import SeqIO from Bio.SeqRecord import SeqRecord from Bio.Seq import Seq
def FAfixer(input_fasta, output_fasta): """ """ buffer_seqs = [] for seq_record in SeqIO.parse(input_fasta, "fasta"): record = SeqRecord(seq_record.seq, id=seq_record.id, description=seq_record.description) buffer_seqs.append(record) SeqIO.write(buffer_seqs, output_fasta, "fasta")
if name == "main": import argparse parser = argparse.ArgumentParser()#pylint: disable=invalid-name parser.add_argument("-i", "--input_fasta", help="(.fasta)", required=True) parser.add_argument("-o", "--output_fasta", help="(.fasta)", required=True) args = parser.parse_args()#pylint: disable=invalid-name FAfixer(args.input_fasta, args.output_fasta)
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/MikeAxtell/ShortStack/issues/67#issuecomment-351039454, or mute the thread https://github.com/notifications/unsubscribe-auth/AGiXibXY0ZhjjTkEAkhltps39f7NHx6Tks5s_nO4gaJpZM4QyNnp .
-- Michael J. Axtell, Ph.D. Professor of Biology Penn State University http://sites.psu.edu/axtell
I'm reviewing the stiched file, not quite sure why it is only 2.6 GB while the genome file is 34.9
Hi again.
I was unable to reproduce the error you report when using the version of the wheat genome you specified along with a publicly available set of wheat small RNA-seq data. I did not modify or polish the genome file in any way except to decompress it from .gz form after download.
Your message stated that the "...the genome file 34.9" [GB]. I'm not sure why that is .. the Triticum_aestivum.TGACv1.dna.toplevel.fa file I retrieved from ensembl at the link you provided is 13G in size when decompressed (13723165766 bytes to be exact), not 34.9G. I suspect your local copy of this genome file has been corrupted and or concatenated with something else.
For comparison my testing environment is:
ShortStack 3.8.3 samtools 1.4.1 RNAfold 2.3.5 bowtie 1.2.1.1
Hope this helps, Mike
On Thu, Dec 14, 2017 at 6:55 AM, juanmas07 notifications@github.com wrote:
I'm reviewing the stiched file, not quite sure why it is only 2.6 GB while the genome file is 34.9
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub, or mute the thread.
-- Michael J. Axtell, Ph.D. Professor of Biology Penn State University http://sites.psu.edu/axtell
Sorry that was a typo, the genome file is 13G. I'm running it again and adding all information I can
-- Juan Manuel Crescente Eng. Software development and IT management Bioinformatics - PhD Candidate @ INTA/CONICET Please consider the environment before printing this email.
On Thu, Dec 14, 2017 at 9:41 AM, Mike Axtell notifications@github.com wrote:
Hi again.
I was unable to reproduce the error you report when using the version of the wheat genome you specified along with a publicly available set of wheat small RNA-seq data. I did not modify or polish the genome file in any way except to decompress it from .gz form after download.
Your message stated that the "...the genome file 34.9" [GB]. I'm not sure why that is .. the Triticum_aestivum.TGACv1.dna.toplevel.fa file I retrieved from ensembl at the link you provided is 13G in size when decompressed (13723165766 bytes to be exact), not 34.9G. I suspect your local copy of this genome file has been corrupted and or concatenated with something else.
For comparison my testing environment is:
ShortStack 3.8.3 samtools 1.4.1 RNAfold 2.3.5 bowtie 1.2.1.1
Hope this helps, Mike
On Thu, Dec 14, 2017 at 6:55 AM, juanmas07 notifications@github.com wrote:
I'm reviewing the stiched file, not quite sure why it is only 2.6 GB while the genome file is 34.9
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub, or mute the thread.
-- Michael J. Axtell, Ph.D. Professor of Biology Penn State University http://sites.psu.edu/axtell
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/MikeAxtell/ShortStack/issues/67#issuecomment-351700258, or mute the thread https://github.com/notifications/unsubscribe-auth/ACCu0wwOAxv_L0WKUyLNi7vmEVIlKKxMks5tARdzgaJpZM4QyNnp .
This is my last run:
(wheat.fasta is the 13 GB genome file)
nohup ./files/libs/ShortStack/ShortStack --readfile files/output/TAEs.21.fasta --outdir files/output/sstack_toplevelformatted_taes21 --genomefile /dev/wheat.fasta --bowtie_cores 3 &
ShortStack version 3.8.4 Thu Dec 14 07:46:57 EST 2017 hostname: vzmarcosjuarezubuntu working directory: /home/trigo/runs/miRNA-MITE Settings: RNAfold_version: 2 bowtie_cores: 3 bowtie_m: 50 dicermax: 24 dicermin: 20 error_logfile: files/output/sstack_toplevelformattedtaes21/ErrorLogs.txt foldsize: 300 genomefile: /dev/wheat.fasta logfile: files/output/sstack toplevelformatted_taes21/Log.txt mincov: 0.5rpm mismatches: 1 mmap: u outdir: files/output/sstack_toplevelformatted_taes21 pad: 75 ranmax: 3 readfile: files/output/TAEs.21.fasta sort_mem: 768M strand_cutoff: 0.8 Run Progress and Messages: Thu Dec 14 07:46:58 EST 2017 Genome has more than 50 segments and more than two of them are < 1Mb. Stitching short references to improve performance ... [fai_build_core] different line length in sequence 'stitch_186'. Failed to make fai of stitched genome. Aborting Run
-- Juan Manuel Crescente Eng. Software development and IT management Bioinformatics - PhD Candidate @ INTA/CONICET Please consider the environment before printing this email.
On Thu, Dec 14, 2017 at 9:56 AM, JM C juan.crescente@gmail.com wrote:
Sorry that was a typo, the genome file is 13G. I'm running it again and adding all information I can
-- Juan Manuel Crescente Eng. Software development and IT management Bioinformatics - PhD Candidate @ INTA/CONICET Please consider the environment before printing this email.
On Thu, Dec 14, 2017 at 9:41 AM, Mike Axtell notifications@github.com wrote:
Hi again.
I was unable to reproduce the error you report when using the version of the wheat genome you specified along with a publicly available set of wheat small RNA-seq data. I did not modify or polish the genome file in any way except to decompress it from .gz form after download.
Your message stated that the "...the genome file 34.9" [GB]. I'm not sure why that is .. the Triticum_aestivum.TGACv1.dna.toplevel.fa file I retrieved from ensembl at the link you provided is 13G in size when decompressed (13723165766 bytes to be exact), not 34.9G. I suspect your local copy of this genome file has been corrupted and or concatenated with something else.
For comparison my testing environment is:
ShortStack 3.8.3 samtools 1.4.1 RNAfold 2.3.5 bowtie 1.2.1.1
Hope this helps, Mike
On Thu, Dec 14, 2017 at 6:55 AM, juanmas07 notifications@github.com wrote:
I'm reviewing the stiched file, not quite sure why it is only 2.6 GB while the genome file is 34.9
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub, or mute the thread.
-- Michael J. Axtell, Ph.D. Professor of Biology Penn State University http://sites.psu.edu/axtell
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/MikeAxtell/ShortStack/issues/67#issuecomment-351700258, or mute the thread https://github.com/notifications/unsubscribe-auth/ACCu0wwOAxv_L0WKUyLNi7vmEVIlKKxMks5tARdzgaJpZM4QyNnp .
Hi Juan,
I noticed you were using ShortStack 3.8.4. This is the latest commit on github but is not yet an "official" release. See https://github.com/MikeAxtell/ShortStack/releases for the official releases. That said, it shouldn't make a difference because the changes between the latest commit you used and the last official release won't affect this error. To make sure, I re-tested with the wheat genome on my system with 3.8.4. I had no problems completing a run and was not able to reproduce your error (Log.txt copied below).
So, I'm not sure what to tell you. I can't reproduce the problem here with the same genome. The error message you are getting comes from samtools faidx being unable to index the "stitched" genome, which is an internal construct that ShortStack makes to speed sequence retrievals for RNA folding (highly fragmented genomes really slow down sequence retrievals). The error indicates that the stitched genome is corrupted in some way such that samtools can't index it. But since I can't reproduce the error I'm not sure what else to do. I'm not convinced that it is a ShortStack bug.
You had mentioned that you pre-processed the genome and showed some code. I would suggest re-downloading the original genome fasta file, making sure the checksums are good, rebuilding the bowtie indices, and re-running.
ShortStack version 3.8.4
Thu Dec 14 09:43:44 EST 2017
hostname: comp-bc-0143.acib.production.int.aci.ics.psu.edu
working directory: /storage/work/mja18
Settings:
RNAfold_version: 2
bamfile: SS_1/SRR3721341.bam
dicermax: 24
dicermin: 20
error_logfile: SS_384/ErrorLogs.txt
foldsize: 300
genomefile: Triticum_aestivum.TGACv1.dna.toplevel.fa
logfile: SS_384/Log.txt
mincov: 0.5rpm
outdir: SS_384
pad: 75
sort_mem: 16G
strand_cutoff: 0.8
Run Progress and Messages:
Thu Dec 14 09:43:46 EST 2017
Genome has more than 50 segments and more than two of them are < 1Mb. Stitching short references to improve performance ...
Done
Tally of primary alignments (INCLUDES unmapped, but EXCLUDES secondary, duplicate,
failed QC, and supplementary alignments): 8892343
Tally of PLACED primary alignments (EXCLUDES unmapped, secondary, duplicate,
failed QC, and supplementary alignments): 5267484
Thu Dec 14 09:51:29 EST 2017
Performing de-novo cluster identification and analyses.
At specified mincov of 0.5rpm with 5267484 placed primary reads,
mincov is 3 raw alignments
Completed at Thu Dec 14 10:29:10 EST 2017
Performing search for unplaced small RNAs.
Completed at Thu Dec 14 10:30:09 EST 2017
Thu Dec 14 10:30:10 EST 2017
Tally of loci by predominant RNA size (DicerCall):
DicerCall NotMIRNA MIRNA
N or NA 8884 0
20 1150 2
21 8590 74
22 1829 19
23 2774 3
24 59828 1
Unplaced small RNAs
Size MultiMapped NoAlignments
<20 6437 5707
20 2513 4030
21 6320 18115
22 3057 7317
23 2744 2918
24 10008 6732
24 3629 6247
On Thu, Dec 14, 2017 at 8:00 AM, juanmas07 notifications@github.com wrote:
This is my last run:
(wheat.fasta is the 13 GB genome file)
nohup ./files/libs/ShortStack/ShortStack --readfile files/output/TAEs.21.fasta --outdir files/output/sstack_toplevelformatted_taes21 --genomefile /dev/wheat.fasta --bowtie_cores 3 &
ShortStack version 3.8.4 Thu Dec 14 07:46:57 EST 2017 hostname: vzmarcosjuarezubuntu working directory: /home/trigo/runs/miRNA-MITE Settings: RNAfold_version: 2 bowtie_cores: 3 bowtie_m: 50 dicermax: 24 dicermin: 20 error_logfile: files/output/sstack_toplevelformattedtaes21/ErrorLogs.txt foldsize: 300 genomefile: /dev/wheat.fasta logfile: files/output/sstack toplevelformatted_taes21/Log.txt mincov: 0.5rpm mismatches: 1 mmap: u outdir: files/output/sstack_toplevelformatted_taes21 pad: 75 ranmax: 3 readfile: files/output/TAEs.21.fasta sort_mem: 768M strand_cutoff: 0.8 Run Progress and Messages: Thu Dec 14 07:46:58 EST 2017 Genome has more than 50 segments and more than two of them are < 1Mb. Stitching short references to improve performance ... [fai_build_core] different line length in sequence 'stitch_186'. Failed to make fai of stitched genome. Aborting Run
-- Juan Manuel Crescente Eng. Software development and IT management Bioinformatics - PhD Candidate @ INTA/CONICET Please consider the environment before printing this email.
On Thu, Dec 14, 2017 at 9:56 AM, JM C juan.crescente@gmail.com wrote:
Sorry that was a typo, the genome file is 13G. I'm running it again and adding all information I can
-- Juan Manuel Crescente Eng. Software development and IT management Bioinformatics - PhD Candidate @ INTA/CONICET Please consider the environment before printing this email.
On Thu, Dec 14, 2017 at 9:41 AM, Mike Axtell notifications@github.com wrote:
Hi again.
I was unable to reproduce the error you report when using the version of the wheat genome you specified along with a publicly available set of wheat small RNA-seq data. I did not modify or polish the genome file in any way except to decompress it from .gz form after download.
Your message stated that the "...the genome file 34.9" [GB]. I'm not sure why that is .. the Triticum_aestivum.TGACv1.dna.toplevel.fa file I retrieved from ensembl at the link you provided is 13G in size when decompressed (13723165766 bytes to be exact), not 34.9G. I suspect your local copy of this genome file has been corrupted and or concatenated with something else.
For comparison my testing environment is:
ShortStack 3.8.3 samtools 1.4.1 RNAfold 2.3.5 bowtie 1.2.1.1
Hope this helps, Mike
On Thu, Dec 14, 2017 at 6:55 AM, juanmas07 notifications@github.com wrote:
I'm reviewing the stiched file, not quite sure why it is only 2.6 GB while the genome file is 34.9
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub, or mute the thread.
-- Michael J. Axtell, Ph.D. Professor of Biology Penn State University http://sites.psu.edu/axtell
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub
https://github.com/MikeAxtell/ShortStack/issues/67#issuecomment-351700258, or mute the thread
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub, or mute the thread.
-- Michael J. Axtell, Ph.D. Professor of Biology Penn State University http://sites.psu.edu/axtell
Dear Mike,
Thank you for your detailed explanation, I'll continue trying till I get it working. Once I have a solution I'll update you.
Thanks again and kind regards,
-- Juan Manuel Crescente Eng. Software development and IT management Bioinformatics - PhD Candidate @ INTA/CONICET Please consider the environment before printing this email.
On Thu, Dec 14, 2017 at 12:41 PM, Mike Axtell notifications@github.com wrote:
Hi Juan,
I noticed you were using ShortStack 3.8.4. This is the latest commit on github but is not yet an "official" release. See https://github.com/MikeAxtell/ShortStack/releases for the official releases. That said, it shouldn't make a difference because the changes between the latest commit you used and the last official release won't affect this error. To make sure, I re-tested with the wheat genome on my system with 3.8.4. I had no problems completing a run and was not able to reproduce your error (Log.txt copied below).
So, I'm not sure what to tell you. I can't reproduce the problem here with the same genome. The error message you are getting comes from samtools faidx being unable to index the "stitched" genome, which is an internal construct that ShortStack makes to speed sequence retrievals for RNA folding (highly fragmented genomes really slow down sequence retrievals). The error indicates that the stitched genome is corrupted in some way such that samtools can't index it. But since I can't reproduce the error I'm not sure what else to do. I'm not convinced that it is a ShortStack bug.
You had mentioned that you pre-processed the genome and showed some code. I would suggest re-downloading the original genome fasta file, making sure the checksums are good, rebuilding the bowtie indices, and re-running.
ShortStack version 3.8.4
Thu Dec 14 09:43:44 EST 2017
hostname: comp-bc-0143.acib.production.int.aci.ics.psu.edu
working directory: /storage/work/mja18
Settings:
RNAfold_version: 2
bamfile: SS_1/SRR3721341.bam
dicermax: 24
dicermin: 20
error_logfile: SS_384/ErrorLogs.txt
foldsize: 300
genomefile: Triticum_aestivum.TGACv1.dna.toplevel.fa
logfile: SS_384/Log.txt
mincov: 0.5rpm
outdir: SS_384
pad: 75
sort_mem: 16G
strand_cutoff: 0.8
Run Progress and Messages:
Thu Dec 14 09:43:46 EST 2017
Genome has more than 50 segments and more than two of them are < 1Mb. Stitching short references to improve performance ...
Done
Tally of primary alignments (INCLUDES unmapped, but EXCLUDES secondary, duplicate,
failed QC, and supplementary alignments): 8892343
Tally of PLACED primary alignments (EXCLUDES unmapped, secondary, duplicate,
failed QC, and supplementary alignments): 5267484
Thu Dec 14 09:51:29 EST 2017
Performing de-novo cluster identification and analyses.
At specified mincov of 0.5rpm with 5267484 placed primary reads,
mincov is 3 raw alignments
Completed at Thu Dec 14 10:29:10 EST 2017
Performing search for unplaced small RNAs.
Completed at Thu Dec 14 10:30:09 EST 2017
Thu Dec 14 10:30:10 EST 2017
Tally of loci by predominant RNA size (DicerCall):
DicerCall NotMIRNA MIRNA
N or NA 8884 0
20 1150 2
21 8590 74
22 1829 19
23 2774 3
24 59828 1
Unplaced small RNAs
Size MultiMapped NoAlignments
<20 6437 5707
20 2513 4030
21 6320 18115
22 3057 7317
23 2744 2918
24 10008 6732
24 3629 6247
On Thu, Dec 14, 2017 at 8:00 AM, juanmas07 notifications@github.com wrote:
This is my last run:
(wheat.fasta is the 13 GB genome file)
nohup ./files/libs/ShortStack/ShortStack --readfile files/output/TAEs.21.fasta --outdir files/output/sstack_toplevelformatted_taes21 --genomefile /dev/wheat.fasta --bowtie_cores 3 &
ShortStack version 3.8.4 Thu Dec 14 07:46:57 EST 2017 hostname: vzmarcosjuarezubuntu working directory: /home/trigo/runs/miRNA-MITE Settings: RNAfold_version: 2 bowtie_cores: 3 bowtie_m: 50 dicermax: 24 dicermin: 20 error_logfile: files/output/sstack_toplevelformattedtaes21/ErrorLogs.txt foldsize: 300 genomefile: /dev/wheat.fasta logfile: files/output/sstack toplevelformatted_taes21/Log.txt mincov: 0.5rpm mismatches: 1 mmap: u
outdir: files/output/sstack_toplevelformatted_taes21 pad: 75 ranmax: 3 readfile: files/output/TAEs.21.fasta sort_mem: 768M strand_cutoff: 0.8 Run Progress and Messages: Thu Dec 14 07:46:58 EST 2017 Genome has more than 50 segments and more than two of them are < 1Mb. Stitching short references to improve performance ... [fai_build_core] different line length in sequence 'stitch_186'. Failed to make fai of stitched genome. Aborting Run
-- Juan Manuel Crescente Eng. Software development and IT management Bioinformatics - PhD Candidate @ INTA/CONICET Please consider the environment before printing this email.
On Thu, Dec 14, 2017 at 9:56 AM, JM C juan.crescente@gmail.com wrote:
Sorry that was a typo, the genome file is 13G. I'm running it again and adding all information I can
-- Juan Manuel Crescente Eng. Software development and IT management Bioinformatics - PhD Candidate @ INTA/CONICET Please consider the environment before printing this email.
On Thu, Dec 14, 2017 at 9:41 AM, Mike Axtell notifications@github.com wrote:
Hi again.
I was unable to reproduce the error you report when using the version of the wheat genome you specified along with a publicly available set of wheat small RNA-seq data. I did not modify or polish the genome file in any way except to decompress it from .gz form after download.
Your message stated that the "...the genome file 34.9" [GB]. I'm not sure why that is .. the Triticum_aestivum.TGACv1.dna.toplevel.fa file I retrieved from ensembl at the link you provided is 13G in size when decompressed (13723165766 bytes to be exact), not 34.9G. I suspect your local copy of this genome file has been corrupted and or concatenated with something else.
For comparison my testing environment is:
ShortStack 3.8.3 samtools 1.4.1 RNAfold 2.3.5 bowtie 1.2.1.1
Hope this helps, Mike
On Thu, Dec 14, 2017 at 6:55 AM, juanmas07 notifications@github.com wrote:
I'm reviewing the stiched file, not quite sure why it is only 2.6 GB while the genome file is 34.9
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub, or mute the thread.
-- Michael J. Axtell, Ph.D. Professor of Biology Penn State University http://sites.psu.edu/axtell
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub
https://github.com/MikeAxtell/ShortStack/issues/ 67#issuecomment-351700258, or mute the thread
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub, or mute the thread.
-- Michael J. Axtell, Ph.D. Professor of Biology Penn State University http://sites.psu.edu/axtell
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/MikeAxtell/ShortStack/issues/67#issuecomment-351747357, or mute the thread https://github.com/notifications/unsubscribe-auth/ACCu0ylKK1UATFbFgmKLWowKVGfBOs9cks5tAUGrgaJpZM4QyNnp .
I am going through old issues today; did you ever resolve this one?
He mike sorry for the delay. I still haven't had the chance to run it again, but I will soon and post an update here. Cheers
-- Juan Manuel Crescente Eng. Software development and IT management Bioinformatics - PhD Candidate @ INTA/CONICET Please consider the environment before printing this email.
On Thu, Jan 25, 2018 at 12:03 PM, Mike Axtell notifications@github.com wrote:
I am going through old issues today; did you ever resolve this one?
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/MikeAxtell/ShortStack/issues/67#issuecomment-360492416, or mute the thread https://github.com/notifications/unsubscribe-auth/ACCu01ncG31TBglgfMg4LN-X3SoVazgKks5tOJfIgaJpZM4QyNnp .
I was able to run the pipeline with the genome with no problems.
Hi all! Im having the exact same problem as the one reported here but im not able to solve it.
Im using an internal reference genome that was done at my company. The only thing is that this reference is at scaffold level and it contains around 10000 scaffolds but it has been validated and it works fine, but not sure if it could be the main reason for the problem.
In one of the coments this was mention as a possible source of the problem: "One or more sequence data lines of unequal length. For instance a line with 60 nts, followed by a line with 70nts, will cause samtools faidx to fail" But this is something that happens in any fasta reference right? all the lines in each chr/scaffold are 60 nts except for the last one that will have different length (between 1 - 60 nts).
When ShortStack is trying to create the index from the stitched file, samtools faidx crash, and this also happens when I try to run it directly with samtools faidx. However im able to run samtools faidx in my reference.fasta. I have check the stitched file and there were some empty lines that I have removed but still doesnt work.
Here I report the error while running ShortStack: ################### Genome has more than 50 segments and more than two of them are < 1Mb. Stitching short references to improve performance ... Failed to make fai of stitched genome. Aborting Run ################### Here I report the error while running samtools faidx: ################### [fai_build_core] different line length in sequence 'stitch_0'. Could not build fai index ${path}/reference_stitched.fasta.fai ###################
Any idea why this is happening and how to solve it? Thanks!!
Thanks for the report. Because you can successfully samtools faidx
your original genome file, it does indeed sound like the "genome stitching" is to blame somehow; it must, under some circumstances, be making odd-length sequence lines.
If it's possible, you could share your genome file with me (directly email me might be best here), and I could try to figure it out. I don't think I'll be able to figure out the bug fix without the example case.
btw you are right that the last sequence line in any multi-FASTA file can be a shorter length. samtools faidx
doesn't complain about that .. it complains about internal line-length deviations.
Anyway if you are able to post or send me the offending genome as a test case, I can try and run it down.
Another quick idea you can look at .. ShortStack assumes (and never bother checking, bad on me) that your .fasta file has linux-like line endings \n
. If you have DOS-type \r\n
line endings, it might be the reason stitching is getting fouled up. Anyway you could quickly convert your .fasta file to force it to linux-style line breaks and see if that helps .. just a thought.
WOW!!! Thanks so much for your fast reply!! I have done dos2unix in order to get linux-like line ending and it worked!!!!!!!!!! The stitched index file has been created! :) Now ShortStack is running!!!
Thanks so much again!! you make my day!!!
Paloma
Glad to hear it.
I'm getting this while running:
This is the command I've used to run:
./files/libs/ShortStack/ShortStack --readfile files/output/TAEs.21.fasta --outdir files/output/sstack2 --genomefile files/data/Triticum_aestivum.TGACv1.dna.toplevel.fa --bowtie_cores 3