MikeAxtell / ShortStack

ShortStack: Comprehensive annotation and quantification of small RNA genes
MIT License
88 stars 29 forks source link

Failed to make fai #67

Closed juancresc closed 3 years ago

juancresc commented 6 years ago

I'm getting this while running:

Genome has more than 50 segments and more than two of them are < 1Mb. Stitching short references to improve performance ...
[fai_build_core] different line length in sequence 'stitch_167'.
        Failed to make fai of stitched genome. Aborting Run

This is the command I've used to run:

./files/libs/ShortStack/ShortStack --readfile files/output/TAEs.21.fasta --outdir files/output/sstack2 --genomefile files/data/Triticum_aestivum.TGACv1.dna.toplevel.fa --bowtie_cores 3

MikeAxtell commented 6 years ago

Yes, I've seen this before. It results from an improperly FASTA format in the genome file. The samtools faidx command that ShortStack is calling fails. You can verify by simply testing the command

samtools faidx files/data/Triticum_aestivum.TGACv1.dna.toplevel.fa

If I'm right you'll get the same error and a failure to make the .fai genome index.

When I've seen it before it has been one of two things:

  1. One or more completely empty lines in the FASTA file (i.e. that match the regex ^$).

  2. One or more sequence data lines of unequal length. For instance a line with 60 nts, followed by a line with 70nts, will cause samtools faidx to fail.

I also noticed that when I googled using just your genome file name 'Triticum_aestivum.TGACv1.dna.toplevel.fa' among the top hits were complaints from other people who've tried to work with that particular file. That is consistent with my hypothesis of an incorrectly formatted (or corrupted) FASTA file for this genome build.

Hope this helps and let me know.

juancresc commented 6 years ago

Just to notice, the same run, twice one after another. I've also run samtools and works fine. I don't know why the first run throws the error.

Anyway working fine now

Run Progress and Messages:

Mon Dec  4 11:58:39 EST 2017
Genome has more than 50 segments and more than two of them are < 1Mb. Stitching short references to improve performance ...
        Failed to make fai of stitched genome. Aborting Run
Run Progress and Messages:
Use of uninitialized value $fai_fields[1] in numeric lt (<) at ./files/libs/ShortStack/ShortStack line 1009, <FAI> line 136610.

Mon Dec  4 12:00:20 EST 2017
Genome has more than 50 segments and more than two of them are < 1Mb. Stitching short references to improve performance ...
        Done
MikeAxtell commented 6 years ago

Weird. But I'm glad you are up and running.

On Mon, Dec 4, 2017 at 12:02 PM, juanmas07 notifications@github.com wrote:

Just to notice, the same run, twice one after another. I've also run samtools and works fine. I don't know why the first run throws the error.

Anyway working fine now

Run Progress and Messages:

Mon Dec 4 11:58:39 EST 2017 Genome has more than 50 segments and more than two of them are < 1Mb. Stitching short references to improve performance ... Failed to make fai of stitched genome. Aborting Run

Run Progress and Messages: Use of uninitialized value $fai_fields[1] in numeric lt (<) at ./files/libs/ShortStack/ShortStack line 1009, line 136610.

Mon Dec 4 12:00:20 EST 2017 Genome has more than 50 segments and more than two of them are < 1Mb. Stitching short references to improve performance ... Done

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/MikeAxtell/ShortStack/issues/67#issuecomment-349028284, or mute the thread https://github.com/notifications/unsubscribe-auth/AGiXieXfsKN3o2cunf0oyLFazvHLfnoYks5s9CWTgaJpZM4QyNnp .

-- Michael J. Axtell, Ph.D. Professor of Biology Penn State University http://sites.psu.edu/axtell

juancresc commented 6 years ago

Seems like the issue is in the stiched file:


samtools faidx files/output/sstack_toplevelformatted_taes21/wheat_stitched.fasta 
[fai_build_core] different line length in sequence 'stitch_54'.

I'm able to run samtools in the genomefile, but not in the stiched

MikeAxtell commented 6 years ago

Weird. Can you send me the url where I can find the exact genome file you are working with, so I replicate the error here? I'll mark this as a bug. Once I get the exact genome file I will be able to begin testing.

Thanks, Mike

On Tue, Dec 12, 2017 at 7:12 AM, juanmas07 notifications@github.com wrote:

Seems like the issue is in the stiched file:

samtools faidx files/output/sstack_toplevelformatted_taes21/wheat_stitched.fasta [fai_build_core] different line length in sequence 'stitch_54'.

I'm able to run samtools in the genomefile, but not in the stiched

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/MikeAxtell/ShortStack/issues/67#issuecomment-351034117, or mute the thread https://github.com/notifications/unsubscribe-auth/AGiXifD0rtlZrumk5zdhJkcZbMCl3ho2ks5s_m3DgaJpZM4QyNnp .

-- Michael J. Axtell, Ph.D. Professor of Biology Penn State University http://sites.psu.edu/axtell

juancresc commented 6 years ago

I'm using this genome:

ftp://ftp.ensemblgenomes.org/pub/plants/release-37/fasta/triticum_aestivum/dna/Triticum_aestivum.TGACv1.dna.toplevel.fa.gz

I'm also using a formatted version, with the same results. For formatting I've used a biopython script:


#!/usr/bin/env python
# -*- coding: utf-8 -*-
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq

def FAfixer(input_fasta, output_fasta):
    """
    """
    buffer_seqs = []
    for seq_record in SeqIO.parse(input_fasta, "fasta"):
        record = SeqRecord(seq_record.seq, id=seq_record.id, description=seq_record.description)
        buffer_seqs.append(record)
    SeqIO.write(buffer_seqs, output_fasta, "fasta")

if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser()#pylint: disable=invalid-name
    parser.add_argument("-i", "--input_fasta", help="(.fasta)", required=True)
    parser.add_argument("-o", "--output_fasta", help="(.fasta)", required=True)
    args = parser.parse_args()#pylint: disable=invalid-name
    FAfixer(args.input_fasta, args.output_fasta)
MikeAxtell commented 6 years ago

Thanks I will take a look at this as soon as I can and get back to you

On Tue, Dec 12, 2017 at 7:38 AM, juanmas07 notifications@github.com wrote:

I'm using this genome:

ftp://ftp.ensemblgenomes.org/pub/plants/release-37/fasta/ triticum_aestivum/dna/Triticum_aestivum.TGACv1.dna.toplevel.fa.gz

I'm also using a formatted version, with the same results. For formatting I've used a biopython script:

!/usr/bin/env python

-- coding: utf-8 --

from Bio import SeqIO from Bio.SeqRecord import SeqRecord from Bio.Seq import Seq

def FAfixer(input_fasta, output_fasta): """ """ buffer_seqs = [] for seq_record in SeqIO.parse(input_fasta, "fasta"): record = SeqRecord(seq_record.seq, id=seq_record.id, description=seq_record.description) buffer_seqs.append(record) SeqIO.write(buffer_seqs, output_fasta, "fasta")

if name == "main": import argparse parser = argparse.ArgumentParser()#pylint: disable=invalid-name parser.add_argument("-i", "--input_fasta", help="(.fasta)", required=True) parser.add_argument("-o", "--output_fasta", help="(.fasta)", required=True) args = parser.parse_args()#pylint: disable=invalid-name FAfixer(args.input_fasta, args.output_fasta)

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/MikeAxtell/ShortStack/issues/67#issuecomment-351039454, or mute the thread https://github.com/notifications/unsubscribe-auth/AGiXibXY0ZhjjTkEAkhltps39f7NHx6Tks5s_nO4gaJpZM4QyNnp .

-- Michael J. Axtell, Ph.D. Professor of Biology Penn State University http://sites.psu.edu/axtell

juancresc commented 6 years ago

I'm reviewing the stiched file, not quite sure why it is only 2.6 GB while the genome file is 34.9

MikeAxtell commented 6 years ago

Hi again.

I was unable to reproduce the error you report when using the version of the wheat genome you specified along with a publicly available set of wheat small RNA-seq data. I did not modify or polish the genome file in any way except to decompress it from .gz form after download.

Your message stated that the "...the genome file 34.9" [GB]. I'm not sure why that is .. the Triticum_aestivum.TGACv1.dna.toplevel.fa file I retrieved from ensembl at the link you provided is 13G in size when decompressed (13723165766 bytes to be exact), not 34.9G. I suspect your local copy of this genome file has been corrupted and or concatenated with something else.

For comparison my testing environment is:

ShortStack 3.8.3 samtools 1.4.1 RNAfold 2.3.5 bowtie 1.2.1.1

Hope this helps, Mike

On Thu, Dec 14, 2017 at 6:55 AM, juanmas07 notifications@github.com wrote:

I'm reviewing the stiched file, not quite sure why it is only 2.6 GB while the genome file is 34.9

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub, or mute the thread.

-- Michael J. Axtell, Ph.D. Professor of Biology Penn State University http://sites.psu.edu/axtell

juancresc commented 6 years ago

Sorry that was a typo, the genome file is 13G. I'm running it again and adding all information I can

-- Juan Manuel Crescente Eng. Software development and IT management Bioinformatics - PhD Candidate @ INTA/CONICET Please consider the environment before printing this email.

On Thu, Dec 14, 2017 at 9:41 AM, Mike Axtell notifications@github.com wrote:

Hi again.

I was unable to reproduce the error you report when using the version of the wheat genome you specified along with a publicly available set of wheat small RNA-seq data. I did not modify or polish the genome file in any way except to decompress it from .gz form after download.

Your message stated that the "...the genome file 34.9" [GB]. I'm not sure why that is .. the Triticum_aestivum.TGACv1.dna.toplevel.fa file I retrieved from ensembl at the link you provided is 13G in size when decompressed (13723165766 bytes to be exact), not 34.9G. I suspect your local copy of this genome file has been corrupted and or concatenated with something else.

For comparison my testing environment is:

ShortStack 3.8.3 samtools 1.4.1 RNAfold 2.3.5 bowtie 1.2.1.1

Hope this helps, Mike

On Thu, Dec 14, 2017 at 6:55 AM, juanmas07 notifications@github.com wrote:

I'm reviewing the stiched file, not quite sure why it is only 2.6 GB while the genome file is 34.9

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub, or mute the thread.

-- Michael J. Axtell, Ph.D. Professor of Biology Penn State University http://sites.psu.edu/axtell

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/MikeAxtell/ShortStack/issues/67#issuecomment-351700258, or mute the thread https://github.com/notifications/unsubscribe-auth/ACCu0wwOAxv_L0WKUyLNi7vmEVIlKKxMks5tARdzgaJpZM4QyNnp .

juancresc commented 6 years ago

This is my last run:

(wheat.fasta is the 13 GB genome file)

nohup ./files/libs/ShortStack/ShortStack --readfile files/output/TAEs.21.fasta --outdir files/output/sstack_toplevelformatted_taes21 --genomefile /dev/wheat.fasta --bowtie_cores 3 &

ShortStack version 3.8.4 Thu Dec 14 07:46:57 EST 2017 hostname: vzmarcosjuarezubuntu working directory: /home/trigo/runs/miRNA-MITE Settings: RNAfold_version: 2 bowtie_cores: 3 bowtie_m: 50 dicermax: 24 dicermin: 20 error_logfile: files/output/sstack_toplevelformattedtaes21/ErrorLogs.txt foldsize: 300 genomefile: /dev/wheat.fasta logfile: files/output/sstack toplevelformatted_taes21/Log.txt mincov: 0.5rpm mismatches: 1 mmap: u outdir: files/output/sstack_toplevelformatted_taes21 pad: 75 ranmax: 3 readfile: files/output/TAEs.21.fasta sort_mem: 768M strand_cutoff: 0.8 Run Progress and Messages: Thu Dec 14 07:46:58 EST 2017 Genome has more than 50 segments and more than two of them are < 1Mb. Stitching short references to improve performance ... [fai_build_core] different line length in sequence 'stitch_186'. Failed to make fai of stitched genome. Aborting Run

-- Juan Manuel Crescente Eng. Software development and IT management Bioinformatics - PhD Candidate @ INTA/CONICET Please consider the environment before printing this email.

On Thu, Dec 14, 2017 at 9:56 AM, JM C juan.crescente@gmail.com wrote:

Sorry that was a typo, the genome file is 13G. I'm running it again and adding all information I can

-- Juan Manuel Crescente Eng. Software development and IT management Bioinformatics - PhD Candidate @ INTA/CONICET Please consider the environment before printing this email.

On Thu, Dec 14, 2017 at 9:41 AM, Mike Axtell notifications@github.com wrote:

Hi again.

I was unable to reproduce the error you report when using the version of the wheat genome you specified along with a publicly available set of wheat small RNA-seq data. I did not modify or polish the genome file in any way except to decompress it from .gz form after download.

Your message stated that the "...the genome file 34.9" [GB]. I'm not sure why that is .. the Triticum_aestivum.TGACv1.dna.toplevel.fa file I retrieved from ensembl at the link you provided is 13G in size when decompressed (13723165766 bytes to be exact), not 34.9G. I suspect your local copy of this genome file has been corrupted and or concatenated with something else.

For comparison my testing environment is:

ShortStack 3.8.3 samtools 1.4.1 RNAfold 2.3.5 bowtie 1.2.1.1

Hope this helps, Mike

On Thu, Dec 14, 2017 at 6:55 AM, juanmas07 notifications@github.com wrote:

I'm reviewing the stiched file, not quite sure why it is only 2.6 GB while the genome file is 34.9

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub, or mute the thread.

-- Michael J. Axtell, Ph.D. Professor of Biology Penn State University http://sites.psu.edu/axtell

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/MikeAxtell/ShortStack/issues/67#issuecomment-351700258, or mute the thread https://github.com/notifications/unsubscribe-auth/ACCu0wwOAxv_L0WKUyLNi7vmEVIlKKxMks5tARdzgaJpZM4QyNnp .

MikeAxtell commented 6 years ago

Hi Juan,

I noticed you were using ShortStack 3.8.4. This is the latest commit on github but is not yet an "official" release. See https://github.com/MikeAxtell/ShortStack/releases for the official releases. That said, it shouldn't make a difference because the changes between the latest commit you used and the last official release won't affect this error. To make sure, I re-tested with the wheat genome on my system with 3.8.4. I had no problems completing a run and was not able to reproduce your error (Log.txt copied below).

So, I'm not sure what to tell you. I can't reproduce the problem here with the same genome. The error message you are getting comes from samtools faidx being unable to index the "stitched" genome, which is an internal construct that ShortStack makes to speed sequence retrievals for RNA folding (highly fragmented genomes really slow down sequence retrievals). The error indicates that the stitched genome is corrupted in some way such that samtools can't index it. But since I can't reproduce the error I'm not sure what else to do. I'm not convinced that it is a ShortStack bug.

You had mentioned that you pre-processed the genome and showed some code. I would suggest re-downloading the original genome fasta file, making sure the checksums are good, rebuilding the bowtie indices, and re-running.

ShortStack version 3.8.4

Thu Dec 14 09:43:44 EST 2017

hostname: comp-bc-0143.acib.production.int.aci.ics.psu.edu

working directory: /storage/work/mja18

Settings:

RNAfold_version: 2

bamfile: SS_1/SRR3721341.bam

dicermax: 24

dicermin: 20

error_logfile: SS_384/ErrorLogs.txt

foldsize: 300

genomefile: Triticum_aestivum.TGACv1.dna.toplevel.fa

logfile: SS_384/Log.txt

mincov: 0.5rpm

outdir: SS_384

pad: 75

sort_mem: 16G

strand_cutoff: 0.8

Run Progress and Messages:

Thu Dec 14 09:43:46 EST 2017

Genome has more than 50 segments and more than two of them are < 1Mb. Stitching short references to improve performance ...

Done

Tally of primary alignments (INCLUDES unmapped, but EXCLUDES secondary, duplicate,

failed QC, and supplementary alignments): 8892343

Tally of PLACED primary alignments (EXCLUDES unmapped, secondary, duplicate,

failed QC, and supplementary alignments): 5267484

Thu Dec 14 09:51:29 EST 2017

Performing de-novo cluster identification and analyses.

At specified mincov of 0.5rpm with 5267484 placed primary reads,

mincov is 3 raw alignments

Completed at Thu Dec 14 10:29:10 EST 2017

Performing search for unplaced small RNAs.

Completed at Thu Dec 14 10:30:09 EST 2017

Thu Dec 14 10:30:10 EST 2017

Tally of loci by predominant RNA size (DicerCall):

DicerCall NotMIRNA MIRNA

N or NA 8884 0

20 1150 2

21 8590 74

22 1829 19

23 2774 3

24 59828 1

Unplaced small RNAs

Size MultiMapped NoAlignments

<20 6437 5707

20 2513 4030

21 6320 18115

22 3057 7317

23 2744 2918

24 10008 6732

24 3629 6247

On Thu, Dec 14, 2017 at 8:00 AM, juanmas07 notifications@github.com wrote:

This is my last run:

(wheat.fasta is the 13 GB genome file)

nohup ./files/libs/ShortStack/ShortStack --readfile files/output/TAEs.21.fasta --outdir files/output/sstack_toplevelformatted_taes21 --genomefile /dev/wheat.fasta --bowtie_cores 3 &

ShortStack version 3.8.4 Thu Dec 14 07:46:57 EST 2017 hostname: vzmarcosjuarezubuntu working directory: /home/trigo/runs/miRNA-MITE Settings: RNAfold_version: 2 bowtie_cores: 3 bowtie_m: 50 dicermax: 24 dicermin: 20 error_logfile: files/output/sstack_toplevelformattedtaes21/ErrorLogs.txt foldsize: 300 genomefile: /dev/wheat.fasta logfile: files/output/sstack toplevelformatted_taes21/Log.txt mincov: 0.5rpm mismatches: 1 mmap: u outdir: files/output/sstack_toplevelformatted_taes21 pad: 75 ranmax: 3 readfile: files/output/TAEs.21.fasta sort_mem: 768M strand_cutoff: 0.8 Run Progress and Messages: Thu Dec 14 07:46:58 EST 2017 Genome has more than 50 segments and more than two of them are < 1Mb. Stitching short references to improve performance ... [fai_build_core] different line length in sequence 'stitch_186'. Failed to make fai of stitched genome. Aborting Run

-- Juan Manuel Crescente Eng. Software development and IT management Bioinformatics - PhD Candidate @ INTA/CONICET Please consider the environment before printing this email.

On Thu, Dec 14, 2017 at 9:56 AM, JM C juan.crescente@gmail.com wrote:

Sorry that was a typo, the genome file is 13G. I'm running it again and adding all information I can

-- Juan Manuel Crescente Eng. Software development and IT management Bioinformatics - PhD Candidate @ INTA/CONICET Please consider the environment before printing this email.

On Thu, Dec 14, 2017 at 9:41 AM, Mike Axtell notifications@github.com wrote:

Hi again.

I was unable to reproduce the error you report when using the version of the wheat genome you specified along with a publicly available set of wheat small RNA-seq data. I did not modify or polish the genome file in any way except to decompress it from .gz form after download.

Your message stated that the "...the genome file 34.9" [GB]. I'm not sure why that is .. the Triticum_aestivum.TGACv1.dna.toplevel.fa file I retrieved from ensembl at the link you provided is 13G in size when decompressed (13723165766 bytes to be exact), not 34.9G. I suspect your local copy of this genome file has been corrupted and or concatenated with something else.

For comparison my testing environment is:

ShortStack 3.8.3 samtools 1.4.1 RNAfold 2.3.5 bowtie 1.2.1.1

Hope this helps, Mike

On Thu, Dec 14, 2017 at 6:55 AM, juanmas07 notifications@github.com wrote:

I'm reviewing the stiched file, not quite sure why it is only 2.6 GB while the genome file is 34.9

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub, or mute the thread.

-- Michael J. Axtell, Ph.D. Professor of Biology Penn State University http://sites.psu.edu/axtell

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub

https://github.com/MikeAxtell/ShortStack/issues/67#issuecomment-351700258, or mute the thread

https://github.com/notifications/unsubscribe-auth/ACCu0wwOAxv_L0WKUyLNi7vmEVIlKKxMks5tARdzgaJpZM4QyNnp .

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub, or mute the thread.

-- Michael J. Axtell, Ph.D. Professor of Biology Penn State University http://sites.psu.edu/axtell

juancresc commented 6 years ago

Dear Mike,

Thank you for your detailed explanation, I'll continue trying till I get it working. Once I have a solution I'll update you.

Thanks again and kind regards,

-- Juan Manuel Crescente Eng. Software development and IT management Bioinformatics - PhD Candidate @ INTA/CONICET Please consider the environment before printing this email.

On Thu, Dec 14, 2017 at 12:41 PM, Mike Axtell notifications@github.com wrote:

Hi Juan,

I noticed you were using ShortStack 3.8.4. This is the latest commit on github but is not yet an "official" release. See https://github.com/MikeAxtell/ShortStack/releases for the official releases. That said, it shouldn't make a difference because the changes between the latest commit you used and the last official release won't affect this error. To make sure, I re-tested with the wheat genome on my system with 3.8.4. I had no problems completing a run and was not able to reproduce your error (Log.txt copied below).

So, I'm not sure what to tell you. I can't reproduce the problem here with the same genome. The error message you are getting comes from samtools faidx being unable to index the "stitched" genome, which is an internal construct that ShortStack makes to speed sequence retrievals for RNA folding (highly fragmented genomes really slow down sequence retrievals). The error indicates that the stitched genome is corrupted in some way such that samtools can't index it. But since I can't reproduce the error I'm not sure what else to do. I'm not convinced that it is a ShortStack bug.

You had mentioned that you pre-processed the genome and showed some code. I would suggest re-downloading the original genome fasta file, making sure the checksums are good, rebuilding the bowtie indices, and re-running.

ShortStack version 3.8.4

Thu Dec 14 09:43:44 EST 2017

hostname: comp-bc-0143.acib.production.int.aci.ics.psu.edu

working directory: /storage/work/mja18

Settings:

RNAfold_version: 2

bamfile: SS_1/SRR3721341.bam

dicermax: 24

dicermin: 20

error_logfile: SS_384/ErrorLogs.txt

foldsize: 300

genomefile: Triticum_aestivum.TGACv1.dna.toplevel.fa

logfile: SS_384/Log.txt

mincov: 0.5rpm

outdir: SS_384

pad: 75

sort_mem: 16G

strand_cutoff: 0.8

Run Progress and Messages:

Thu Dec 14 09:43:46 EST 2017

Genome has more than 50 segments and more than two of them are < 1Mb. Stitching short references to improve performance ...

Done

Tally of primary alignments (INCLUDES unmapped, but EXCLUDES secondary, duplicate,

failed QC, and supplementary alignments): 8892343

Tally of PLACED primary alignments (EXCLUDES unmapped, secondary, duplicate,

failed QC, and supplementary alignments): 5267484

Thu Dec 14 09:51:29 EST 2017

Performing de-novo cluster identification and analyses.

At specified mincov of 0.5rpm with 5267484 placed primary reads,

mincov is 3 raw alignments

Completed at Thu Dec 14 10:29:10 EST 2017

Performing search for unplaced small RNAs.

Completed at Thu Dec 14 10:30:09 EST 2017

Thu Dec 14 10:30:10 EST 2017

Tally of loci by predominant RNA size (DicerCall):

DicerCall NotMIRNA MIRNA

N or NA 8884 0

20 1150 2

21 8590 74

22 1829 19

23 2774 3

24 59828 1

Unplaced small RNAs

Size MultiMapped NoAlignments

<20 6437 5707

20 2513 4030

21 6320 18115

22 3057 7317

23 2744 2918

24 10008 6732

24 3629 6247

On Thu, Dec 14, 2017 at 8:00 AM, juanmas07 notifications@github.com wrote:

This is my last run:

(wheat.fasta is the 13 GB genome file)

nohup ./files/libs/ShortStack/ShortStack --readfile files/output/TAEs.21.fasta --outdir files/output/sstack_toplevelformatted_taes21 --genomefile /dev/wheat.fasta --bowtie_cores 3 &

ShortStack version 3.8.4 Thu Dec 14 07:46:57 EST 2017 hostname: vzmarcosjuarezubuntu working directory: /home/trigo/runs/miRNA-MITE Settings: RNAfold_version: 2 bowtie_cores: 3 bowtie_m: 50 dicermax: 24 dicermin: 20 error_logfile: files/output/sstack_toplevelformattedtaes21/ErrorLogs.txt foldsize: 300 genomefile: /dev/wheat.fasta logfile: files/output/sstack toplevelformatted_taes21/Log.txt mincov: 0.5rpm mismatches: 1 mmap: u

outdir: files/output/sstack_toplevelformatted_taes21 pad: 75 ranmax: 3 readfile: files/output/TAEs.21.fasta sort_mem: 768M strand_cutoff: 0.8 Run Progress and Messages: Thu Dec 14 07:46:58 EST 2017 Genome has more than 50 segments and more than two of them are < 1Mb. Stitching short references to improve performance ... [fai_build_core] different line length in sequence 'stitch_186'. Failed to make fai of stitched genome. Aborting Run

-- Juan Manuel Crescente Eng. Software development and IT management Bioinformatics - PhD Candidate @ INTA/CONICET Please consider the environment before printing this email.

On Thu, Dec 14, 2017 at 9:56 AM, JM C juan.crescente@gmail.com wrote:

Sorry that was a typo, the genome file is 13G. I'm running it again and adding all information I can

-- Juan Manuel Crescente Eng. Software development and IT management Bioinformatics - PhD Candidate @ INTA/CONICET Please consider the environment before printing this email.

On Thu, Dec 14, 2017 at 9:41 AM, Mike Axtell notifications@github.com wrote:

Hi again.

I was unable to reproduce the error you report when using the version of the wheat genome you specified along with a publicly available set of wheat small RNA-seq data. I did not modify or polish the genome file in any way except to decompress it from .gz form after download.

Your message stated that the "...the genome file 34.9" [GB]. I'm not sure why that is .. the Triticum_aestivum.TGACv1.dna.toplevel.fa file I retrieved from ensembl at the link you provided is 13G in size when decompressed (13723165766 bytes to be exact), not 34.9G. I suspect your local copy of this genome file has been corrupted and or concatenated with something else.

For comparison my testing environment is:

ShortStack 3.8.3 samtools 1.4.1 RNAfold 2.3.5 bowtie 1.2.1.1

Hope this helps, Mike

On Thu, Dec 14, 2017 at 6:55 AM, juanmas07 notifications@github.com wrote:

I'm reviewing the stiched file, not quite sure why it is only 2.6 GB while the genome file is 34.9

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub, or mute the thread.

-- Michael J. Axtell, Ph.D. Professor of Biology Penn State University http://sites.psu.edu/axtell

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub

https://github.com/MikeAxtell/ShortStack/issues/ 67#issuecomment-351700258, or mute the thread

https://github.com/notifications/unsubscribe-auth/ACCu0wwOAxv_ L0WKUyLNi7vmEVIlKKxMks5tARdzgaJpZM4QyNnp .

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub, or mute the thread.

-- Michael J. Axtell, Ph.D. Professor of Biology Penn State University http://sites.psu.edu/axtell

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/MikeAxtell/ShortStack/issues/67#issuecomment-351747357, or mute the thread https://github.com/notifications/unsubscribe-auth/ACCu0ylKK1UATFbFgmKLWowKVGfBOs9cks5tAUGrgaJpZM4QyNnp .

MikeAxtell commented 6 years ago

I am going through old issues today; did you ever resolve this one?

juancresc commented 6 years ago

He mike sorry for the delay. I still haven't had the chance to run it again, but I will soon and post an update here. Cheers

-- Juan Manuel Crescente Eng. Software development and IT management Bioinformatics - PhD Candidate @ INTA/CONICET Please consider the environment before printing this email.

On Thu, Jan 25, 2018 at 12:03 PM, Mike Axtell notifications@github.com wrote:

I am going through old issues today; did you ever resolve this one?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/MikeAxtell/ShortStack/issues/67#issuecomment-360492416, or mute the thread https://github.com/notifications/unsubscribe-auth/ACCu01ncG31TBglgfMg4LN-X3SoVazgKks5tOJfIgaJpZM4QyNnp .

juancresc commented 6 years ago

I was able to run the pipeline with the genome with no problems.

palperbel commented 3 years ago

Hi all! Im having the exact same problem as the one reported here but im not able to solve it.

Im using an internal reference genome that was done at my company. The only thing is that this reference is at scaffold level and it contains around 10000 scaffolds but it has been validated and it works fine, but not sure if it could be the main reason for the problem.

In one of the coments this was mention as a possible source of the problem: "One or more sequence data lines of unequal length. For instance a line with 60 nts, followed by a line with 70nts, will cause samtools faidx to fail" But this is something that happens in any fasta reference right? all the lines in each chr/scaffold are 60 nts except for the last one that will have different length (between 1 - 60 nts).

When ShortStack is trying to create the index from the stitched file, samtools faidx crash, and this also happens when I try to run it directly with samtools faidx. However im able to run samtools faidx in my reference.fasta. I have check the stitched file and there were some empty lines that I have removed but still doesnt work.

Here I report the error while running ShortStack: ################### Genome has more than 50 segments and more than two of them are < 1Mb. Stitching short references to improve performance ... Failed to make fai of stitched genome. Aborting Run ################### Here I report the error while running samtools faidx: ################### [fai_build_core] different line length in sequence 'stitch_0'. Could not build fai index ${path}/reference_stitched.fasta.fai ###################

Any idea why this is happening and how to solve it? Thanks!!

MikeAxtell commented 3 years ago

Thanks for the report. Because you can successfully samtools faidx your original genome file, it does indeed sound like the "genome stitching" is to blame somehow; it must, under some circumstances, be making odd-length sequence lines.

If it's possible, you could share your genome file with me (directly email me might be best here), and I could try to figure it out. I don't think I'll be able to figure out the bug fix without the example case.

btw you are right that the last sequence line in any multi-FASTA file can be a shorter length. samtools faidx doesn't complain about that .. it complains about internal line-length deviations.

Anyway if you are able to post or send me the offending genome as a test case, I can try and run it down.

MikeAxtell commented 3 years ago

Another quick idea you can look at .. ShortStack assumes (and never bother checking, bad on me) that your .fasta file has linux-like line endings \n. If you have DOS-type \r\n line endings, it might be the reason stitching is getting fouled up. Anyway you could quickly convert your .fasta file to force it to linux-style line breaks and see if that helps .. just a thought.

palperbel commented 3 years ago

WOW!!! Thanks so much for your fast reply!! I have done dos2unix in order to get linux-like line ending and it worked!!!!!!!!!! The stitched index file has been created! :) Now ShortStack is running!!!

Thanks so much again!! you make my day!!!

Paloma

MikeAxtell commented 3 years ago

Glad to hear it.