RafaelMViana commented 9 years ago

Dear Mike, First of all, thank you for your ShortStack program. I really like all the output information that it provides. my issue is about the folding step. All the clusters analyzed for folding fail and no one goes further than step 7

sample N3 N5 N6 N7 total clusters
Sample1 47,743 1,853 2,650 21,724 73,970 Sample2 65,174 2,115 4,255 26,484 98,028 Sample3 52,013 1,976 5,433 21,953 81,375

It is indicated in the manual that failure at N7 can be (possible bug?). I wonder if it is a bug from shortstack, a problem with one of the versions of the programs i am using to support shortstack, or a problem with my samples and genome. I have checked some output sequences in mirbase, and I have found previously reported miRNAs. I know that ShortStack is designed to avoid FP, but it looks suspicious to me that the folding always fails in the same step

I am working with the the big fragmented genome of Norway Spruce (Picea Abies) which has round 10e6 contigs

I Trimmed my reads and selected sizes between 17-28nt. After that, I filtered against rRNA and tRNA, Then, I used shortstack for mapping and sRNA analysis

program versions:

ShortStack v3.0 perl v5.18.2 samtools v1.2 bowtie v0.12.9 bowtie-build v0.12.9 gzip v1.6 RNAfold v2.1.0

mapping output

      mapped(%)         Non mapped(%)

Sample1 2,637,848 (74.0) 923,575 (26.0) Sample2 3,786,372 (79.6) 973,801 (20.4) Sample3 3,258,433 (78.9) 873,168 (21.2)

ShortStack Settings

bowtie_m: 50 dicermax: 28 dicermin: 17 mincov: 2 mismatches: 1 mmap: f nostitch: 1 pad: 75 ranmax: 10

I would also appreciate your advice for any downstream analysis of such data

Thanks in advance for your time and advice

MikeAxtell commented 9 years ago

Thanks. We haven't seen that bug before. The 'N7' failure is an internal failure to get anything out of the RNAfold call for a locus. It shouldn't happen.

Can you send me the Log.txt file from this failed run and I will start from there...

MikeAxtell commented 9 years ago

Can you test this command on the system you are using?

echo AAAAAAAAAAAAAACCCCCCUUUUUUUUUUUUUUUUUU | RNAfold --noPS > test_stdout.txt 2> test_stderr.txt

What do you see in the test_stdout.txt and test_stderr.txt files? You should see the RNAfold result in the stdout, and nothing in the stderr. If not, then we have isolated the problem.

RafaelMViana commented 9 years ago

Hi Mike, It happened as you told:

stderr: empty

stdout:

AAAAAAAAAAAAAACCCCCCUUUUUUUUUUUUUUUUUU ((((((((((((((......)))))))))))))).... ( -6.60)

############################

Dr. Rafael M. Viana

Uppsala Biocenter, Almas Allé 5 Department of Plant Biology / Växtbiologi Swedish University of Agricultural Sciences

75651 Uppsala Sweden

############################

From: Mike Axtell notifications@github.com Sent: Friday, September 4, 2015 2:56 PM To: MikeAxtell/ShortStack Cc: Rafael Muñoz Subject: Re: [ShortStack] RNA folding always fails at step7 (#25)

Can you test this command on the system you are using?

echo AAAAAAAAAAAAAACCCCCCUUUUUUUUUUUUUUUUUU | RNAfold --noPS > test_stdout.txt 2> test_stderr.txt

What do you see in the test_stdout.txt and test_stderr.txt files? You should see the RNAfold result in the stdout, and nothing in the stderr. If not, then we have isolated the problem.

Reply to this email directly or view it on GitHubhttps://github.com/MikeAxtell/ShortStack/issues/25#issuecomment-137728933.

RafaelMViana commented 9 years ago

Here you have the logs from 3 different samples

############################

Dr. Rafael M. Viana

Uppsala Biocenter, Almas Allé 5 Department of Plant Biology / Växtbiologi Swedish University of Agricultural Sciences

75651 Uppsala Sweden

############################

From: Mike Axtell notifications@github.com Sent: Friday, September 4, 2015 2:53 PM To: MikeAxtell/ShortStack Cc: Rafael Muñoz Subject: Re: [ShortStack] RNA folding always fails at step7 (#25)

Thanks. We haven't seen that bug before. The 'N7' failure is an internal failure to get anything out of the RNAfold call for a locus. It shouldn't happen.

Can you send me the Log.txt file from this failed run and I will start from there...

Reply to this email directly or view it on GitHubhttps://github.com/MikeAxtell/ShortStack/issues/25#issuecomment-137728438.

ShortStack version 3.0 Thu Aug 20 17:54:53 CEST 2015 hostname: vbsg-hennig5 working directory: /media/diskc

Settings: RNAfold_version: 2 bowtie_cores: 20 bowtie_m: 50 dicermax: 28 dicermin: 17 genomefile: /media/diskc/genomes/spruce/full_spruce/bowtie1_index/full_spruce.fa logfile: /media/diskc/Spruce-sRNA/reads/spruce_1/rRNA_filtered/local/tRNA_filtered/map_full/ShortStack_spruce_1/Log.txt mincov: 2 mismatches: 1 mmap: f nostitch: 1 outdir: /media/diskc/Spruce-sRNA/reads/spruce_1/rRNA_filtered/local/tRNA_filtered/map_full/ShortStack_spruce_1 pad: 75 ranmax: 10 readfile: /media/diskc/Spruce-sRNA/reads/spruce_1/rRNA_filtered/local/tRNA_filtered/Spruce1_size_17-28_tRNA_rRNA_Picea_clean.fastq

Run Progress and Messages:

Thu Aug 20 17:54:53 CEST 2015 Starting alignment of /media/diskc/Spruce-sRNA/reads/spruce_1/rRNA_filtered/local/tRNA_filtered/Spruce1_size_17-28_tRNA_rRNA_Picea_clean.fastq with bowtie command: bowtie -q -v 1 -p 20 -S -a -m 50 --best --strata --sam-RG ID:Spruce1_size_17-28_tRNA_rRNA_Picea_clean /media/diskc/genomes/spruce/full_spruce/bowtie1_index/full_spruce - < /media/diskc/Spruce-sRNA/reads/spruce_1/rRNA_filtered/local/tRNA_filtered/Spruce1_size_17-28_tRNA_rRNA_Picea_clean.fastq Completed. Results are in temporary file /media/diskc/Spruce-sRNA/reads/spruce_1/rRNA_filtered/local/tRNA_filtered/map_full/ShortStack_spruce_1/Spruce1_size_17-28_tRNA_rRNA_Picea_clean_readsorted.sam.gz pending final processing Unique mappers: 937891 / 3561423 (26.3 %) Multi mappers: 1699957 / 3561423 (47.7 %) Multi mappers ignored and marked as unmapped: 119240 / 3561423 (3.3 %) Non mappers: 804335 / 3561423 (22.6 %)

Thu Aug 20 18:02:20 CEST 2015 Processing and sorting alignments Working on /media/diskc/Spruce-sRNA/reads/spruce_1/rRNA_filtered/local/tRNA_filtered/map_full/ShortStack_spruce_1/Spruce1_size_17-28_tRNA_rRNA_Picea_clean_readsorted.sam.gz Summary of primary alignments: XY:Z:N -- Unmapped because no valid alignments: 804335 / 3561423 (22.6 %) XY:Z:M -- Unmapped because alignment number exceeded option bowtie_m 50: 119240 / 3561423 (3.3 %) XY:Z:O -- Unmapped because alignment number exceeded option ranmax 10 and no guidance was possible: 3783 / 3561423 (0.1 %) XY:Z:U -- Uniquely mapped: 937891 / 3561423 (26.3 %) XY:Z:R -- Multi-mapped with primary alignment chosen randomly: 51046 / 3561423 (1.4 %) XY:Z:P -- Multi-mapped with primary alignment chosen based on f: 1645128 / 3561423 (46.2 %) Alignment completed. Final bam file is /media/diskc/Spruce-sRNA/reads/spruce_1/rRNA_filtered/local/tRNA_filtered/map_full/ShortStack_spruce_1/Spruce1_size_17-28_tRNA_rRNA_Picea_clean.bam

Thu Aug 20 18:12:07 CEST 2015 Performing de-novo cluster identification and analyses.

Sun Aug 23 21:21:57 CEST 2015 Tally of loci by predominant RNA size (DicerCall):

DicerCall NotMIRNA MIRNA N or NA 0 0 17 14172 0 18 10157 0 19 8280 0 20 7068 0 21 14423 0 22 4186 0 23 3630 0 24 5613 0 25 2342 0 26 1732 0 27 1386 0 28 981 0

ShortStack version 3.0 Sun Aug 23 21:22:15 CEST 2015 hostname: vbsg-hennig5 working directory: /media/diskc

Settings: RNAfold_version: 2 bowtie_cores: 20 bowtie_m: 50 dicermax: 28 dicermin: 17 genomefile: /media/diskc/genomes/spruce/full_spruce/bowtie1_index/full_spruce.fa logfile: /media/diskc/Spruce-sRNA/reads/spruce_2/rRNA_filtered/local/tRNA_filtered/map_full/ShortStack_spruce_2/Log.txt mincov: 2 mismatches: 1 mmap: f nostitch: 1 outdir: /media/diskc/Spruce-sRNA/reads/spruce_2/rRNA_filtered/local/tRNA_filtered/map_full/ShortStack_spruce_2 pad: 75 ranmax: 10 readfile: /media/diskc/Spruce-sRNA/reads/spruce_2/rRNA_filtered/local/tRNA_filtered/Spruce2_size_17-28_tRNA_rRNA_Picea_clean.fastq

Run Progress and Messages:

Sun Aug 23 21:22:15 CEST 2015 Starting alignment of /media/diskc/Spruce-sRNA/reads/spruce_2/rRNA_filtered/local/tRNA_filtered/Spruce2_size_17-28_tRNA_rRNA_Picea_clean.fastq with bowtie command: bowtie -q -v 1 -p 20 -S -a -m 50 --best --strata --sam-RG ID:Spruce2_size_17-28_tRNA_rRNA_Picea_clean /media/diskc/genomes/spruce/full_spruce/bowtie1_index/full_spruce - < /media/diskc/Spruce-sRNA/reads/spruce_2/rRNA_filtered/local/tRNA_filtered/Spruce2_size_17-28_tRNA_rRNA_Picea_clean.fastq Completed. Results are in temporary file /media/diskc/Spruce-sRNA/reads/spruce_2/rRNA_filtered/local/tRNA_filtered/map_full/ShortStack_spruce_2/Spruce2_size_17-28_tRNA_rRNA_Picea_clean_readsorted.sam.gz pending final processing Unique mappers: 1322228 / 4760173 (27.8 %) Multi mappers: 2464144 / 4760173 (51.8 %) Multi mappers ignored and marked as unmapped: 195901 / 4760173 (4.1 %) Non mappers: 777900 / 4760173 (16.3 %)

Sun Aug 23 21:41:29 CEST 2015 Processing and sorting alignments Working on /media/diskc/Spruce-sRNA/reads/spruce_2/rRNA_filtered/local/tRNA_filtered/map_full/ShortStack_spruce_2/Spruce2_size_17-28_tRNA_rRNA_Picea_clean_readsorted.sam.gz Summary of primary alignments: XY:Z:N -- Unmapped because no valid alignments: 777900 / 4760173 (16.3 %) XY:Z:M -- Unmapped because alignment number exceeded option bowtie_m 50: 195901 / 4760173 (4.1 %) XY:Z:O -- Unmapped because alignment number exceeded option ranmax 10 and no guidance was possible: 3397 / 4760173 (0.1 %) XY:Z:U -- Uniquely mapped: 1322228 / 4760173 (27.8 %) XY:Z:R -- Multi-mapped with primary alignment chosen randomly: 87098 / 4760173 (1.8 %) XY:Z:P -- Multi-mapped with primary alignment chosen based on f: 2373649 / 4760173 (49.9 %) Alignment completed. Final bam file is /media/diskc/Spruce-sRNA/reads/spruce_2/rRNA_filtered/local/tRNA_filtered/map_full/ShortStack_spruce_2/Spruce2_size_17-28_tRNA_rRNA_Picea_clean.bam

Sun Aug 23 21:54:26 CEST 2015 Performing de-novo cluster identification and analyses.

Thu Aug 27 17:15:38 CEST 2015 Tally of loci by predominant RNA size (DicerCall):

DicerCall NotMIRNA MIRNA N or NA 0 0 17 18994 0 18 13310 0 19 10929 0 20 9307 0 21 17572 0 22 5762 0 23 5167 0 24 10441 0 25 2891 0 26 1940 0 27 1226 0 28 489 0

ShortStack version 3.0 Thu Aug 27 17:16:00 CEST 2015 hostname: vbsg-hennig5 working directory: /media/diskc

Settings: RNAfold_version: 2 bowtie_cores: 20 bowtie_m: 50 dicermax: 28 dicermin: 17 genomefile: /media/diskc/genomes/spruce/full_spruce/bowtie1_index/full_spruce.fa logfile: /media/diskc/Spruce-sRNA/reads/spruce_3/rRNA_filtered/local/tRNA_filtered/map_full/ShortStack_spruce_3/Log.txt mincov: 2 mismatches: 1 mmap: f nostitch: 1 outdir: /media/diskc/Spruce-sRNA/reads/spruce_3/rRNA_filtered/local/tRNA_filtered/map_full/ShortStack_spruce_3 pad: 75 ranmax: 10 readfile: /media/diskc/Spruce-sRNA/reads/spruce_3/rRNA_filtered/local/tRNA_filtered/Spruce3_size_17-28_tRNA_rRNA_Picea_clean.fastq

Run Progress and Messages:

Thu Aug 27 17:16:00 CEST 2015 Starting alignment of /media/diskc/Spruce-sRNA/reads/spruce_3/rRNA_filtered/local/tRNA_filtered/Spruce3_size_17-28_tRNA_rRNA_Picea_clean.fastq with bowtie command: bowtie -q -v 1 -p 20 -S -a -m 50 --best --strata --sam-RG ID:Spruce3_size_17-28_tRNA_rRNA_Picea_clean /media/diskc/genomes/spruce/full_spruce/bowtie1_index/full_spruce - < /media/diskc/Spruce-sRNA/reads/spruce_3/rRNA_filtered/local/tRNA_filtered/Spruce3_size_17-28_tRNA_rRNA_Picea_clean.fastq Completed. Results are in temporary file /media/diskc/Spruce-sRNA/reads/spruce_3/rRNA_filtered/local/tRNA_filtered/map_full/ShortStack_spruce_3/Spruce3_size_17-28_tRNA_rRNA_Picea_clean_readsorted.sam.gz pending final processing Unique mappers: 1201737 / 4131601 (29.1 %) Multi mappers: 2056696 / 4131601 (49.8 %) Multi mappers ignored and marked as unmapped: 296566 / 4131601 (7.2 %) Non mappers: 576602 / 4131601 (14.0 %)

Thu Aug 27 17:24:35 CEST 2015 Processing and sorting alignments Working on /media/diskc/Spruce-sRNA/reads/spruce_3/rRNA_filtered/local/tRNA_filtered/map_full/ShortStack_spruce_3/Spruce3_size_17-28_tRNA_rRNA_Picea_clean_readsorted.sam.gz Summary of primary alignments: XY:Z:N -- Unmapped because no valid alignments: 576602 / 4131601 (14.0 %) XY:Z:M -- Unmapped because alignment number exceeded option bowtie_m 50: 296566 / 4131601 (7.2 %) XY:Z:O -- Unmapped because alignment number exceeded option ranmax 10 and no guidance was possible: 3563 / 4131601 (0.1 %) XY:Z:U -- Uniquely mapped: 1201737 / 4131601 (29.1 %) XY:Z:R -- Multi-mapped with primary alignment chosen randomly: 47010 / 4131601 (1.1 %) XY:Z:P -- Multi-mapped with primary alignment chosen based on f: 2006123 / 4131601 (48.6 %) Alignment completed. Final bam file is /media/diskc/Spruce-sRNA/reads/spruce_3/rRNA_filtered/local/tRNA_filtered/map_full/ShortStack_spruce_3/Spruce3_size_17-28_tRNA_rRNA_Picea_clean.bam

Thu Aug 27 17:35:04 CEST 2015 Performing de-novo cluster identification and analyses.

Sun Aug 30 21:56:56 CEST 2015 Tally of loci by predominant RNA size (DicerCall):

DicerCall NotMIRNA MIRNA N or NA 0 0 17 14124 0 18 10069 0 19 8367 0 20 7556 0 21 17028 0 22 5499 0 23 5059 0 24 7204 0 25 3173 0 26 2067 0 27 895 0 28 334 0

MikeAxtell commented 9 years ago

Darn.

Are you able to share one of the alignment files (for instance, media/diskc/Spruce-sRNA/reads/spruce_3/rRNA_filtered/local/tRNA_filtered/map_full/ShortStack_spruce_3/Spruce3_size_17-28_tRNA_rRNA_Picea_clean.bam) along with the genome (/media/diskc/genomes/spruce/full_spruce/bowtie1_index/full_spruce.fa) with me (you could use dropbox of google drive) .. that way I can here on my own system to try and reproduce and isolate this issue.

Since we've ruled out problem problem with RNAfold I think it must be that somehow no genomic intervals are being retrieved.

MikeAxtell commented 9 years ago

Hi again. Can you paste the contents of the genome index (the .fai file)?

RafaelMViana commented 9 years ago

The spruce genome is quite huge

The .fasta is 12.7 GB The .fai is 421.9 MB

The first 10 lines of the .fai look like this

MA_20601_len=208095 208095 21 80 81 MA_20603_len=198197 198197 210739 80 81 MA_159042_len=194057_putative_contaminant 194057 411457 80 81 MA_20595_len=168074 168074 607961 80 81 MA_20424_len=165328 165328 778157 80 81 MA_19111_len=163407 163407 945573 80 81 MA_20598_len=162097 162097 1111044 80 81 MA_10435771_len=161784 161784 1275192 80 81 MA_18863_len=161531 161531 1439020 80 81 MA_10432209_len=161390 161390 1602595 80 81

############################

Dr. Rafael M. Viana

Uppsala Biocenter, Almas Allé 5 Department of Plant Biology / Växtbiologi Swedish University of Agricultural Sciences

75651 Uppsala Sweden

############################

From: Mike Axtell notifications@github.com Sent: Friday, September 4, 2015 3:41 PM To: MikeAxtell/ShortStack Cc: Rafael Muñoz Subject: Re: [ShortStack] RNA folding always fails at step7 (#25)

Hi again. Can you paste the contents of the genome index (the .fai file)?

Reply to this email directly or view it on GitHubhttps://github.com/MikeAxtell/ShortStack/issues/25#issuecomment-137740992.

MikeAxtell commented 9 years ago

Yowza that is big.

Can you test on your system whether the samtools faidx command can do sequence retrieval? I wonder about chromosome names with equal signs .. whether that might be causing an error.

Example (assuming the genome is indexed)

samtools faidx genome.fasta MA_20603_len=198197:1000-1500

You should get back a fasta chunk corresponding to positions 1000-1500 of chromosome MA_20603_len=198197

If you don't, then we've isolated the problem.

RafaelMViana commented 9 years ago

Yes I get it

MA_20603_len=198197:1000-1500 GGGTGCTAAAGACTTAATATAAAATATTCTCTTTAGCCGACTTTACTTCTGGGGTTGTGT CCAAGTATTTTCTCGCTTCCGAAATATTCTAATCTCATCAAAGGTCACAAATGGAGGTTA ATACCTCCCGCAACTACCATAATATCAACACTCCCTCCTAGCTAGGGAGGTATAAAAGGT AATGGTCTTGAGCACAAACAAATCATTACAAACAAATCATTCGACTGGTAGCATCATCAT AGCCACACATGCTATGTCATGATCAAATCATGAGCACCGTCAATATTATCCAACTGGATA ATCTCCATGACAATTAGTACAATGTCATATGATAGAACACAACTGTGTCTATCCATAGTA GTACATAACTACTATCCAGTGGTACACAACCACAACAAGTAGCACACGACTACTTAATCA TTGACACATAGTCACAACCAGTGGAACACATCCACCACGGTAACATACATCTACTATCCA GAGACATACAGTCTCCATAGC

############################

Dr. Rafael M. Viana

Uppsala Biocenter, Almas Allé 5 Department of Plant Biology / Växtbiologi Swedish University of Agricultural Sciences

75651 Uppsala Sweden

############################

From: Mike Axtell notifications@github.com Sent: Friday, September 4, 2015 3:54 PM To: MikeAxtell/ShortStack Cc: Rafael Muñoz Subject: Re: [ShortStack] RNA folding always fails at step7 (#25)

Yowza that is big.

Can you test on your system whether the samtools faidx command can do sequence retrieval? I wonder about chromosome names with equal signs .. whether that might be causing an error.

Example (assuming the genome is indexed)

samtools faidx genome.fasta MA_20603_len=198197:1000-1500

You should get back a fasta chunk corresponding to positions 1000-1500 of chromosome MA_20603_len=198197

If you don't, then we've isolated the problem.

Reply to this email directly or view it on GitHubhttps://github.com/MikeAxtell/ShortStack/issues/25#issuecomment-137743625.

MikeAxtell commented 9 years ago

Thanks. So I am out ideas here. Can you send a bam file and the associated genome so I can test locally? My regular email is mja18@psu.edu

RafaelMViana commented 9 years ago

Thanks Mike,

I have to ask my boss about sending the whole .bams

On the other hand, I am trying to extract from the ban an interesting contig, MA_42375, that contains a known mir, pab-miR947a. This miR is present in 11 clusters, and fails folding with status N7

when I run grep MA42375 full_spruce_fa.fai

it outputs: MA_42375_len=60015

when I run (having the proper .bai) samtools view -h Spruce1_size_17-28_tRNA_rRNA_Picea_clean.bam MA_42375_len=60015 > MA_42375.sam

it outputs: [main_samview] region "MA_42375_len=60015" specifies an unknown reference name. Continue anyway.

I do not know if this helps

Kind regards

############################

Dr. Rafael M. Viana

Uppsala Biocenter, Almas Allé 5 Department of Plant Biology / Växtbiologi Swedish University of Agricultural Sciences

75651 Uppsala Sweden

############################

From: Mike Axtell notifications@github.com Sent: Friday, September 4, 2015 4:49 PM To: MikeAxtell/ShortStack Cc: Rafael Muñoz Subject: Re: [ShortStack] RNA folding always fails at step7 (#25)

Thanks. So I am out ideas here. Can you send a bam file and the associated genome so I can test locally? My regular email is mja18@psu.edumailto:mja18@psu.edu

Reply to this email directly or view it on GitHubhttps://github.com/MikeAxtell/ShortStack/issues/25#issuecomment-137757295.

MikeAxtell commented 9 years ago

That's suspicious. samtools view is unable to use indexed lookup and that is a problem.

Can you paste what this looks like:

samtools view Spruce1_size_17-28_tRNA_rRNA_Picea_clean.bam | head

I am wondering whether bowtie didn't like your chromosome names, and modified them, or whether samtools view doesn't like them, and wont find them....

RafaelMViana commented 9 years ago

it looks like the contig names got truncated

HWI-ST344:303:C3NY3ACXX:8:1304:14575:73574 272 MA_20601 7005 255 17M * 0 0 * * XA:i:0 MD:Z:17 NM:i:0 XX:i:45 XY:Z:P XZ:f:0.002 RG:Z:Spruce1_size_17-28_tRNA_rRNA_Picea_clean HWI-ST344:303:C3NY3ACXX:8:2212:9199:93468 272 MA_20601 9376 255 24M * 0 0 * * XA:i:1 MD:Z:7G16 NM:i:1 XX:i:33 XY:Z:P XZ:f:0.029 RG:Z:Spruce1_size_17-28_tRNA_rRNA_Picea_clean HWI-ST344:303:C3NY3ACXX:8:2312:9349:59266 272 MA_20601 9664 255 24M * 0 0 * * XA:i:1 MD:Z:13C10 NM:i:1 XX:i:9 XY:Z:P XZ:f:0.091 RG:Z:Spruce1_size_17-28_tRNA_rRNA_Picea_clean HWI-ST344:303:C3NY3ACXX:8:1111:8518:63546 256 MA_20601 11403 255 17M * 0 0 * * XA:i:1 MD:Z:11T5 NM:i:1 XX:i:29 XY:Z:P XZ:f:0.024 RG:Z:Spruce1_size_17-28_tRNA_rRNA_Picea_clean HWI-ST344:303:C3NY3ACXX:8:2112:13431:39600 272 MA_20601 14862 255 17M * 0 0 * * XA:i:1 MD:Z:2A14 NM:i:1 XX:i:32 XY:Z:P XZ:f:0.03 RG:Z:Spruce1_size_17-28_tRNA_rRNA_Picea_clean HWI-ST344:303:C3NY3ACXX:8:1316:11250:11607 256 MA_20601 19149 255 24M * 0 0 * * XA:i:0 MD:Z:24 NM:i:0 XX:i:10 XY:Z:P XZ:f:0.083 RG:Z:Spruce1_size_17-28_tRNA_rRNA_Picea_clean HWI-ST344:303:C3NY3ACXX:8:2102:1802:75900 272 MA_20601 23262 255 17M * 0 0 * * XA:i:1 MD:Z:3T13 NM:i:1 XX:i:7 XY:Z:P XZ:f:0.125 RG:Z:Spruce1_size_17-28_tRNA_rRNA_Picea_clean HWI-ST344:303:C3NY3ACXX:8:2116:18452:25763 272 MA_20601 29583 255 17M * 0 0 * * XA:i:1 MD:Z:10T6 NM:i:1 XX:i:35 XY:Z:P XZ:f:0.009 RG:Z:Spruce1_size_17-28_tRNA_rRNA_Picea_clean HWI-ST344:303:C3NY3ACXX:8:1207:10687:16830 256 MA_20601 35679 255 20M * 0 0 * * XA:i:1 MD:Z:3A16 NM:i:1 XX:i:15 XY:Z:P XZ:f:0.06 RG:Z:Spruce1_size_17-28_tRNA_rRNA_Picea_clean HWI-ST344:303:C3NY3ACXX:8:2116:19224:48573 272 MA_20601 39333 255 21M * 0 0 * * XA:i:0 MD:Z:21 NM:i:0 XX:i:35 XY:Z:P XZ:f:0.027 RG:Z:Spruce1_size_17-28_tRNA_rRNA_Picea_clean

############################

Dr. Rafael M. Viana

Uppsala Biocenter, Almas Allé 5 Department of Plant Biology / Växtbiologi Swedish University of Agricultural Sciences

75651 Uppsala Sweden

############################

From: Mike Axtell notifications@github.com Sent: Friday, September 4, 2015 5:33 PM To: MikeAxtell/ShortStack Cc: Rafael Muñoz Subject: Re: [ShortStack] RNA folding always fails at step7 (#25)

That's suspicious. samtools view is unable to use indexed lookup and that is a problem.

Can you paste what this looks like:

samtools view Spruce1_size_17-28_tRNA_rRNA_Picea_clean.bam | head

I am wondering whether bowtie didn't like your chromosome names, and modified them, or whether samtools view doesn't like them, and wont find them....

Reply to this email directly or view it on GitHubhttps://github.com/MikeAxtell/ShortStack/issues/25#issuecomment-137769854.

MikeAxtell commented 9 years ago

So it's not clear to me why your contig names were truncated. I did a test where I used similar chromosome names as yours and ran a ShortStack analysis, as follows:

$ cat test_genome.fasta

MA_20601_len=208095 AGCTAGCATACGACCTACAGGACTACGGACCAG MA_20603_len=198197 AGCTCAACGATTCAGCATCAGCATCAGACGCACAGCACAGCATGTAC MA_159042_len=194057_putative_contaminant CGACTCAGCATCGGGGCGGCTATCATTTCTATCTTTA

cat test_reads.fasta

read1 GCATACGACCTACAGGACTA read2 CTCAGCATCGGGGCGGCTATCAT

$ ShortStack --readfile test_reads.fasta --genomefile test_genome.fasta --align_only

$ samtools view -h test_reads.bam @HD VN:1.0 SO:coordinate @SQ SN:MA_20601_len=208095 LN:33 @SQ SN:MA_20603_len=198197 LN:47 @SQ SN:MA_159042_len=194057_putative_contaminant LN:37 @RG ID:test_reads @PG ID:Bowtie VN:0.12.9 CL:"bowtie -f -v 1 -p 1 -S -a -m 50 --best --strata --sam-RG ID:test_reads test_genome -" @PG ID:ShortStack VN:3.3 CL:"--readfile test_reads.fasta --genomefile test_genome.fasta --align_only" read1 0 MA_20601_len=208095 6 255 20M * 0 0 GCATACGACCTACAGGACTA IIIIIIIIIIIIIIIIIIII XA:i:0 MD:Z:20 NM:i:RG:Z:test_reads XX:i:1 XY:Z:U XZ:f:1 read2 0 MA_159042_len=194057_putative_contaminant 4 255 23M * 0 0 CTCAGCATCGGGGCGGCTATCAT IIIIIIIIIIIIIIIIIIIIIII XA:i:0 MD:Z:23 NM:i:0 RG:Z:test_reads XX:i:1 XY:Z:U XZ:f:1

.. As you can see, the format of your contig names did not cause bowtie, samtools, or ShortStack to truncate the contig names on my system.

I'm using ShortStack 3.3, but I don't think your use of 3.0 would matter. Other versions:

bowtie and bowtie-build 0.12.9 (same as you) samtools 1.2 (same as you).

So I really am stuck here. I bet the truncated chromosome names in your bam file are the issue. I just don't know how they are being produced, since when I use similar names in testing here, I don't get that result.

MikeAxtell commented 9 years ago

I guess maybe you could try editing the names of your contigs (substitute = with _ perhaps) and re-run .. maybe there is some difference in our systems...

RafaelMViana commented 9 years ago

I have edited the contig names in the .fa, created new .fai and rerun with the same parameters as the previous run. I will tell you about the output when is finished.

Thanks a lot for your help

############################

Dr. Rafael M. Viana

Uppsala Biocenter, Almas Allé 5 Department of Plant Biology / Växtbiologi Swedish University of Agricultural Sciences

75651 Uppsala Sweden

############################

From: Mike Axtell notifications@github.com Sent: Friday, September 4, 2015 7:38 PM To: MikeAxtell/ShortStack Cc: Rafael Muñoz Subject: Re: [ShortStack] RNA folding always fails at step7 (#25)

I guess maybe you could try editing the names of your contigs (substitute = with _ perhaps) and re-run .. maybe there is some difference in our systems...

Reply to this email directly or view it on GitHubhttps://github.com/MikeAxtell/ShortStack/issues/25#issuecomment-137801698.

MikeAxtell commented 8 years ago

Hi Rafael, Just following up on this issue. Were you able to resolve it?

RafaelMViana commented 8 years ago

Dear Mike,

Thanks for your interest, I am currently analyzing ShortStack output yes, it worked and i got miRNAs This time the log looked like this N3 N5 N6 N8 N10 N11 N12 N13 N14 N15 Yes sample_1 47,836 1,824 2,674 19 5 16,064 2,377 1,649 414 1,199 42 sample_2 65,413 2,131 4,166 14 2 19,541 3,036 1,977 414 1,409 49 sample_3 52,175 2,025 5,514 15 6 15,997 2,600 1,694 343 1,194 47

most of the miRNAs were new. I checked my mapped reads vs mirbase to look for False negatives (FN) and got:

                FN    TP  new    total

sample_1 24 1 41 66

sample_2 27 2 47 76 sample_3 27 1 47 75

TP are known miRNAs detected by SStack FN are known miRNAs not detected by SStack new are unknown miRNAs detected by SStack

I have defined a cluster with at least 2 reads, and some new detected miRNAs come for clusters with 4,5,6 or lss than 10 reads in cluster

        reads in clusters detected as new miRNA sorted by min to max

sample_1 4 6 19 19 26 42 46 54 66 89 sample_2 16 17 18 24 30 59 61 75 77 90 sample_3 4 5 6 7 8 8 8 8 19 21

on the other hand I have FN with more than 1000 reads in the cluster

        Reads in clusters with FN miRNA sorted by min to max
                 2    3    4    5    6    7    8    9    10    >10    >100    >1000    total

sample_1 3 2 1 1 0 0 0 1 2 15 12 11 24 sample_2 3 5 1 1 0 0 0 1 0 18 14 11 27 sample_3 3 2 3 1 2 0 1 0 1 15 13 11 27

I am currently looking for homology of the new species in mir-base

I hope that this can add something for future software improvement Any comment is appreciated

Thanks again for your interest Kind regards

############################

Dr. Rafael M. Viana

Uppsala Biocenter, Almas Allé 5 Department of Plant Biology / Växtbiologi Swedish University of Agricultural Sciences

75651 Uppsala Sweden

############################

From: Mike Axtell notifications@github.com Sent: Monday, October 5, 2015 3:33 PM To: MikeAxtell/ShortStack Cc: Rafael Muñoz Subject: Re: [ShortStack] RNA folding always fails at step7 (#25)

Hi Rafael, Just following up on this issue. Were you able to resolve it?

Reply to this email directly or view it on GitHubhttps://github.com/MikeAxtell/ShortStack/issues/25#issuecomment-145528839.

MikeAxtell commented 8 years ago

Thanks.

I would be cautious about calling previously annotated false negatives in this analysis. Certainly some of them are, but others may have been poor annotations to begin with. anyway, it seems this issue is resolved. Thanks for your comments.

MikeAxtell / ShortStack

RNA folding always fails at step7 #25

program versions:

mapping output