MikeAxtell / ShortStack

ShortStack: Comprehensive annotation and quantification of small RNA genes
MIT License
88 stars 29 forks source link

How to deal with ShortStack Counts.txt files? #59

Closed DiegoZavallo closed 7 years ago

DiegoZavallo commented 7 years ago

Hi, I'm recently discovered ShortStack and I´m thrilled with the idea on how to deal with the multimapping reads. Until this point, I was using bowtie with -a option, but after reading your paper "Improved placement of multimapping sRNAs" I realize now that this option is even worst that -m1 o random. My original "pipeline" or "workflow" was: Bowtie (for mapping) HTSeq-count (for counting) and edgeR (for differential expression analysis, although I always thought that it should be call differential accumulation when we deal with sRNAs). When I ran ShortStack I used the Counts.txt files (after I parsed it) as count matrix file (such as the HTseq-count output) as input for the edgeR. Is that possible? How do you analyze differential accumulation between treatments?

I also try to use the bam file output from ShortStack directly with HTSeq-count but the matrix was different when comparing with the above method. I did this because I think that after ShortStack does the distribution of the multimapping reads they became as unique and the HTSeq-count will not discard them. Am I thinking correctly?

Can you give me any hint? Thanks in advance

Diego

MikeAxtell commented 7 years ago

Thanks Diego, I am glad you are finding ShortStack useful.

Yes, we regularly use the Counts.txt as input for differential expression analyses. This is what I had in mind when I made the Counts.txt output. We mostly use DESeq2, but have also used EdgeR in the past.

I don't know about using HTSeq-count directly from bam files; I've never used that method so can't comment there.

On Fri, Jun 16, 2017 at 12:47 PM, DiegoZvallo notifications@github.com wrote:

Hi, I'm recently discovered ShortStack and I´m thrilled with the idea on how to deal with the multimapping reads. Until this point, I was using bowtie with -a option, but after reading your paper "Improved placement of multimapping sRNAs" I realize now that this option is even worst that -m1 o random. My original "pipeline" or "workflow" was: Bowtie (for mapping) HTSeq-count (for counting) and edgeR (for differential expression analysis, although I always thought that it should be call differential accumulation when we deal with sRNAs). When I ran ShortStack I used the Counts.txt files (after I parsed it) as count matrix file (such as the HTseq-count output) as input for the edgeR. Is that possible? How do you analyze differential accumulation between treatments?

I also try to use the bam file output from ShortStack directly with HTSeq-count but the matrix was different when comparing with the above method. I did this because I think that after ShortStack does the distribution of the multimapping reads they became as unique and the HTSeq-count will not discard them. Am I thinking correctly?

Can you give me any hint? Thanks in advance

Diego

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/MikeAxtell/ShortStack/issues/59, or mute the thread https://github.com/notifications/unsubscribe-auth/AGiXiUoit710r8WXjo7UKXw32RG2V069ks5sErG0gaJpZM4N8sEP .

DiegoZavallo commented 7 years ago

Hi Mike, Thanks for your prompt response. Let me explain myself a bit better...

What I did was to parsed the Counts.txt file from ShortStack and overlap the coordinates from each Cluster to the coordinates from each gene in arabidopsis in order to get one file with every gene ID and their counts

For example, this two clusters correspond to one gene (AT3G43990 Chr3:15782538..15785184)

Chr3:15783137-15783412 Cluster_11925 18 23.9 3.0 18 23.9 3.0 Chr3:15784468-15784680 Cluster_11926 125 133.6 6.3 125 133.6 6.3

but when I look a tab delimited file that shows all the counts that mapped in bins from my bam file, the amount of the counts are higher than the clusters

bam_NI1_4dpi_ShortStack_vs_bowtie.tab bam file compare

'chr' 'start' 'end' '4-NI1_4dpi_ShortStack.sort.bam' '8-NI1_4dpi.sort.bam'

Chr3 15782556 15782656 0.0 0.0 Chr3 15782656 15782756 0.0 0.0 Chr3 15782756 15782856 0.0 0.0 Chr3 15782856 15782956 0.0 0.0 Chr3 15782956 15783056 0.0 0.0 Chr3 15783056 15783156 1.0 1.0 Chr3 15783156 15783256 30.0 30.0 Chr3 15783256 15783356 41.0 41.0 Chr3 15783356 15783456 109.0 109.0 Chr3 15783456 15783556 6.0 6.0 Chr3 15783556 15783656 5.0 5.0 Chr3 15783656 15783756 0.0 0.0 Chr3 15783756 15783856 0.0 0.0 Chr3 15783856 15783956 0.0 0.0 Chr3 15783956 15784056 0.0 0.0 Chr3 15784056 15784156 0.0 0.0 Chr3 15784156 15784256 0.0 0.0 Chr3 15784256 15784356 0.0 0.0 Chr3 15784356 15784456 0.0 0.0 Chr3 15784456 15784556 122.0 122.0 Chr3 15784556 15784656 140.0 30033.0 Chr3 15784656 15784756 1.0 10268.0 Chr3 15784756 15784856 0.0 0.0 Chr3 15784856 15784956 0.0 0.0 Chr3 15784956 15785056 0.0 0.0 Chr3 15785056 15785156 0.0 0.0 Chr3 15785156 15785256 0.0 0.0

Pay attention only to the first column that correspond to the ShortStack mapping (the other column correspond to bowtie -a mapping). The bold coordinates correspond to the clusters, but the numbers don't match. Why do you think is that? Maybe it has to do with the settings --pad or --mincov? Are they raw counts in Counts.txt file? Do you think it's ok to grouped all the clusters to match with a gene in the differential expression analyses?

Thanks

Diego

MikeAxtell commented 7 years ago

Hello again Diego,

I'm sorry but I don't understand the data you are displaying. It does not look like output from ShortStack.

In any event, if you are interested in counting all sRNA alignments from pre-defined genomic intervals, I would use ShortStack's ''locifile" option to input a list of pre-defined intervals. Then you don't have to cobble together the de-novo discovered clusters, which may not contain all alignments to the loci of interest.

If you want to double check counts, I would suggest samtools view with the -c option.

Another thing to note is that ShortStack (like samtools view) counts a read as in the cluster if there is even minimal overlap (e.g. just one nt overlap and the read is in).

Best Wishes, Mike

On Mon, Jun 19, 2017 at 1:04 PM, DiegoZvallo notifications@github.com wrote:

Hi Mike, Thanks for your prompt response. Let me explain myself a bit better...

What I did was to parsed the Counts.txt file from ShortStack and overlap the coordinates from each Cluster to the coordinates from each gene in arabidopsis in order to get one file with every gene ID and their counts

For example, this two clusters correspond to one gene (AT3G43990 Chr3:15782538..15785184)

Chr3:15783137-15783412 Cluster_11925 18 23.9 3.0 18 23.9 3.0 Chr3:15784468-15784680 Cluster_11926 125 133.6 6.3 125 133.6 6.3

but when I look a tab delimited file that shows all the counts that mapped in bins from my bam file, the amount of the counts are higher than the clusters

bam_NI1_4dpi_ShortStack_vs_bowtie.tab bam file compare

'chr' 'start' 'end' '4-NI1_4dpi_ShortStack.sort.bam'

'8-NI1_4dpi.sort.bam' Chr3 15782556 15782656 0.0 0.0 Chr3 15782656 15782756 0.0 0.0 Chr3 15782756 15782856 0.0 0.0 Chr3 15782856 15782956 0.0 0.0 Chr3 15782956 15783056 0.0 0.0 Chr3 15783056 15783156 1.0 1.0

Chr3 15783156 15783256 30.0 30.0 Chr3 15783256 15783356 41.0 41.0 Chr3 15783356 15783456 109.0 109.0 Chr3 15783456 15783556 6.0 6.0 Chr3 15783556 15783656 5.0 5.0 Chr3 15783656 15783756 0.0 0.0 Chr3 15783756 15783856 0.0 0.0 Chr3 15783856 15783956 0.0 0.0 Chr3 15783956 15784056 0.0 0.0 Chr3 15784056 15784156 0.0 0.0 Chr3 15784156 15784256 0.0 0.0 Chr3 15784256 15784356 0.0 0.0 Chr3 15784356 15784456 0.0 0.0

Chr3 15784456 15784556 122.0 122.0 Chr3 15784556 15784656 140.0 30033.0 Chr3 15784656 15784756 1.0 10268.0 Chr3 15784756 15784856 0.0 0.0 Chr3 15784856 15784956 0.0 0.0 Chr3 15784956 15785056 0.0 0.0 Chr3 15785056 15785156 0.0 0.0 Chr3 15785156 15785256 0.0 0.0

Pay attention only to the first column that correspond to the ShortStack mapping (the other column correspond to bowtie -a mapping). The bold coordinates correspond to the clusters, but the numbers don't match. Why do you think is that? Maybe it has to do with the settings --pad or --mincov? Are they raw counts in Counts.txt file? Do you think it's ok to grouped all the clusters to match with a gene in the differential expression analyses?

Thanks

Diego

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/MikeAxtell/ShortStack/issues/59#issuecomment-309503664, or mute the thread https://github.com/notifications/unsubscribe-auth/AGiXiUc2N3QcZrPW0dzGLijRSNZKNEZtks5sFqopgaJpZM4N8sEP .

DiegoZavallo commented 7 years ago

Hi Mike, I totally miss the --locifile option! I did the file as required and re-run ShortStack with this option but I still don't understand the counts differences between the bam file and the Counts.txt file. Let me show you my steps, maybe I'm missing something This is my shorstack code: ShortStack --readfile 3-NI1_4dpi.fq --genomefile TAIR10_all.fasta --bowtie_cores 2 --mismatches 0 --mmap u --nohp --locifile Araport11_genes.tab --outdir NI1_4dpi And this is a fragment of the Counts.txt file, pay attention to the bold ID, the AT3G43990 gene: .. Chr3:15774052-15775873 AT3G43960.1 6 6.0 0.0 6 6.0 0.0 Chr3:15777407-15777996 AT3G43970.1 0 NA NA 0 NA NA Chr3:15777407-15778166 AT3G43970.2 0 NA NA 0 NA NA Chr3:15778274-15779360 AT3G43980.1 2 2.0 0.0 2 2.0 0.0 Chr3:15782537-15785184 AT3G43990.1 156 157.0 8.3 156 157.0 8.3 Chr3:15796132-15796837 AT3G44006.1 111 111.0 0.0 111 111.0 0.0 Chr3:15799480-15801227 AT3G44010.1 13 13.0 0.0 13 13.0 0.0 Chr3:15799859-15801200 AT3G44010.2 11 11.0 0.0 11 11.0 0.0 Chr3:15801371-15802855 AT3G44020.1 3 3.0 0.0 3 3.0 0.0 ..

As you can see, ShortStack count 156 reads within this gene.

But when I ran samtools view on that particularly gene, it output was of 434 reads:

samtools view 4-NI1_4dpi_ShortStack.sort.bam "Chr3:15782537-15785184" > at3g43990.txt 70 uniques = XY:Z:U 364 multimapped with -u option = XY:Z:P

Here is the file: at3g43990.txt

This is consistent when I ran bowtie -m1 option and the counts in that gene are 70. Moreover, when I use HTSeq-count with the bam file from ShorStack the counts corresponding to that gene are 434, same as the bam file.

Do you see an explanation? How ShortStack count the reads within a pre-defined interval?

I'm just trying to understand, because I really want use ShortStack in my pipelines from now on

Thanks, and sorry to boder you again

Best Wishes Diego

MikeAxtell commented 7 years ago

Ah, I see.

Some versions of ShortStack's aligner output both primary alignments (the 'chosen' location for a multi-mapped read) and secondary alignments (the 'other' possible position(s) for the read that were NOT chosen). Secondary alignments are marked per the SAM spec with bit 256 in the FLAG field. ShortStack quantifies only primary alignments, not secondaries. If you run 'samtools view -F 256 -c 4-NI1_4dpi_ShortStack.sort.bam Chr3:15782537-15785184' you'll get a count of just the primary alignments. In your case, the discrepancies are from counting secondary alignments.

As of ShortStack release 3.5, reporting of secondary alignments in BAM files was discontinued. While some people liked having all of the secondary alignments stored (it allows one to go back and check on the 'decisions' about placements of multi-mapped reads), it does increase file size by quite a lot, and can also lead to confusion if folks aren't aware of the need to filter them out when using other tools that operate on BAM files.

If you upgrade to the latest ShortStack release (currently 3.8.3), the alignments it produces won't have the secondary alignments stored by default.

And if you want to process your existing BAM files to remove secondaries, you can do:

samtools view -F 256 -b [with_secondaries.bam] > [no_secondaries.bam]

Hope this helps.

On Fri, Jun 23, 2017 at 2:40 PM, DiegoZvallo notifications@github.com wrote:

Hi Mike, I totally miss the --locifile option! I did the file as required and re-run ShortStack with this option but I still don't understand the counts differences between the bam file and the Counts.txt file. Let me show you my steps, maybe I'm missing something This is my shorstack code: ShortStack --readfile 3-NI1_4dpi.fq --genomefile TAIR10_all.fasta --bowtie_cores 2 --mismatches 0 --mmap u --nohp --locifile Araport11_genes.tab --outdir NI1_4dpi And this is a fragment of the Counts.txt file, pay attention to the bold ID, the AT3G43990 gene: .. Chr3:15774052-15775873 AT3G43960.1 6 6.0 0.0 6 6.0 0.0 Chr3:15777407-15777996 AT3G43970.1 0 NA NA 0 NA NA Chr3:15777407-15778166 AT3G43970.2 0 NA NA 0 NA NA Chr3:15778274-15779360 AT3G43980.1 2 2.0 0.0 2 2.0 0.0 Chr3:15782537-15785184 AT3G43990.1 156 157.0 8.3 156 157.0 8.3 Chr3:15796132-15796837 AT3G44006.1 111 111.0 0.0 111 111.0 0.0 Chr3:15799480-15801227 AT3G44010.1 13 13.0 0.0 13 13.0 0.0 Chr3:15799859-15801200 AT3G44010.2 11 11.0 0.0 11 11.0 0.0 Chr3:15801371-15802855 AT3G44020.1 3 3.0 0.0 3 3.0 0.0 ..

As you can see, ShortStack count 156 reads within this gene.

But when I ran samtools view on that particularly gene, it output was of 434 reads:

samtools view 4-NI1_4dpi_ShortStack.sort.bam "Chr3:15782537-15785184" > at3g43990.txt 70 uniques = XY:Z:U 364 multimapped with -u option = XY:Z:P

Here is the file: at3g43990.txt

This is consistent when I ran bowtie -m1 option and the counts in that gene are 70. Moreover, when I use HTSeq-count with the bam file from ShorStack the counts corresponding to that gene are 434, same as the bam file.

Do you see an explanation? How ShortStack count the reads within a pre-defined interval?

I'm just trying to understand, because I really want use ShortStack in my pipelines from now on

Thanks, and sorry to boder you again

Best Wishes Diego

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub, or mute the thread.

DiegoZavallo commented 7 years ago

Ohhh! I have 3.4 version!! I'll update to the lastest. Thank you very match, that's great!

Diego