Explanation about JACUSA output

dieterich-lab / JACUSA

JAVA framework for accurate SNV assessment

GNU General Public License v3.0

Explanation about JACUSA output #2

Closed u9090 closed 7 years ago

u9090 commented 7 years ago

Hi,

I am doing RDDs with JACUSA (working great !)

My test statistic scores range from 0.001 to 300. Is this score meaningful when working without replicates? What would be a decent/acceptable minimum (10, 100, 200)?
In the manual you mention 'base IJ columns indicate inverted base count if on negative strand'. In this case, is the vector (A,C,G,T) inverted for the RNA sample (FR-FIRSTSTRAND) on the minus strand as (T,G,C,A)? Is the following example correctly interpreted for the minus strand (does '115' correspond to C or G)?
```
stat    strand  bases11         bases21             DNA - A DNA - C DNA - G DNA - T RNA - A RNA - C RNA - G RNA - T
175.09  +   0,615,0,0   0,0,0,109   =>  0   615 0   0   0   0   0   109
287.89  -   399,0,0,0   0,0,115,0   =>  399 0   0   0   0   115 0   0
```
Regarding the VCF output, is the reported ALT base the one with the highest number of reads in sample 2 (after the REF base)?

Thanks !

piechottam commented 7 years ago

Hi,

> I am doing RDDs with JACUSA (working great !)

Okay, I assume your JACUSA call is something like this:

java -jar jacusa.jar call-2 -a H:1 DNA.bam RNA.bam

> My test statistic scores range from 0.001 to 300. Is this score meaningful when working without replicates? What would be a decent/acceptable minimum (10, 100, 200)?

We tested JACUSA in the RDD scenario with replicates and WITHOUT replicates.

In your case you can add -T 1.15 to your jacusa call. This is an empirically derived threshold for the RDD scenario without any replicates.
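The same cutoff can also be applied to an existing result file after the fact. The following is a minimal Python sketch, assuming a tab-separated file with the test statistic in the fifth column. The column layout, the helper name `filter_by_stat`, and the second sample row are illustrative assumptions, not taken from the JACUSA manual, and this thread does not state whether `-T` itself treats the bound as inclusive.

```python
import csv
import io

THRESHOLD = 1.15  # empirically derived value quoted above (RDD, no replicates)


def filter_by_stat(lines, threshold=THRESHOLD, stat_column=4):
    """Yield rows whose test statistic is at least `threshold`.

    ASSUMPTION: rows are tab-separated and the statistic sits at 0-based
    column index 4; adjust `stat_column` to match your actual output.
    """
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and header/comment lines
        if float(row[stat_column]) >= threshold:
            yield row


# Hypothetical two-site example; the first row reuses values from the question.
example = io.StringIO(
    "#contig\tstart\tend\tname\tstat\tstrand\tbases11\tbases21\n"
    "chr1\t100\t101\tvariant\t175.09\t+\t0,615,0,0\t0,0,0,109\n"
    "chr1\t200\t201\tvariant\t0.001\t+\t10,0,0,0\t9,1,0,0\n"
)
kept = list(filter_by_stat(example))
print(kept)
```

Passing an open file handle instead of the `io.StringIO` object should work the same way, since `csv.reader` only needs an iterable of lines.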

If you like, check the supplement (3.5 Derived threshold in [...]): https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1432-8

> In the manual you mention 'base IJ columns indicate inverted base count if on negative strand'. In this case, is the vector (A,C,G,T) inverted for the RNA sample on the minus strand as (T,G,C,A)? Is the following example correctly interpreted for the minus strand (does '115' correspond to C or G)?

This depends on your library and on the employed JACUSA version:

Check 3.1.1 Strand information in the updated manual https://github.com/dieterich-lab/JACUSA/blob/master/manual/manual.pdf

Prior to JACUSA 1.2:

Your JACUSA call would have an option such as -P U,S => unstranded DNA and stranded RNA.

If you have a lib where the first read is sequenced -> you have to invert the base counts (you can use JacusaHelper for that).

If you have a lib where the second read is sequenced -> you don't have to do anything.
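To make the inversion concrete, here is a minimal Python sketch. It assumes that "inverting" means mapping each count to its complementary base (A<->T, C<->G), which for counts ordered A,C,G,T is the same as reversing the vector. The function names are hypothetical, and this is not the JacusaHelper implementation.

```python
def parse_counts(field):
    """Turn a field such as '0,0,115,0' into a list of integers."""
    return [int(x) for x in field.split(",")]


def invert_base_counts(counts):
    """Map counts ordered (A, C, G, T) onto the complementary strand.

    ASSUMPTION: inversion swaps A<->T and C<->G, i.e. it reverses the list.
    """
    if len(counts) != 4:
        raise ValueError("expected four counts ordered A,C,G,T")
    return list(reversed(counts))


# RNA counts of the minus-strand site from the question above.
rna_minus = parse_counts("0,0,115,0")
print(invert_base_counts(rna_minus))  # [0, 115, 0, 0]
```

Under this assumption the 115 reads move from the G position to the C position once the minus-strand counts are inverted.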

As of version 1.2, JACUSA supports stranded paired end reads.

The format of -P has changed. The options are borrowed from tophat: FR-FIRSTSTRAND, FR-SECONDSTRAND and UNSTRANDED.

Your option could be -P UNSTRANDED,FR-FIRSTSTRAND if that corresponds to your libs. The base counts will be correctly inverted corresponding to library and strand.

Hope that helps,

Best,

Michael

u9090 commented 7 years ago

Thanks Michael !

So, if one uses the latest JACUSA version (1.2.0) and specifies the strandedness of the library with the -P flag, the default output will show the properly ordered base counts (A,C,G,T for both DNA and RNA samples). My stranded RNA library prep (KAPA) incorporates dUTP into the second cDNA strand, so I assume I should use -P UNSTRANDED,FR-FIRSTSTRAND.

If one outputs in VCF format (+ strand reported by default), there is no need for the -P flag, and the base counts will also be ordered A,C,G,T for both samples.

My command:

${java_1.7.0} -jar JACUSA_v1.2.0.jar call-2 \
 --pileup-filter H:1 \
 --bed exome_interval_merged.bed \
 --result-file Sample1_JACUSA_results.vcf \
 --output-format V \
 Sample1_DNA.bam \
 Sample1_RNA.bam

Also, what is the argument JACUSA expects for the -R,--show-ref <SHOW-REF> flag? It does not seem to work for me.

piechottam commented 7 years ago

Hi,

2017-04-12 19:33 GMT+02:00 u9090 notifications@github.com:

> So, if one uses the latest JACUSA version (1.2.0) and specifies the strandedness of the library with the -P flag, the default output will show the properly ordered base counts (A,C,G,T for both DNA and RNA samples).

Yes!

> If one outputs in VCF format (unstranded), there is no need for the -P flag, and the base counts will also be ordered A,C,G,T for both samples.

Yes! The output is unstranded and the base counts will be ordered a,c,g,t. The VCF file header will contain the following:

FORMAT=

> My command:
>
> ${java_1.7.0} -jar JACUSA_v1.2.0.jar call-2 \
>  --pileup-filter H:1 \
>  --bed exome_interval_merged.bed \
>  --result-file Sample1_JACUSA_results.vcf \
>  --output-format V \
>  Sample1_DNA.bam \
>  Sample1_RNA.bam

I would add

--pileup-filter H:1,D,Y

This will mark possible artefacts and put them into .filtered:

H: sample 1 (your DNA) is not homozygous at this specific site
D: filter variants in the vicinity of read start/end, intron, and INDEL positions
Y: filter variant calls in the vicinity of homopolymers

> My stranded RNA library prep (KAPA) incorporates dUTP into the second cDNA strand, so I assume I should use -P UNSTRANDED,FR-FIRSTSTRAND.

I don't know KAPA - sorry. But your -P looks good to me according to... I quote from: http://ccb.jhu.edu/software/tophat/manual.shtml

Library type: fr-unstranded
Examples: Standard Illumina
Description: Reads from the left-most end of the fragment (in transcript coordinates) map to the transcript strand, and the right-most end maps to the opposite strand.

Library type: fr-firststrand
Examples: dUTP, NSR, NNSR
Description: Same as above except we enforce the rule that the right-most end of the fragment (in transcript coordinates) is the first sequenced (or only sequenced for single-end reads). Equivalently, it is assumed that only the strand generated during first strand synthesis is sequenced.

Library type: fr-secondstrand
Examples: Ligation, Standard SOLiD
Description: Same as above except we enforce the rule that the left-most end of the fragment (in transcript coordinates) is the first sequenced (or only sequenced for single-end reads). Equivalently, it is assumed that only the strand generated during second strand synthesis is sequenced.

Best, Michael

bostanict commented 3 years ago

Hi,

I have a simple question: when the DNA and RNA samples are both unstranded, does JACUSA automatically invert the bases for those RNA reads that map to the reverse strand? So if one site with forward-mapping reads has C=>T and another site with reverse-mapping reads has T=>C, does JACUSA automatically report both as C=>T, since the second one is on the reverse strand, or not? If not, how can I check in the output which read strand a site was called on? This is a very important issue.

Thanks in advance

piechottam commented 3 years ago

If both (DNA and RNA: "-P UNSTRANDED,UNSTRANDED") are unstranded, JACUSA will:

not change orientation,
not adjust base columns and
set the strand column to "."

Run your analysis with "-P UNSTRANDED,FR-FIRSTSTRAND"; then JACUSA:

WILL invert bases according to library type for DNA and RNA -> strand column will be "+" or "-".

MAKE sure to choose the correct library type for your RNA sample:

FR-FIRSTSTRAND or
FR-SECONDSTRAND.

Best, Michael

bostanict commented 3 years ago

Hi Michael,

Thanks for the response. The issue is that our RNA sample is unstranded. So are you recommending that even if the RNA sample is unstranded, we still use "-P UNSTRANDED,FR-FIRSTSTRAND" and JACUSA will report correctly?

Thank you

piechottam commented 3 years ago

Hi,

If your RNA lib is unstranded, then use -P UNSTRANDED,UNSTRANDED and derive the correct orientation from an annotation file (GTF, GFF, BED) with bedtools.

Best, Michael

bostanict commented 3 years ago

Hi Michael,

Thanks a lot for the reply. So here are two other questions:

1- If a site has a pileup in which some reads are mapped to the forward strand and some to the reverse strand, how does JACUSA call the site and count the coverage? Does it report the total count based on both strands, consider only the strand with the highest coverage, or do we need to use a strand bias option, etc.? Any clarification on this is appreciated.

2- Sorry if this is too basic a question, I am new to this. When you say to use bedtools, do you mean checking the overlapping gene and its strand to decide, or using bedtools to extract the reads of that region and checking their alignment strand? The first case (using genes) still does not answer the question of the called site's strand. If you mean the second one, could you please clarify how I can do this if the situation in question 1 occurs and there are reads mapped to both strands?

Thank you very very much in advance,

Hamed

piechottam commented 3 years ago

Hi Hamed,

(sorry for the delayed reply)

> 1- If a site has a pileup in which some reads are mapped to the forward strand and some to the reverse strand, how does JACUSA call the site and count the coverage? Does it report the total count based on both strands, consider only the strand with the highest coverage, or do we need to use a strand bias option, etc.? Any clarification on this is appreciated.

It depends on the library type (CLI option: "-P"):

UNSTRANDED -> sum of forward and reverse reported (strand column: ".")
FR-FIRSTSTRAND or FR-SECONDSTRAND -> coverage for forward AND reverse reported separately (strand column: "+" or "-")

You would have to implement a strand bias option on your own.
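As a rough illustration of the UNSTRANDED case and of what a home-made strand bias check might start from, here is a small Python sketch. The per-strand count vectors are hypothetical numbers, the function names are made up for this example, and the per-strand counts would have to come from your own pileup, since JACUSA reports only the sum in this mode.

```python
def combine_strands(forward, reverse):
    """Element-wise sum of two (A, C, G, T) count vectors."""
    return [f + r for f, r in zip(forward, reverse)]


def forward_fraction(forward, reverse, base_index):
    """Fraction of reads supporting one base that are forward-mapped.

    Returns None when no read supports that base.
    """
    total = forward[base_index] + reverse[base_index]
    return forward[base_index] / total if total else None


# Hypothetical counts ordered A,C,G,T for forward- and reverse-mapped reads.
fwd = [0, 40, 0, 12]
rev = [0, 35, 0, 1]

print(combine_strands(fwd, rev))      # [0, 75, 0, 13]
print(forward_fraction(fwd, rev, 3))  # share of T reads on forward-mapped reads
```

A fraction close to 0 or 1 for the variant base would suggest that the support comes almost entirely from one mapping orientation. Where to set a cutoff, or whether to use a formal test instead, is left open here.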

> 2- Sorry if this is too basic a question, I am new to this. When you say to use bedtools, do you mean checking the overlapping gene and its strand to decide, or using bedtools to extract the reads of that region and checking their alignment strand? The first case (using genes) still does not answer the question of the called site's strand. If you mean the second one, could you please clarify how I can do this if the situation in question 1 occurs and there are reads mapped to both strands?

If your RNA library is unstranded, I am not aware of any method to answer the question of the called site's strand.

Check: https://open.oregonstate.education/app/uploads/sites/69/2019/10/libraryTypes-1.jpg

The top part of the image has a transcript on the "+" strand. The bottom part represents a schematic view of sequenced reads (fr-unstranded). Reads are mapped to both strands, "+" and "-". From this data you cannot deduce the original strand of the transcript. You need a stranded library; compare the image for fr-firststrand or fr-secondstrand.

You can use bedtools to infer the orientation of the reads (pseudocode).

Files:

JACUSA.out (let's say it has 12 columns)
Annotation.bed/gtf/gff (let's say it has 12 columns and the strand is column 6)

Steps:

bedtools intersect -a JACUSA.out -b Annotation.bed/gtf/gff -wao > merged.txt
Use your favorite tool to extract columns 1-12 from merged.txt and column 12+6 ("gene orientation"). Your output will look like the original JACUSA.out with the gene orientation added.
Load R or Python and invert the bases when the gene orientation is "-".
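The last two steps could look roughly like this in Python. This is a sketch under the same "let's say" assumptions as above (12 JACUSA columns, annotation strand in column 6 of the second file). In addition, the 0-based positions of the base-count columns (`BASE_COLS`), the helper names, and the sample row are hypothetical, and inversion is taken to mean reversing the A,C,G,T counts.

```python
import csv
import io

N_JACUSA_COLS = 12             # "let's say it has 12 columns"
GENE_STRAND_COL = 12 + 6 - 1   # 0-based index of the annotation strand
BASE_COLS = (6, 7)             # ASSUMPTION: 0-based base-count columns


def invert(field):
    """Reverse a comma-separated A,C,G,T count field (A<->T, C<->G)."""
    return ",".join(reversed(field.split(",")))


def annotate(lines):
    """Yield JACUSA columns plus gene orientation, inverted on '-' genes."""
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        site = row[:N_JACUSA_COLS]
        gene_strand = row[GENE_STRAND_COL]
        if gene_strand == "-":
            for i in BASE_COLS:
                site[i] = invert(site[i])
        yield site + [gene_strand]


# Hypothetical merged.txt row: 12 site columns, 12 annotation columns, overlap.
site = ["chr1", "100", "101", "variant", "287.89", ".",
        "399,0,0,0", "0,0,115,0", "*", "*", "*", "*"]
gene = ["chr1", "50", "500", "geneX", "0", "-",
        "50", "500", "0", "1", "450", "0"]
merged = io.StringIO("\t".join(site + gene + ["1"]) + "\n")

result = list(annotate(merged))
print(result[0][6], result[0][7], result[0][-1])  # 0,0,0,399 0,115,0,0 -
```

With `-wao`, sites without an overlapping feature are kept and get "." in the annotation columns, so they pass through unchanged here. A site overlapping several features appears once per feature and may need to be resolved separately.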

Hope that helps.