bartongroup / RATS

Relative Abundance of Transcripts: An R package for the detection of Differential Transcript isoform Usage.
MIT License
32 stars 1 forks source link

RATS - general questions, and how to account for sample effects? #11

Closed miyakokodama closed 7 years ago

miyakokodama commented 7 years ago

Hi,

First of all, thanks so much for developing this wonderful tool! It seems to do exactly what I need to do, I like how easy it is to visualize the results, and I am very happy with my results so far.

That being said, I am very new to RATS – I have gone through the tutorial and am hoping if you could help me understand how this package works:

My data: I have two conditions (2 cell types – phloem and xylem), along with their biological sample ID. (heads up – I have very strong sample effects in my data). I have performed isoform quantification using Kallisto.

This is how my “sample-to-condition table (s2c)” looks like.

s2c

Sample_ID Sample_Number Cell_Type 1P 1 Phloem 1X 1 Xylem 2P 2 Phloem 2X 2 Xylem

Using Sleuth, I made a sleuth object.

my_sleuth_object_1 <- sleuth_prep(s2c, ~ Cell_Type )

Then, I used call_DTU fuction to call DTU.

mydtu_1 <- call_DTU(annot=my_identifiers_table, slo=my_sleuth_object_1, name_A=" Phloem", name_B="Xylem", varname="Cell_Type")

QUESTION 1:

I have strong sample effects in my data set, so I’d like to account for these effects when calling for DTU. On the tutorial, it says “Please note that currently only one covariate is used in each run. therefore, any comparison you may wish to make must be encoded as a single column in myslo$sample_to_covariates, by modifying the table if necessary.” Does this mean,

my_sleuth_object_1 <- sleuth_prep(s2c, ~ Cell_Type)

mydtu_1 <- call_DTU(annot=my_identifiers_table, slo=my_sleuth_object_1, name_A="Phloem", name_B="Xylem", varname="Cell_Type")

my_sleuth_object_2 <- sleuth_prep(s2c, ~ Cell_Type + Sample_Number)

mydtu_2 <- call_DTU(annot=my_identifiers_table, slo=my_sleuth_object_2, name_A="Phloem", name_B="Xylem", varname="Cell_Type")

These two would be the same? I run both and compared – the results are different, and mydtu_1 has ~30 more DTU genes than mydtu_2.

QUESTION 2:

Assuming that I can only account for one covariate, how do I account for sample effects in my data? According to the tutorial under “Comparing by different variables (such as batch effects)”, do I perform:

my_sleuth_object_1 <- sleuth_prep(s2c, ~ Cell_Type)

mydtu_1 <- call_DTU(annot=my_identifiers_table, slo=my_sleuth_object_1, name_A="Phloem", name_B="Xylem", varname="Cell_Type")

my_sleuth_object_2 <- sleuth_prep(s2c, ~ Sample_Number)

mydtu_2 <- call_DTU(annot=my_identifiers_table, slo=my_sleuth_object_2, name_A="1", name_B="2", varname=" Sample_Number ")

And then compare the results between mydtu_1 and mydtu_2?

QUESTION 3:

Finally, I have another data set, which looks like this:

Sample_ID Sample_Number Cell_Type 3P 3 Phloem 3X 3 Xylem 4P 4 Phloem 4X 4 Xylem 5P 5 Phloem 5X 5 Xylem 6P 6 Phloem 6X 6 Xylem

Can RATS account for sample effects in the data like this, as there are more than two variables in the Sample_Number (4 samples in total)?

I would appreciate any help you could provide, and many thanks for reading this.

Miyako

fruce-ki commented 7 years ago

Hello Miyako,

Thank you for taking an interest in Rats!

First of all, the DTE results by Sleuth are not used in Rats at all. So the parameters with which you run Sleuth are completely irrelevant. You don't need to re-run Sleuth to change how you run Rats. Rats only needs the bootstrapped Kallisto quantifications and the sample grouping (essentially Sleuth's input), both of which are conveniently already stored in the Sleuth output object.

Secondly, Sleuth (and Rats) use the term "sample number" to refer to row number in the sample_to_covariate (s2c) table. Each row is a separate sequencing sample (replicate/run/lane). In your examples you have a covariate column called Sample_Number. This should be fine for the tool but makes things confusing for us humans, since the same name refers to two different things. I'm guessing that the column refers to the biological specimen where the RNA came from rather than the sequencing sample?

Now on to the question(s).

Rats does a straight-forward comparison of any two categories from any single column of that table. If a column has more than 2 categories in it, only two of the categories can be compared each time. This is because Rats does not search for batch effects or hidden/unknown covariates. Since you know you have batch effects in your data, you can add them to your s2c as covariate columns.

Here are some examples that hopefully demonstrate how to add known effects and how to combine covariates.

Let's say you have the following s2c:

    Sample_ID  Specimen  Tissue
1  1P           A          Phloem
2  1X           A           Xylem
3  2P           B          Phloem
4  2X           B           Xylem
5  3P           C           Phloem
6  3X           C           Xylem

To compare all Phloem against all Xylem you'd normally run:

call_DTU(annot=my_identifiers_table, slo=my_sleuth_object, name_A="Xylem", name_B="Phloem", varname=" Tissue")

Now let's say you have a batch effect. For example samples 1P, 2P, 2X and 3X were prepared by a different person. You need a new s2c column for that:

    Sample_ID  Specimen  Tissue     Person
1  1P           A          Phloem      Joe
2  1X           A           Xylem      Ben
3  2P           B          Phloem      Joe
4  2X           B           Xylem      Joe
5  3P           C           Phloem     Ben
6  3X           C           Xylem      Joe

You can compare all of Joe's samples against all of Ben's:

call_DTU(annot=my_identifiers_table, slo=my_sleuth_object, name_A="Joe", name_B="Ben", varname="Person")

To combine covariates you create a new one. For example to compare phloem prepared by Joe (1P and 2P) against phloem prepared by Ben (3P):

    Sample_ID  Specimen  Tissue     Person  Tiss_Per
1  1P           A          Phloem      Joe    PJ
2  1X           A           Xylem      Ben    XB
3  2P           B          Phloem      Joe    PJ
4  2X           B           Xylem      Joe    XJ
5  3P           C           Phloem     Ben    PB
6  3X           C           Xylem      Joe    XJ
call_DTU(annot=my_identifiers_table, slo=my_sleuth_object, name_A="PJ", name_B="PB", varname="Tiss_Per")

I hope this helps!

-Kimon

fruce-ki commented 7 years ago

Regarding the reason you got different number of DTU genes in Question 1, this is probably down to statistical noise. As I said, Rats only uses the input data of Sleuth, so both runs should be using exactly the same data, even if the Sleuth models are different. What happens is that Rats randomly samples the kallisto quantification bootstraps, in order to asses how uncertainty in the quantifications affects the DTU calls. The random sampling is different in each run, meaning that some genes near the Rats confidence threshold can end up on different sides in different runs. I suspect that if you look at the boot_dtu_freq column in the Genes table for these 30 genes, this is what's happening. In fact, the total number of genes affected may be greater than 30, 30 is just the net difference between genes gained and genes lost.

A way to reduce the number of genes affected by this is to run more bootstrap iterations for Rats (and ideally also for Kallisto).

miyakokodama commented 7 years ago

Hi Kimon,

Thanks so much for getting back to me so quickly - your explanations are really helpful! According to your second comment, I definitely plan on running more bootstrap iterations (I used -b 200 for Kallisto, and 100 for RATS).

If I understood you correctly, let’s say I have the following s2cobject:

   Sample_ID    Specimen   Tissue   Specimen_Tissue_Combined   
1  1P           A          Phloem    A_Phloem   
2  1X           A           Xylem    A_Xylem
3  2P           B          Phloem    B_Phloem   
4  2X           B           Xylem    B_Xylem   

And if I want to account for the sample effects, do I have to take the following steps:


my_sleuth_object_1 <- sleuth_prep(s2c, ~ Cell_Type)

# Phloem vs xylem comparison
mydtu_1 <- call_DTU(annot=my_identifiers_table, slo=my_sleuth_object_1, name_A="Phloem", name_B="Xylem", varname="Cell_Type")

# A_Phloem vs B_Phloem comparison
mydtu_2 <- call_DTU(annot=my_identifiers_table, slo=my_sleuth_object_1, name_A="A_Phloem", name_B="B_Phloem", varname="Specimen_Tissue_Combined")

# A_Xylem vs B_Xylem comparison
mydtu_3 <- call_DTU(annot=my_identifiers_table, slo=my_sleuth_object_1, name_A="A_Xylem", name_B="B_Xylem", varname="Specimen_Tissue_Combined")

Then take DTUs that are unique to mydtu_1, by comparing the results from mydtu_1, mydut_2 and mydtu_3?

Also, if I have an s2c object that looks like this:

   Sample_ID    Specimen   Tissue   Specimen_Tissue_Combined   
1  1P           A          Phloem    A_Phloem   
2  1X           A           Xylem    A_Xylem
3  2P           B          Phloem    B_Phloem   
4  2X           B           Xylem    B_Xylem
5  3P           C          Phloem    C_Phloem
6  3X           C           Xylem    C_Xylem
7  4P           D          Phloem    D_Phloem
8  4X           D           Xylem    D_Xylem  

In order to account for the sample effects, do I perform pairwise comparison for each possible pair and take DTUs unique to the phloem vs xylem comparison? (as I can only specify two conditions, by name_A= and name_B=)

I apologize if my questions are redundant, but I wanted to make sure I was taking each step right :)

Many thanks for any help and suggestions you could provide.

Miyako

fruce-ki commented 7 years ago

Hi Miyako

Yes to both questions. But it seems that a multivariate analysis of the read counts is better suited for what you are trying to do.

Bear in mind that in breaking down your groups, you are reducing the statistical power of your analysis. From having 4 replicates of phloem and xylem you then have just 1. This alone will cause poorer p-values for all the genes, so you will probably get much fewer DTU genes for this reason alone, regardless of any real biological differences. This is unavoidable because it is simply how statistics work: more data means more confidence. On the bright side, any genes that do show up as DTU between samples are definitely genes to be careful about in interpreting your tissue comparison.

It has just occurred to me that I probably understand why you think you have strong batch effects. You looked at some of your DTU genes with plot_gene() and saw that the samples were inconsistent. Is that correct? We've seen this a lot. If that is the case, one thing to do is to raise the count threshold, as low coverage genes are noisier. Another thing to be careful about is that isoform quantification is influenced a lot by the completeness and accuracy of the annotation as well as the complexity and similarity of the isoform models. So instead of trying to compare your samples to one another with complex statistical analyses, I would first load the read alignments per sample into a genome browser and decide if the quantifications of the isoforms for these genes make sense.

miyakokodama commented 7 years ago

Hi Kimon,

I apologize it took me a while to get back to you. And thanks for your suggestions - they make really good sense. Also, yes, you are absolutely right, I did run plot_gene and saw that samples were inconsistent (I also run sleuth_live(so) using sleuth based on the same isoform counts generated by Kallisto, and did PCA, which show strong sample effects).

As suggested by your tutorial, I started using more stringent cutoffs (not only count threshold but p_thresh and dprop_thresh as well; I may also increase the count_thresh to 20 to see if it makes a difference).

mydtu_strict <- call_DTU(annot=my_identifiers_table, slo=my_sleuth_object_1, name_A="Phloem", name_B="Xylem", varname="Cell_Type", p_thresh = 0.01, count_thresh = 10, dprop_thresh = 0.25)

I also plan on looking the alignments too as you suggested.

Many thanks for taking the time to reply me – it has been really helpful!

Miyako

fruce-ki commented 7 years ago

Glad I could help. Good luck with your research!

-Kimon

fifdick commented 6 years ago

since support for sleuth import has been removed, I was wondering how or if it is still possible to add/combine 2 covariates to the model as described in the comments above, to model (correct for) known batch effects (when using directly salmon or kallisto input where there is no variable name column and the inputs are just given in two separate groups (bootdata_A and bootdata_B) to the call_dtu function). I see that the parameter "varname" is still in the call_dtu function but probably just refers to variable name of the 2 conditions being tested?

fruce-ki commented 6 years ago

Hello fifdick,

RATs has always only done pair-wise comparisons of two sets of measurements as provided, regardless of the mode of import of the data. It has never done multivariate modelling and batch effect correction. Any corrections applied by Sleuth in its own models would not be transferred to RATs. The "support" was simply that Sleuth used to store the raw input data, which made it conveniently accessible in table format.

Sample-wide factors can be applied at either the fish4rodents() step or the call_DTU() step, using the relatively new scaleto and scaling parameters, respectively. For DTU it is not necessary for the abundances to be scaled to the same library size, and it's probably best not to. Complex batch effects, affecting the expression of particular genes/transcripts differently than others, are not possible to correct from within RATs, neither at present nor in the past, but the rboot parameter should help identify genes where the isoform abundance profile is very different among samples.

The parameters varname, name_A and name_B have been kept purely as documenting metadata to be recorded in the output object. They have no purpose within RATs.

fifdick commented 6 years ago

perfect thanks for the answer. I think I just go confused by the parameters

varname, name_A and name_B