Closed miyakokodama closed 8 years ago
Hello Miyako,
Thank you for taking an interest in Rats!
First of all, the DTE results by Sleuth are not used in Rats at all. So the parameters with which you run Sleuth are completely irrelevant. You don't need to re-run Sleuth to change how you run Rats. Rats only needs the bootstrapped Kallisto quantifications and the sample grouping (essentially Sleuth's input), both of which are conveniently already stored in the Sleuth output object.
Secondly, Sleuth (and Rats) use the term "sample number" to refer to row number in the sample_to_covariate
(s2c) table. Each row is a separate sequencing sample (replicate/run/lane). In your examples you have a covariate column called Sample_Number
. This should be fine for the tool but makes things confusing for us humans, since the same name refers to two different things. I'm guessing that the column refers to the biological specimen where the RNA came from rather than the sequencing sample?
Now on to the question(s).
Rats does a straight-forward comparison of any two categories from any single column of that table. If a column has more than 2 categories in it, only two of the categories can be compared each time. This is because Rats does not search for batch effects or hidden/unknown covariates.
Since you know you have batch effects in your data, you can add them to your s2c
as covariate columns.
Here are some examples that hopefully demonstrate how to add known effects and how to combine covariates.
Let's say you have the following s2c
:
Sample_ID Specimen Tissue
1 1P A Phloem
2 1X A Xylem
3 2P B Phloem
4 2X B Xylem
5 3P C Phloem
6 3X C Xylem
To compare all Phloem against all Xylem you'd normally run:
call_DTU(annot=my_identifiers_table, slo=my_sleuth_object, name_A="Xylem", name_B="Phloem", varname=" Tissue")
Now let's say you have a batch effect. For example samples 1P, 2P, 2X and 3X were prepared by a different person. You need a new s2c
column for that:
Sample_ID Specimen Tissue Person
1 1P A Phloem Joe
2 1X A Xylem Ben
3 2P B Phloem Joe
4 2X B Xylem Joe
5 3P C Phloem Ben
6 3X C Xylem Joe
You can compare all of Joe's samples against all of Ben's:
call_DTU(annot=my_identifiers_table, slo=my_sleuth_object, name_A="Joe", name_B="Ben", varname="Person")
To combine covariates you create a new one. For example to compare phloem prepared by Joe (1P and 2P) against phloem prepared by Ben (3P):
Sample_ID Specimen Tissue Person Tiss_Per
1 1P A Phloem Joe PJ
2 1X A Xylem Ben XB
3 2P B Phloem Joe PJ
4 2X B Xylem Joe XJ
5 3P C Phloem Ben PB
6 3X C Xylem Joe XJ
call_DTU(annot=my_identifiers_table, slo=my_sleuth_object, name_A="PJ", name_B="PB", varname="Tiss_Per")
I hope this helps!
-Kimon
Regarding the reason you got different number of DTU genes in Question 1, this is probably down to statistical noise.
As I said, Rats only uses the input data of Sleuth, so both runs should be using exactly the same data, even if the Sleuth models are different. What happens is that Rats randomly samples the kallisto quantification bootstraps, in order to asses how uncertainty in the quantifications affects the DTU calls. The random sampling is different in each run, meaning that some genes near the Rats confidence threshold can end up on different sides in different runs. I suspect that if you look at the boot_dtu_freq
column in the Genes
table for these 30 genes, this is what's happening. In fact, the total number of genes affected may be greater than 30, 30 is just the net difference between genes gained and genes lost.
A way to reduce the number of genes affected by this is to run more bootstrap iterations for Rats (and ideally also for Kallisto).
Hi Kimon,
Thanks so much for getting back to me so quickly - your explanations are really helpful! According to your second comment, I definitely plan on running more bootstrap iterations (I used -b 200 for Kallisto, and 100 for RATS).
If I understood you correctly, let’s say I have the following s2c
object:
Sample_ID Specimen Tissue Specimen_Tissue_Combined
1 1P A Phloem A_Phloem
2 1X A Xylem A_Xylem
3 2P B Phloem B_Phloem
4 2X B Xylem B_Xylem
And if I want to account for the sample effects, do I have to take the following steps:
my_sleuth_object_1 <- sleuth_prep(s2c, ~ Cell_Type)
# Phloem vs xylem comparison
mydtu_1 <- call_DTU(annot=my_identifiers_table, slo=my_sleuth_object_1, name_A="Phloem", name_B="Xylem", varname="Cell_Type")
# A_Phloem vs B_Phloem comparison
mydtu_2 <- call_DTU(annot=my_identifiers_table, slo=my_sleuth_object_1, name_A="A_Phloem", name_B="B_Phloem", varname="Specimen_Tissue_Combined")
# A_Xylem vs B_Xylem comparison
mydtu_3 <- call_DTU(annot=my_identifiers_table, slo=my_sleuth_object_1, name_A="A_Xylem", name_B="B_Xylem", varname="Specimen_Tissue_Combined")
Then take DTUs that are unique to mydtu_1, by comparing the results from mydtu_1, mydut_2 and mydtu_3?
Also, if I have an s2c
object that looks like this:
Sample_ID Specimen Tissue Specimen_Tissue_Combined
1 1P A Phloem A_Phloem
2 1X A Xylem A_Xylem
3 2P B Phloem B_Phloem
4 2X B Xylem B_Xylem
5 3P C Phloem C_Phloem
6 3X C Xylem C_Xylem
7 4P D Phloem D_Phloem
8 4X D Xylem D_Xylem
In order to account for the sample effects, do I perform pairwise comparison for each possible pair and take DTUs unique to the phloem vs xylem comparison? (as I can only specify two conditions, by name_A= and name_B=
)
I apologize if my questions are redundant, but I wanted to make sure I was taking each step right :)
Many thanks for any help and suggestions you could provide.
Miyako
Hi Miyako
Yes to both questions. But it seems that a multivariate analysis of the read counts is better suited for what you are trying to do.
Bear in mind that in breaking down your groups, you are reducing the statistical power of your analysis. From having 4 replicates of phloem and xylem you then have just 1. This alone will cause poorer p-values for all the genes, so you will probably get much fewer DTU genes for this reason alone, regardless of any real biological differences. This is unavoidable because it is simply how statistics work: more data means more confidence. On the bright side, any genes that do show up as DTU between samples are definitely genes to be careful about in interpreting your tissue comparison.
It has just occurred to me that I probably understand why you think you have strong batch effects. You looked at some of your DTU genes with plot_gene() and saw that the samples were inconsistent. Is that correct? We've seen this a lot. If that is the case, one thing to do is to raise the count threshold, as low coverage genes are noisier. Another thing to be careful about is that isoform quantification is influenced a lot by the completeness and accuracy of the annotation as well as the complexity and similarity of the isoform models. So instead of trying to compare your samples to one another with complex statistical analyses, I would first load the read alignments per sample into a genome browser and decide if the quantifications of the isoforms for these genes make sense.
Hi Kimon,
I apologize it took me a while to get back to you. And thanks for your suggestions - they make really good sense. Also, yes, you are absolutely right, I did run plot_gene and saw that samples were inconsistent (I also run sleuth_live(so) using sleuth based on the same isoform counts generated by Kallisto, and did PCA, which show strong sample effects).
As suggested by your tutorial, I started using more stringent cutoffs (not only count threshold but p_thresh
and dprop_thresh
as well; I may also increase the count_thresh to 20 to see if it makes a difference).
mydtu_strict <- call_DTU(annot=my_identifiers_table, slo=my_sleuth_object_1, name_A="Phloem", name_B="Xylem", varname="Cell_Type", p_thresh = 0.01, count_thresh = 10, dprop_thresh = 0.25)
I also plan on looking the alignments too as you suggested.
Many thanks for taking the time to reply me – it has been really helpful!
Miyako
Glad I could help. Good luck with your research!
-Kimon
since support for sleuth import has been removed, I was wondering how or if it is still possible to add/combine 2 covariates to the model as described in the comments above, to model (correct for) known batch effects (when using directly salmon or kallisto input where there is no variable name column and the inputs are just given in two separate groups (bootdata_A and bootdata_B) to the call_dtu function). I see that the parameter "varname" is still in the call_dtu function but probably just refers to variable name of the 2 conditions being tested?
Hello fifdick,
RATs has always only done pair-wise comparisons of two sets of measurements as provided, regardless of the mode of import of the data. It has never done multivariate modelling and batch effect correction. Any corrections applied by Sleuth in its own models would not be transferred to RATs. The "support" was simply that Sleuth used to store the raw input data, which made it conveniently accessible in table format.
Sample-wide factors can be applied at either the fish4rodents()
step or the call_DTU()
step, using the relatively new scaleto
and scaling
parameters, respectively. For DTU it is not necessary for the abundances to be scaled to the same library size, and it's probably best not to.
Complex batch effects, affecting the expression of particular genes/transcripts differently than others, are not possible to correct from within RATs, neither at present nor in the past, but the rboot
parameter should help identify genes where the isoform abundance profile is very different among samples.
The parameters varname
, name_A
and name_B
have been kept purely as documenting metadata to be recorded in the output object. They have no purpose within RATs.
perfect thanks for the answer. I think I just go confused by the parameters
varname, name_A and name_B
Hi,
First of all, thanks so much for developing this wonderful tool! It seems to do exactly what I need to do, I like how easy it is to visualize the results, and I am very happy with my results so far.
That being said, I am very new to RATS – I have gone through the tutorial and am hoping if you could help me understand how this package works:
My data: I have two conditions (2 cell types – phloem and xylem), along with their biological sample ID. (heads up – I have very strong sample effects in my data). I have performed isoform quantification using Kallisto.
This is how my “sample-to-condition table (s2c)” looks like.
Sample_ID Sample_Number Cell_Type 1P 1 Phloem 1X 1 Xylem 2P 2 Phloem 2X 2 Xylem
Using Sleuth, I made a sleuth object.
Then, I used call_DTU fuction to call DTU.
QUESTION 1:
I have strong sample effects in my data set, so I’d like to account for these effects when calling for DTU. On the tutorial, it says “Please note that currently only one covariate is used in each run. therefore, any comparison you may wish to make must be encoded as a single column in myslo$sample_to_covariates, by modifying the table if necessary.” Does this mean,
These two would be the same? I run both and compared – the results are different, and mydtu_1 has ~30 more DTU genes than mydtu_2.
QUESTION 2:
Assuming that I can only account for one covariate, how do I account for sample effects in my data? According to the tutorial under “Comparing by different variables (such as batch effects)”, do I perform:
And then compare the results between mydtu_1 and mydtu_2?
QUESTION 3:
Finally, I have another data set, which looks like this:
Sample_ID Sample_Number Cell_Type 3P 3 Phloem 3X 3 Xylem 4P 4 Phloem 4X 4 Xylem 5P 5 Phloem 5X 5 Xylem 6P 6 Phloem 6X 6 Xylem
Can RATS account for sample effects in the data like this, as there are more than two variables in the Sample_Number (4 samples in total)?
I would appreciate any help you could provide, and many thanks for reading this.
Miyako