Matrix with non-existent genes

ktroule commented 6 years ago

Hi

I've question, whenever xCell performs the ssGSEA analysis, it only uses the ~10,000 genes from the immune signatures, right? So, no problem if the input matrix is a 60,000 transcript or just 20,000 protein coding genes, right?

On the other hand, when you perform the ssGSEA using GSVA, do you account for the maximum and minimum peak of the enrichment? From the manuscript I understand that the genes from signature are differentially expressed ¿both directions, up/down? How do you account for signatures having genes up and down regulated?

And finally, I've not been able to find what exactly the scale option does.

Thanks

dviraran commented 6 years ago

Hi,

I'll try to answer -

I've question, whenever xCell performs the ssGSEA analysis, it only uses the ~10,000 genes from the immune signatures, right? So, no problem if the input matrix is a 60,000 transcript or just 20,000 protein coding genes, right?

Exactly. I chose those genes because they were found in all the 6 data sources I used, but still, be aware of how many of the ten thousand you actually have in your data.

On the other hand, when you perform the ssGSEA using GSVA, do you account for the maximum and minimum peak of the enrichment?

I account for the minimum of the enrichment. Maximum peak is problematic. I wrote a little about it in the manuscript - ssGSEA, at least in RNA-seq data, tends to grow exponentially. The variation in high scores across samples is super high. This is why we use 25% mixtures in the transformation and spillover compensation (the range where it is still close to linear).

Accounting for the minimum is also problematic. Since there are major differences in the minimum among datasets, the minimal xCell score may be misleading if not using a heterogenous enough dataset. It is recommended to use xCell with all available samples, and not just subsets.

From the manuscript I understand that the genes from signature are differentially expressed ¿both directions, up/down? How do you account for signatures having genes up and downregulated?

No, only upregulated genes are in the signatures.

And finally, I've not been able to find what exactly the scale option does.

The scaling is an approximation parameter I learn from the data in an attempt to scale the scores to fractions. Because its required for the spillover compensation, its best to set it to TRUE (the deafault). I see that its missing from the documentation. I will add it soon.

Hope this helps.

Best, Dvir

ktroule commented 6 years ago

Thanks for the explanation.

ktroule commented 6 years ago

Hi

Would it be possible to know how you obtained the values of the precalculated scores file for the ~9000 TCGA samples?

I've done it myself using the xCell R library (I'm interested also in normal tissues, ~11000 samples in total). I observed that in your file the maximum score value is about 5 and that you have no 0. In my case I have plenty of 0 (about 17% os samples) and the highest score is around the 2.7.

I understand that adding samples will change in a way the scores, but I wouldn't expect those changes.

Just in case I did something wrong:

Downloaded the raw count.
Translate counts to TPM (no other normalization was performed)
Converted Ensemble id to gene id
Remove genes that are not protein coding (using Hugo db)
Use the xCell library

tcga.cds.scores <- xCellAnalysis(tcga.counts.cds, rnaseq = T, parallel.sz = 20)

Thanks

dviraran commented 6 years ago

Hi,

Sure. First, I used the RSEM data from Xena, not the TPM data (this is what was available back then) - https://xenabrowser.net/datapages/?cohort=TCGA%20TARGET%20GTEx.

I then removed the GTEx samples and non-primary tumors, and ran the pipeline with default parameters for all the rest.

Hope this helps.

Best, Dvir

ktroule commented 6 years ago

Would you expect such changes in the scores because of the use of TPM rather than RSEM? I'm not worried about the max values but I'm worried about having so many 0 values.

dviraran commented 6 years ago

Not really. Eventually, I would look at the correlation between the scores.

Regarding the zeros - I guess the reason is simply that I didn't run this analysis with the latest xCell version. In one of the last commits, I added rounding of the scores to 5 digits after the decimal place.

ktroule commented 6 years ago

Thanks. I'm on the way to look at the correlation.

Thanks for your time.

ktroule commented 6 years ago

Just did a pretty simple correlation, for most seems to be ok, but there are a few not so good.

corr.scores.pdf

dviraran commented 6 years ago

Very weird. Can you upload your scores? I'm trying to understand whats going on here.

I reran the github version of xCell with the RSEM data and getting the same results as in the paper.

ktroule commented 6 years ago

Sure, I downloaded the data using TCGAbiolinks. I used HTSeq count, not sure if that could be a problem source.

query <- GDCquery(project = tcga.projects, data.category = 'Transcriptome Profiling', 
                  workflow.type = 'HTSeq - Counts' ,legacy = FALSE)

TPM conversion + xCell Script https://drive.google.com/open?id=1_ETI91lOlZ3yr8lP0O6IUQ9uab-Qxr8U Scores matrix (RDS) https://drive.google.com/open?id=1kILnBw47i4dO8Hy4qvlYF8Xi4kGGIaeO

dviraran commented 6 years ago

Hi,

I downloaded your xCell scores and compared them to those published. As I suspected, the reason for the really bad correlations is just because the cell types in both files are ordered differently. For example in your data B-cells is in the 5th row, while in the paper data its in 4th row. This fixes most of the discrepancies (see below).

I am not surprised that the correlations are not 100%. The data you were using has been processed in different methods from the raw data than the one I used. Several papers have shown already that this has substantial effects on the TCGA gene expression analysis. I still believe that my approach of using the rankings instead of the actual values is more 'immune' to these differences than deconvolution methods that require the actual gene expression values.

Still, it is surprising that there are yet some cell types with low or no correlations (Tgd, CD4+ Tem and osteoblasts). I am guessing that those are results of the spillover correction, but it needs further investigation. I don't know which of the scores is more reliable.

ktroule commented 6 years ago

Thanks. I did not notice that matrix had different orders, something that I should have checked .

Thanks for this explanation, I'm a little bit worried about the high correlation between the macrophages, it's impossible to tell whether its happen by biological or technical reasons (ideally only the first).

I've also noticed weird scores Tgd.

Because of the use I'm giving to theses scores I can tell you that it looks a little bit better the scores from the RSEM rather than the ones from HTSEq. But I still need more work to be sure fo this.

Once more, thanks for your time.

dviraran / xCell

Matrix with non-existent genes #11