BostonGene / MFP

Molecular Functional Portraits

format of expression matrix #5

Open joan-yanqiong opened 2 years ago

joan-yanqiong commented 2 years ago

Hi,

I am trying to compute the ssGSEA scores using the following code (similar to the given example):

    # Read signatures
    gmt = read_gene_sets('./signatures/gene_signatures.gmt')  # GMT format like in MSIGdb

    # Read expressions
    counts = pd.read_csv("../../Data/RNAseq/TCGA_tpm_LUAD.txt", sep="\t")
    counts_transformed = np.log2(counts + 1)

    # Calc signature scores
    signature_scores = ssgsea_formula(counts_transformed, gmt)

    # Scale signatures
    signature_scores = median_scale(signature_scores)

Should the counts matrix (DataFrame) have rows = genes and columns = samples? When I pass it in that format, the ssgsea_scores() function does not work.

This is from the ssgsea_formula() function: ranks = data.T.rank(method=rank_method, na_option='bottom')

  1. data -> rows = genes, columns = samples
  2. data.T -> rows = samples, columns = genes
  3. data.T.rank(...) -> ranks.index = samples, because rank() uses axis=0 by default and keeps the index of data.T.
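
To make the three steps above concrete, here is a minimal sketch with a toy matrix. It only reproduces the quoted `rank` line with plain pandas and does not call anything from MFP; the gene and sample names and the values are made up, and `method="max"` is an assumption standing in for `rank_method`.

```python
import pandas as pd

# Toy expression matrix (made-up values): rows = genes, columns = samples.
data = pd.DataFrame(
    {"sample_1": [5.0, 1.0, 3.0], "sample_2": [2.0, 4.0, 6.0]},
    index=["GENE_A", "GENE_B", "GENE_C"],
)

# Same call shape as the line quoted from ssgsea_formula().
# method="max" is an assumed value for rank_method.
ranks = data.T.rank(method="max", na_option="bottom")

# With genes-as-rows input, the ranked frame is indexed by samples,
# and the ranking runs across samples within each gene.
ranks_index = list(ranks.index)
gene_a_ranks = ranks["GENE_A"].tolist()
print(ranks_index)
print(gene_a_ranks)

# If the input has samples as rows instead, the same line ranks
# genes within each sample and the result is indexed by genes.
data_samples_as_rows = data.T
ranks_alt = data_samples_as_rows.T.rank(method="max", na_option="bottom")
ranks_alt_index = list(ranks_alt.index)
sample_1_ranks = ranks_alt["sample_1"].tolist()
print(ranks_alt_index)
print(sample_1_ranks)
```

In the first case the ranks are indexed by samples, so a lookup of gene names in `ranks.index` would find nothing; in the second case the index holds the genes.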

So is it correct to say that the input to ssgsea_formula() must be counts_transformed with rows = samples and columns = genes (or, alternatively, that the '.T' inside ssgsea_formula() itself should be removed)?

Versions of the packages I'm using: pandas==1.4.2, numpy==1.22.3