bhklab / MetaGxBreast

row names in phenotype data #9

Open Nairobi-2020 opened 1 year ago

Nairobi-2020 commented 1 year ago

After loading the TCGA data, I found that row.names(phenotype) differs from both the "sample_name" and "alt_sample_name" columns, while row.names(phenotype) is identical to colnames(expr). I suspect the programmer initially got an error because row.names(phenotype) != colnames(expr), and then forced them to be equal by assigning: row.names(phenotype) = colnames(expr)

If so, the assignment exists only to avoid an error when creating the eset object, row.names(phenotype) is incorrect, and we should ignore it.

Am I correct?

Nairobi-2020 commented 1 year ago

I found that 5 datasets in the package have the sample name problem:

[1] "DUKE: FALSE FALSE TRUE"
[1] "EXPO: FALSE FALSE TRUE"
[1] "STNO2: FALSE FALSE TRUE"
[1] "TCGA: FALSE FALSE TRUE"
[1] "TRANSBIG: FALSE FALSE TRUE"

The 3 logical columns are:

a = identical(row.names(pheno), pheno$sample_name)
b = identical(names(mrna), pheno$sample_name)
c = identical(names(mrna), row.names(pheno))
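For anyone wanting to reproduce the three checks, here is a minimal sketch in base R. The `pheno` and `mrna` objects are toy stand-ins with made-up values, and `check_names` is a hypothetical helper; nothing here is loaded from MetaGxBreast itself.

```r
# Toy stand-ins (hypothetical values, not real MetaGxBreast data).
# Row names use "." while sample_name keeps the original "-".
pheno <- data.frame(
  sample_name = c("DUKE_T01-145", "DUKE_T02-146"),
  row.names = c("DUKE_T01.145", "DUKE_T02.146"),
  stringsAsFactors = FALSE
)
mrna <- data.frame(
  DUKE_T01.145 = c(1.2, 3.4),
  DUKE_T02.146 = c(5.6, 7.8)
)

# Hypothetical helper: the three identity checks described above.
check_names <- function(pheno, mrna) {
  c(
    a = identical(row.names(pheno), pheno$sample_name),
    b = identical(names(mrna), pheno$sample_name),
    c = identical(names(mrna), row.names(pheno))
  )
}

res <- check_names(pheno, mrna)
print(res)  # for this toy input: a = FALSE, b = FALSE, c = TRUE
```

The toy input reproduces the FALSE FALSE TRUE pattern: the expression columns agree with the phenotype row names, but neither agrees with the sample_name column.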

Nairobi-2020 commented 1 year ago

With some character substitutions, I can match the sample names for the other datasets, but not for TCGA. I think there is a strong contradiction in the data, and the question is which set of names I should stick with.
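One way to make the "character substitutions" step explicit is to normalise both sets of IDs with the same rule and then list what still fails to match. The sketch below is only an illustration: the normalisation rule, the helper names `normalise_id` and `compare_ids`, and the sample IDs are all assumptions, not anything taken from the package.

```r
# Assumed rule: collapse every run of non-alphanumeric characters to "."
# and upper-case, so "duke_t01-145" and "DUKE_T01.145" compare equal.
normalise_id <- function(x) toupper(gsub("[^A-Za-z0-9]+", ".", x))

# Hypothetical helper: report which normalised IDs match and which do not.
compare_ids <- function(pheno_ids, expr_ids) {
  p <- normalise_id(pheno_ids)
  e <- normalise_id(expr_ids)
  list(
    matched = intersect(p, e),
    only_in_pheno = setdiff(p, e),
    only_in_expr = setdiff(e, p)
  )
}

# Made-up IDs, for illustration only.
pheno_ids <- c("DUKE_T01-145", "DUKE_T02-146", "DUKE_T03-147")
expr_ids <- c("DUKE_T01.145", "DUKE_T02.146", "DUKE_T09.999")

report <- compare_ids(pheno_ids, expr_ids)
str(report)
```

IDs left in `only_in_pheno` or `only_in_expr` are the ones no substitution rule rescues; those would have to be resolved against the upstream source rather than by renaming.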

Nairobi-2020 commented 1 year ago

This is getting quite frustrating.

So I downloaded the TCGA data from cBioPortal and compared its clinical data with that from MetaGxBreast. They cannot be matched up. I tried both ways: trusting the sample names from the expr data, and trusting the sample names from the sample name column in the phenotype data.

This is frustrating because I think I need to discard all computations so far.

DarioS commented 1 year ago

This is because the column names of the expression matrix must be syntactically valid R names. See ?make.names for more information.

> make.names("DUKE_T01-145")
[1] "DUKE_T01.145"

R applies make.names to all column names of the expression table, so the developer had to apply it to the clinical row names as well.
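A minimal sketch of this behaviour in base R, using made-up sample IDs: data.frame() (like read.table()) passes column names through make.names() when check.names = TRUE, which is the default. Whether the MetaGxBreast curation pipeline went through exactly this path is an assumption; the thread does not confirm it.

```r
# Made-up sample IDs containing "-", which is not allowed in a syntactic name.
ids <- c("DUKE_T01-145", "DUKE_T02-146")
m <- matrix(0, nrow = 3, ncol = 2, dimnames = list(NULL, ids))

# Default: check.names = TRUE, so column names are sanitised via make.names().
expr_default <- data.frame(m)

# check.names = FALSE keeps the IDs exactly as given.
expr_raw <- data.frame(m, check.names = FALSE)

print(names(expr_default))
print(names(expr_raw))

# Applying make.names() to the clinical IDs brings them back in line
# with the sanitised expression column names.
pheno <- data.frame(sample_name = ids, stringsAsFactors = FALSE)
rownames(pheno) <- make.names(pheno$sample_name)
print(identical(rownames(pheno), names(expr_default)))
```

Under these assumptions, the original IDs are still recoverable from the sample_name column, and make.names(pheno$sample_name) gives the key to join on.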

TCGA has different versions of its clinical data; the values depend on the snapshot date. cBioPortal provides outdated data: the latest data is provided by the Genomic Data Commons (https://portal.gdc.cancer.gov/), whereas cBioPortal has the Broad Firehose legacy version.

Nairobi-2020 commented 1 year ago

No, the data there is wrong; you messed up the sample names on several occasions. Besides, the data is totally out of date; much more powerful data has come out since.

(Sent by email in reply to Dario Strbenac's comment above, written Tue, Jun 27, 2023 at 8:00 AM.)