'ExtractAESPCs' (and getPathPCLs, SubsetPathwayData) functions lost colnames and rownames

gabrielodom / pathwayPCA

integrative pathway analysis with modern PCA methodology and gene selection

https://gabrielodom.github.io/pathwayPCA/


'ExtractAESPCs' (and getPathPCLs, SubsetPathwayData) functions lost colnames and rownames #42

Closed lizhongliu1996 closed 5 years ago

lizhongliu1996 commented 5 years ago

After running the function 'ExtractAESPCs', the PC results lose the column names and row names of the assay data.

gabrielodom commented 5 years ago

To each data frame element of PCs_ls (within SuperPCA_pVals() and AESPCA_pVals()), add the original row names of assayData_df (if they exist).
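A minimal base-R sketch of this step, assuming PCs_ls is a named list of data frames with one row per sample. The toy data and the has_names check are illustrative only, not the package's internal code:

```r
# Hypothetical stand-ins: a small assay data frame with sample row names,
# and a list of per-pathway PC data frames that have lost them.
assayData_df <- data.frame(
  geneA = c(0.1, 0.4, -0.3),
  geneB = c(1.2, 0.8, 1.0),
  row.names = c("S1", "S2", "S3")
)
PCs_ls <- list(
  path1 = data.frame(V1 = c(0.5, -0.2, -0.3)),
  path2 = data.frame(V1 = c(-0.1, 0.6, -0.5))
)

# .row_names_info() is negative when the row names are the automatic 1..n,
# so a positive value means the user supplied row names.
has_names <- .row_names_info(assayData_df) > 0L

if (has_names) {
  # Copy the original sample names onto every PC data frame
  PCs_ls <- lapply(PCs_ls, function(pc_df) {
    rownames(pc_df) <- rownames(assayData_df)
    pc_df
  })
}
```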

gabrielodom commented 5 years ago

Should we add a function that takes in sample-ID-labelled assay and phenotype data? We could then match/join the data ourselves within this function. This would also allow us to avoid rownames altogether.

gabrielodom commented 5 years ago

This function is internal. Also, the getPathPCLs() function returns gene rownames as needed.

lxw391 commented 5 years ago

I'm re-opening this issue: the gene names have been added, but the sample names have not, so we still need to add sample names.

lxw391 commented 5 years ago

Also, would it be possible for getPathPCLs to return more than 1 PC?

Answer: this would have to be done in the function AESPCA_pVals by specifying the numPCs parameter.

So please add the following sentence to the help file of getPathPCLs:

"Note that to extract more than one PC, the numPCs parameter in the function AESPCA_pVals needs to be modified accordingly."

Question:

How might one find out which R file contains the function getPathPCLs?

lxw391 commented 5 years ago

Sample names are especially important for multi-omics analysis. See below: the samples are ordered differently in the input datasets. Without row names for the samples, multi-omics analysis is error prone.

Please also modify the sample columns so that they match in both datasets.
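To illustrate the risk, here is a small base-R sketch with made-up data. The column names Sample, cnv_PC1, and prot_PC1 are hypothetical. When two omics tables list the same samples in different orders, binding them by position pairs the wrong rows, while aligning on the sample ID does not:

```r
# Two toy PC tables for the same three samples, stored in different orders
cnv_df <- data.frame(
  Sample = c("S1", "S2", "S3"),
  cnv_PC1 = c(0.5, -0.2, -0.3),
  stringsAsFactors = FALSE
)
prot_df <- data.frame(
  Sample = c("S3", "S1", "S2"),
  prot_PC1 = c(1.1, 0.7, -1.8),
  stringsAsFactors = FALSE
)

# Positional binding silently pairs S1 (CNV) with S3 (protein), and so on
naive_df <- cbind(cnv_df, prot_PC1 = prot_df$prot_PC1)

# Aligning on the sample ID puts each protein value next to its own sample
idx <- match(cnv_df$Sample, prot_df$Sample)
aligned_df <- cbind(cnv_df, prot_PC1 = prot_df$prot_PC1[idx])
```

Without sample IDs in the output, only the positional version is possible, which is why the mismatch can go unnoticed.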

lxw391 commented 5 years ago

For function SubsetPathwayData, could we also add rownames or an extra column for sample IDs?

CNVgene_df <- SubsetPathwayData(ovCNV_Surv, "path22")

This is what I get currently:

Without sample IDs, it's difficult to merge data for multi-omics analysis.

gabrielodom commented 5 years ago

We will take in a data frame with sample IDs as the first column for the assay data, and a data frame with sample IDs as the first column of the response data. Then, we will internally inner join the two data frames, and preserve the sample IDs.

Gabriel is unhappy with this.
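A rough sketch of the inner join proposed in this comment, using base R's merge() on toy data. The column names (Sample, eventTime, eventObserved, geneA) are assumptions for illustration; the thread does not show how the package performs the join internally:

```r
# Toy assay and response data frames, each with sample IDs in the first column.
# The samples are in different orders and only partly overlap.
assay_df <- data.frame(
  Sample = c("S3", "S1", "S2", "S4"),
  geneA = c(0.3, 0.1, 0.2, 0.4),
  stringsAsFactors = FALSE
)
pheno_df <- data.frame(
  Sample = c("S1", "S2", "S3", "S5"),
  eventTime = c(10, 25, 7, 31),
  eventObserved = c(TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# merge() defaults to all = FALSE, i.e. an inner join: only samples present
# in both data frames are kept, and the sample ID column is preserved.
joined_df <- merge(pheno_df, assay_df, by = "Sample")
```

Samples S4 and S5 each appear in only one input, so the join drops them.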

gabrielodom commented 5 years ago

To find which R file defines a function, search within the package directory: CTRL / CMD + Shift + F

gabrielodom commented 5 years ago

This is requiring even more effort than I had originally anticipated. I have to re-write all examples and testing scripts to account for the fact that we now require sample IDs for both assay and phenotype. This will take at least a full day to finish, and that doesn't even include the time to rebuild the OmicsPath class to now take in the sample IDs or the time to propagate the sample IDs through the code.

gabrielodom commented 5 years ago

I need to update the documentation across the board to mention that we require the response object to be a data frame.

lxw391 commented 5 years ago

@gabrielodom OK, I see. I also thought more about your concern about having CreateOmics merge datasets for users.

So how about we keep what we have now, except that we require the first column of the response slot in CreateOmics to be a variable called Sample? This way, CreateOmics can extract the sample IDs and pass them on to getPathPCLs.

gabrielodom commented 5 years ago

sampleIDs_char has been added as a slot to all Omics* objects. We additionally need get*() and set*() methods before we can move forward with returning these sample IDs.
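For readers unfamiliar with S4 accessors, a minimal sketch of a class with a sampleIDs_char slot plus a getter and a setter follows. The class name OmicsSketch and the accessor names getSampleIDsSketch() and setSampleIDsSketch<-() are hypothetical placeholders, not the package's actual classes or methods:

```r
library(methods)

# Hypothetical toy class: an assay data frame plus one sample ID per row
setClass(
  "OmicsSketch",
  slots = c(assayData_df = "data.frame", sampleIDs_char = "character"),
  validity = function(object) {
    if (length(object@sampleIDs_char) != nrow(object@assayData_df)) {
      "sampleIDs_char must have one entry per row of assayData_df"
    } else {
      TRUE
    }
  }
)

# Getter: return the stored sample IDs
setGeneric("getSampleIDsSketch", function(object) standardGeneric("getSampleIDsSketch"))
setMethod("getSampleIDsSketch", "OmicsSketch", function(object) object@sampleIDs_char)

# Setter: replace the sample IDs, then re-check validity
setGeneric("setSampleIDsSketch<-", function(object, value) standardGeneric("setSampleIDsSketch<-"))
setMethod("setSampleIDsSketch<-", "OmicsSketch", function(object, value) {
  object@sampleIDs_char <- value
  validObject(object)
  object
})

obj <- new(
  "OmicsSketch",
  assayData_df = data.frame(geneA = c(0.1, 0.4)),
  sampleIDs_char = c("S1", "S2")
)
setSampleIDsSketch(obj) <- c("P1", "P2")
ids <- getSampleIDsSketch(obj)
```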

gabrielodom commented 5 years ago

I wrote the CheckSampleIDs() function to help in object creation. Also, I had to make more edits to the CreateOmics() function to support these Omics*-slot changes.

gabrielodom commented 5 years ago

Updates:

The SubsetPathwayData() function now returns a data frame with leading sample IDs column
The getPathPCLs() function now returns the PCs as a tidy data frame with leading sample IDs column and the loadings as a tidy data frame with leading feature IDs column
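The output shape described here can be sketched in base R as follows. The helper tidy_with_ids() and the column names sampleID and featureID are assumptions for illustration, not necessarily what the package uses:

```r
# Toy PC scores (samples x PCs) and loadings (features x PCs) with dimnames
pcs_mat <- matrix(
  c(0.5, -0.2, -0.3, 0.1, 0.3, -0.4),
  ncol = 2,
  dimnames = list(c("S1", "S2", "S3"), c("V1", "V2"))
)
loadings_mat <- matrix(
  c(0.8, 0.6, 0, -0.6, 0.8, 0),
  ncol = 2,
  dimnames = list(c("geneA", "geneB", "geneC"), c("PC1", "PC2"))
)

# Hypothetical helper: move the row names into a leading ID column
tidy_with_ids <- function(x_mat, id_name) {
  out_df <- cbind(
    data.frame(id = rownames(x_mat), stringsAsFactors = FALSE),
    as.data.frame(x_mat)
  )
  rownames(out_df) <- NULL   # drop row names; the IDs now live in a column
  names(out_df)[1] <- id_name
  out_df
}

pcs_df <- tidy_with_ids(pcs_mat, "sampleID")
loadings_df <- tidy_with_ids(loadings_mat, "featureID")
```

Keeping the IDs in an ordinary column rather than in row names means they survive joins and subsetting without special handling.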

gabrielodom commented 5 years ago

@lizhongliu1996, I think this is finished. Please re-test your code. Note that you will have to supply the sample IDs as the first column of the assay and response data frames, as appropriate.

lizhongliu1996 commented 5 years ago

I tested it with the Proteo and CopyNumber datasets; both the SubsetPathwayData() and getPathPCLs() functions work. Note, however, that when using CreateOmics(), the argument assayData_df = dataset[, -(2:x)] should start from the 2nd column.
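One possible reading of this indexing, sketched on toy data: assume column 1 holds the sample IDs and columns 2 through x hold the response. The layout, the value of x, and all column names are assumptions. Dropping columns 2:x then leaves the sample ID column followed by the assay columns:

```r
# Toy combined dataset: sample ID, two response columns, two assay columns
dataset <- data.frame(
  Sample = c("S1", "S2"),
  eventTime = c(10, 25),
  eventObserved = c(TRUE, FALSE),
  geneA = c(0.1, 0.4),
  geneB = c(1.2, 0.8),
  stringsAsFactors = FALSE
)
x <- 3L  # assumed index of the last response column

# Response: sample ID plus the response columns
response_df <- dataset[, 1:x]

# Assay: drop the response columns but keep the sample ID in column 1
assay_df <- dataset[, -(2:x)]
```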

gabrielodom commented 5 years ago

Yes, the sample ID column is now required for both the assay and response. Thanks!!

gabrielodom commented 5 years ago

Sample IDs are lost in the LoadOntoPCs function.