GavinHaLab / CRPCSubtypingPaper

Analysis code used in ctDNA CRPC phenotype manuscript

What is the correct format for the input feature matrix prior to pickling? #1

Open alanasweinstein opened 1 year ago

alanasweinstein commented 1 year ago

Hello @GavinHaLab and @denniepatton,

Thank you for sharing the analysis pipeline used in your CRPC subtyping paper and for such an important contribution to the community.

I'm working my way through the pipeline -- I ran ichorCNA and Griffin, and I would now like to implement ctdPheno. However, I'm having trouble determining how to format the input feature matrix (before pickling). I understand from the documentation that the matrix contains features for both reference data and samples of interest, and I am guessing the matrix includes the features shared in the CRPCSubtypingPaper/Data/ directory. But what exactly should the matrix look like before it is pickled? For example, which data are the rows vs. columns (features vs. samples, or the transpose?) and is there a required naming convention, order of features, additional normalization or formatting, etc. beyond concatenating the features as given in the Data directory above?

I searched the code and couldn't find an example of the feature matrix before pickling; just the name of the pickle file hard-coded in main() of "ctdPheno.py". I did find a reference to formatting a pickle file with an "ExploreFM.py" pipeline in the SupervisedLearning section: https://github.com/GavinHaLab/CRPCSubtypingPaper/blob/52119b92e8383533e3ca12bfac4a677492645f1b/SupervisedLearning/XGBClassifier.py#L499 but found no additional info on that pipeline.

Could you please provide an example file showing the required format of the input matrix, and/or direct me toward any documentation explaining this that I may have missed?

Thank you very much, Alana Weinstein

denniepatton commented 1 year ago

Hi Alana, thank you for your interest!

The data frames expected by ctdPheno, both for the reference cohort (LuCaPs and healthy donors) and for the test samples, simply consist of sample names as row names/indices and feature-site names as columns, exactly as they are presented in the CRPCSubtypingPaper/Data/ directory. In addition, each data frame requires one more column: for the references, "Subtype", which should be either HD or ARPC/NEPC (or some other 2-way comparison), and for the test samples, "TFX", the tumor fraction you have estimated with ichorCNA.

Apologies for any confusion. We are hoping to release ctdPheno in the near future as a more polished stand-alone tool, as opposed to a script for reproducing the results from the paper. In the meantime I have attached 2 updated scripts which I hope will help you out: ctdPheno_v2.py is more generalized and will take in tidy-form data frames, and ctdPheno_v2_CD.py will take in the feature matrices referenced above directly (I have tested it to make sure), so if you just replace the hardcoded references in that script with your own sample values and tumor fraction estimates, that should do the trick. I also produced a simplified file with reference labels, which is attached as well. If you decide to use ctdPheno_v2.py, the argument descriptions should make the inputs clear, but of course let us know if you have any other problems.
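To make the layout described above concrete, here is a minimal pandas sketch. It is only an illustration under stated assumptions: the sample names, feature-site column names, numeric values, and pickle file names are all hypothetical, and writing the reference and test frames to two separate pickle files is a guess rather than something the attached scripts are confirmed to require.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical feature-site column names; the real ones come from the
# files in CRPCSubtypingPaper/Data/.
feature_cols = ["siteA_featureX", "siteB_featureX"]

# Reference cohort: sample names as the index, one column per feature-site,
# plus a "Subtype" label column (a 2-way comparison such as ARPC vs. NEPC).
reference_df = pd.DataFrame(
    [[0.81, 1.02], [1.05, 0.78], [0.95, 0.97]],
    index=["ref_sample_1", "ref_sample_2", "ref_sample_3"],
    columns=feature_cols,
)
reference_df["Subtype"] = ["ARPC", "NEPC", "ARPC"]

# Test samples: same feature-site columns, plus "TFX", the tumor fraction
# estimated with ichorCNA (values here are made up).
test_df = pd.DataFrame(
    [[0.90, 0.99], [1.01, 0.85]],
    index=["patient_1", "patient_2"],
    columns=feature_cols,
)
test_df["TFX"] = [0.12, 0.34]

# Pickle both frames and read them back to confirm nothing changes.
# The file names are placeholders, not the names hard-coded in ctdPheno.py.
with tempfile.TemporaryDirectory() as tmp_dir:
    ref_path = Path(tmp_dir) / "reference_features.pkl"
    test_path = Path(tmp_dir) / "test_features.pkl"
    reference_df.to_pickle(ref_path)
    test_df.to_pickle(test_path)
    restored_ref = pd.read_pickle(ref_path)
    restored_test = pd.read_pickle(test_path)

print(restored_ref)
print(restored_test)
```

Printing the restored frames is a quick way to check that rows are samples, columns are feature-sites, and the extra "Subtype" or "TFX" column is present before handing the files to the script.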

All the best, Robert Patton

Attachment: ForAlana.zip