ChristofferFlensburg / superFreq

Analysis pipeline for cancer sequencing data
MIT License

Concatenating metadata for runSummaryPostAnalysis #82

Closed by UmairAhmadKhan97 3 years ago

UmairAhmadKhan97 commented 3 years ago

Hi,

I'm trying to run the post-analysis summary for a cohort of 10 patients. I had to run these patients separately, but concatenated the metaData appropriately. Because I made the R and plots directories point to the current directory rather than a separate, new folder, all the plots/data are in a Sample_XYZ directory in the CWD. The sample directories are also spread across different directories. For example, some Sample_XYZ directories with plots and data may be in the CWD, whereas others that belong to the same cohort are in a "batch2" directory relative to the CWD.

What I did so far was symlink the Sample directories into a new directory, R, and run the runSummaryPostAnalysis function in a new batch job and R script, with the R and plots directories both set to the new R directory containing all the symlinked Sample directories.

Here is the current error output I am getting. Any clues on this? Thank you!

Running legacy cohort analysis of CNAs and point mutations.
Creating directory /gpfs/commons/groups/landau_lab/team_CLL/Celgene_project/R/cohort
Creating directory /gpfs/commons/groups/landau_lab/team_CLL/Celgene_project/R/cohort/data
Error in data.frame(metaDataFile = metaDataFiles, Rdirectory = Rdirectories) :
  arguments imply differing number of rows: 5, 12
Calls: runSummaryPostAnalysis -> -> data.frame

ChristofferFlensburg commented 3 years ago

Hey,

That error message is from the last step of the analysis, so I think you should already have most of the output. The last step gives you the meanCNV plot (showing CNA rates across the genome), which is nice for large cohorts of 100+ cases, but likely not the main purpose of a 10-person cohort analysis. So you're not missing out on much, but let's try to make it run.

This last step is legacy code, which is why it has some quirks in what it expects from the metadata. In particular, when you run superFreq on a large metadata file but pick out only one individual, superFreq splits the metadata file into separate files and puts them in paste0(dirname(metaDataFile), "/splitMetaData"), i.e. next to your metadata file. The legacy code depends on these split files (unlike the rest of the code, which reads directly from the metadata file you set up), and as you didn't run superFreq() on the large metadata file, these split files probably don't exist, or at least not all of them.
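To illustrate how the error arises, here is a minimal base R sketch, assuming the two numbers in the message correspond to 5 split metadata files and 12 sample directories. The file and directory names are hypothetical stand-ins, not what superFreq actually writes.

```r
# Hypothetical stand-ins: 5 split metadata files versus 12 sample directories.
metaDataFiles <- paste0("splitMetaData/individual", 1:5, ".tsv")
Rdirectories <- paste0("R/Sample_", 1:12)

# data.frame() cannot recycle a length-5 column to match a length-12 column,
# so it stops with an error; tryCatch captures the message instead of aborting.
result <- tryCatch(
  data.frame(metaDataFile = metaDataFiles, Rdirectory = Rdirectories),
  error = function(e) conditionMessage(e)
)
print(result)
```

Once both vectors have the same length, the same data.frame() call succeeds.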

You should be able to create the split metadata files by loading superFreq and running

library(superFreq)
superFreq:::splitMetaData(metaDataFile, Rdirectory, plotDirectory)

with metaDataFile as the large metadata file you are using for the cohort analysis. Rdirectory and plotDirectory don't really matter here; the function only makes sure they exist so that downstream analysis can create subdirectories for individuals inside them.

With some luck, runSummaryPostAnalysis should complete after that.
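Before rerunning, it may help to confirm that the split files were actually written. The following is a minimal sketch in base R, assuming only the location described above (a splitMetaData directory next to the metadata file). The helper name listSplitMetaData and the example path are hypothetical, and the sketch assumes nothing about how the split files are named.

```r
# Hypothetical helper: list whatever sits in the splitMetaData directory
# next to the given metadata file.
listSplitMetaData <- function(metaDataFile) {
  splitDir <- paste0(dirname(metaDataFile), "/splitMetaData")
  if (!dir.exists(splitDir)) return(character(0))
  list.files(splitDir, full.names = TRUE)
}

# Hypothetical path; replace with the cohort metadata file you are using.
metaDataFile <- "metaData.tsv"
splitFiles <- listSplitMetaData(metaDataFile)
cat(length(splitFiles), "file(s) found in splitMetaData\n")
```

If the count is zero, or lower than expected for the cohort, the split step probably did not cover every individual.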

UmairAhmadKhan97 commented 3 years ago

I'll try this out, thank you very much Chris!