ChristofferFlensburg / superFreq

Analysis pipeline for cancer sequencing data
MIT License

WriteXLS error in large cohort analysis #109

Closed leap-ahead225 closed 1 year ago

leap-ahead225 commented 1 year ago

Hello, thank you for making this amazing software.

I encountered this error during an 81-sample cohort analysis.

How can I fix this? I'm using R 4.3.0 and superFreq 1.4.5.

    Loading /home/ngs/superFreq_default/MyAnalysis/plots/cohort/data/R/allVariants.Rdata...done.
    Saving project variants to /home/ngs/superFreq_default/MyAnalysis/plots/cohort/data/cohortAnalysis/R/projects/myProject/variants.Rdata .
    Loading /home/ngs/superFreq_default/MyAnalysis/plots/cohort/data/R/clusters.Rdata...done.
    Calculating gain..loss..amplification..complete loss..CNN LOH..somatic SNVs.....mutated genes.....
    Saving meanCNV...done.
    Plotting mean CNV...plotting CNVs..plotting mutations..done.
    Setting up output...done.
    printing to csv...printing to xls...done.
    mutation matrix...done.
    Printing somatic variants to /home/ngs/superFreq_default/MyAnalysis/plots/cohort/data/cohortAnalysis/plots/projects/myProject/somaticVariants.xls. .....
    more VariantAnnotation info..writing to xls...
    Error in WriteXLS("XLsomatics", outfile) :
      One or more of the data frames named in 'x' exceeds 65,535 rows or 256 columns

ChristofferFlensburg commented 1 year ago

Oh, that must be too many columns then! There is a check for the 65k row limit, and the output is truncated if needed, but there is no check for too many columns...

I'm honestly not sure what I want it to do at this point. The file is intended as a human-readable variant list, and for that an .xls with samples as tabs is convenient, and usually worth the hassle of dealing with the limitations of excel formats. In this case though, I don't see a human flipping through variant lists for more than 256 samples? But maybe people would use it as a look-up for interesting variants, and then it's important that all columns are there... But it can't be more than 256. Hmm.

I think the solution is to also output variants to a single .csv, with sample as a column. I used to do that, and there is in fact a commented-out part in the code...

    #this outputs all variants (beyond 65k) to a single .csv.
    #purpose was to simplify downstream analysis without having to parse
    #a multi-tab .xls that may not even have all variants.
    #however, this file is rarely used afaik
    #and for some reason it often causes out-of-memory crashes.
    #hence removed. Use .vcfs below for automated downstream analysis, or Rdata.
    #if ( !onlyForVEP ) catLog('Writing to .csv...')
    #for ( sample in names(somatics) ) {
    #  somatics[[sample]]$sample = rep(sample, nrow(somatics[[sample]]))
    #}
    #allSomatics = do.call(rbind, somatics)
    #if ( !onlyForVEP ) write.csv(allSomatics, gsub('.xls$', '.csv', outfile))
    #if ( !onlyForVEP ) catLog('done!\n')

The purpose here would be a bit different: human-readable output for data with more than 256 samples... I'm a bit worried about the out-of-memory issues my past self noted, though. Maybe the easier solution is to just truncate the columns; people will have to look elsewhere for variant information, but at least the run will go through.
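For illustration, the truncation idea might look something like this. This is a minimal sketch, not actual superFreq code: `truncateForXLS` is a hypothetical helper, and `somatics`/`outfile` are assumed to be the objects from the commented-out snippet above.

```r
#Sketch only: cap each per-sample data frame at the hard .xls format
#limits before handing it to WriteXLS.
#truncateForXLS is a hypothetical helper; somatics and outfile are
#assumed from the snippet above.
XLS_MAX_ROWS = 65535
XLS_MAX_COLS = 256
truncateForXLS = function(df) {
  df[seq_len(min(nrow(df), XLS_MAX_ROWS)),
     seq_len(min(ncol(df), XLS_MAX_COLS)),
     drop=FALSE]
}
#somatics = lapply(somatics, truncateForXLS)
#WriteXLS::WriteXLS('somatics', outfile)
```

The run would then go through, at the cost of silently dropping rows and columns past the limits, so a warning in the log would probably be in order.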

Idk, what do you think, have you looked at this file before?

leap-ahead225 commented 1 year ago

Thank you for your prompt reply. I have not seen this file yet. As for the variants, I am looking at the VCF output. Would it be difficult to solve this problem by outputting XLSX instead of XLS?

ChristofferFlensburg commented 1 year ago

Hmm, yeah, xlsx would extend the 256 limit, but there would still be a limit. Unlikely to ever be reached, though. I'd need a different package to write to xlsx, but that seems doable...
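As a sketch of what that could look like: openxlsx is one candidate package (an assumption, not what superFreq currently uses), and the .xlsx format raises the per-sheet limits to 1,048,576 rows and 16,384 columns. `somatics` and `outfile` are again assumed from the earlier snippet.

```r
#Sketch with openxlsx (assumed alternative package, not current superFreq code).
#.xlsx allows 1,048,576 rows and 16,384 columns per sheet.
library(openxlsx)
wb = createWorkbook()
for ( sample in names(somatics) ) {
  addWorksheet(wb, sample)  #note: xlsx sheet names are capped at 31 characters
  writeData(wb, sample, somatics[[sample]])
}
saveWorkbook(wb, sub('\\.xls$', '.xlsx', outfile), overwrite=TRUE)
```

One remaining caveat is the 31-character cap on sheet names, so long sample names would need shortening.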

Although looking back again I see you said 81 samples, so I'm actually not sure why it'd try to output 256 columns. Could you have a look at the output before the crash, for example the mutation matrix or meanCNV.pdf, and confirm that they have 81 samples (or at least fewer than 256)? They should be in /home/ngs/superFreq_default/MyAnalysis/plots/cohort/data/cohortAnalysis/plots/projects/myProject.

If you just want .vcfs of the somatic variants, then they are output from the default superFreq run without doing a cohort run (in plots/myIndividual/somatics/mySample.vcf). The default superFreq run outputs all the somatic variants (somaticP > 0.5), while the cohort run only outputs somatic coding variants I believe, i.e. the variants you see in the mutation matrix. So depending on exactly what you want to do with the VCFs, you may not need to run the cohort run at all.

leap-ahead225 commented 1 year ago

Thanks for the quick response. I looked at the sample count: excluding Normal there were 96 samples. I am not sure why the count would reach 256, but if I include Normal, I get 167 samples. The byType and Normals folders were empty; is that any particular problem?

leap-ahead225 commented 1 year ago

I am also interested in the cohortAnalyseBatchContrast for cohort analysis, as it seems to reduce the number of samples and avoid errors if the analysis is done per subgroup. How can I specify the subgroup in the metadata?

ChristofferFlensburg commented 1 year ago

Hi, so a few points here.

1) As you have fewer than 256 samples, it seems like there is another issue somewhere upstream of the crash. This problem might be hard for me to track down as I don't know where it happens, and it's not really feasible for you to send your probably access-restricted data for me to reproduce. But if you could look through all the plots and output in that directory and check whether they have the expected number of samples, that would help me narrow down where the issue is, and I'll at least be able to have a look.

2) More practically for you: this crash is at the end of the run, so I think most of the output is already done. We might not have to fix this for you to get what you need. So if you let me know what you're looking for, maybe I can help with that.

3) There is technically support for groups and contrasts, but I haven't used that in many years. It uses a column PROJECT.SUBGROUP in the metaData.tsv. So if you're for example doing a drug response study in AML, you'd set the PROJECT column to AML, and the PROJECT.SUBGROUP column to, for example, responding or resistant. You'd then set project to AML in cohortAnalyseBatchContrast, and subgroups1 and subgroups2 to responding and resistant. But no guarantees on that function, and if it doesn't work, I won't be too keen to spend time on it; I'd recommend running each subgroup as a separate project instead.
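Based on that description, the relevant part of metaData.tsv might look like this (the column layout shown is an assumption pieced together from the comment above, with the same caveat that none of this is guaranteed):

```
#metaData.tsv (tab-separated; only the relevant columns shown)
NAME      PROJECT  PROJECT.SUBGROUP
sample1   AML      responding
sample2   AML      resistant
```

and the contrast run would then be called with something like `cohortAnalyseBatchContrast(..., project='AML', subgroups1='responding', subgroups2='resistant')`, with the other arguments as in a normal cohort run.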

leap-ahead225 commented 1 year ago

Thank you for your reply. Regarding the Excel file, I think this problem occurs because three columns are generated per sample (sampleName_var, _VAF, _RD); I saw this when re-running with a small cohort. With 96 samples that is 96 × 3 = 288, which exceeds the 256 limit.

Since the files I needed were the mean CNV and mutations of the cohort, generating meanCNV.pdf was sufficient. Thank you very much.

I will try cohortAnalyseBatchContrast in the way you suggested. If it does not work, I will try manually separating the subgroups. Thank you very much.

ChristofferFlensburg commented 1 year ago

Ahh, 3 tabs per sample explains it! Thank you for rerunning with a smaller cohort! Happy that there might not be other bugs, then.

I added a todo for me to change that part to .xlsx, thanks for the suggestion, but as you are ok for now, I will not rush it. Hope you get good results out!