ChristofferFlensburg / superFreq

Analysis pipeline for cancer sequencing data
MIT License

superFreq forces R and plots directories to be created on the same path as the control samples #84

Closed: emauryg closed this issue 2 years ago

emauryg commented 3 years ago

I'm trying to have superFreq output to an absolute path rather than using the base path where the bam files for the quality control samples are. The line below seems to force creation of the R and plots directories even if the paths for these directories are given in the input following the instructions. Is there a way to change the code to not force this path hierarchy, and actually use the absolute paths given for the R and plots directories?

https://github.com/ChristofferFlensburg/superFreq/blob/89e53cd0dbc128ac55fa4516b919cf5c00363b78/R/analyse.R#L388

ChristofferFlensburg commented 3 years ago

Hey,

I am not quite sure I understand what you are asking for.

The R directory for the reference normals is created together with the reference normals, and is used to store read counts and variant information so that future runs can find and re-use them for faster runs. If you don't want this to be created in the same area as the physical reference normal bam files (maybe it's a read-only area or whatever), you can make a new directory `referenceNormals` anywhere and then link your bam files from `referenceNormals/bam`. The R directory with the cache for future runs will then be created in `referenceNormals/R`.
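For example, something like this (an untested sketch; the paths are placeholders for your own locations):

```r
# Build a writable referenceNormals directory and link in the read-only BAMs
# (and their .bai indexes). Paths are placeholders.
dir.create("/writable/referenceNormals/bam", recursive = TRUE, showWarnings = FALSE)
bams = list.files("/readonly/normals", pattern = "\\.bam(\\.bai)?$", full.names = TRUE)
file.symlink(bams, file.path("/writable/referenceNormals/bam", basename(bams)))
# superFreq will then create its cache in /writable/referenceNormals/R
```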

The `Rdirectory` and `plotDirectory` are specific to the run and should be separate from the reference normal directory. All these paths can be absolute or relative in the superFreq input.
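So a call with everything on absolute paths could look something like this (argument names as I remember them from the README; double-check against `?superFreq` for your installed version):

```r
library(superFreq)

# All paths absolute. Rdirectory and plotDirectory are specific to this run;
# normalDirectory holds the shared reference normals (BAMs in normalDirectory/bam).
# Argument names may differ slightly between versions; check ?superFreq.
superFreq(metaDataFile    = "/abs/path/run1/metaData.tsv",
          normalDirectory = "/abs/path/referenceNormals",
          Rdirectory      = "/abs/path/run1/R",
          plotDirectory   = "/abs/path/run1/plots",
          reference       = "/abs/path/reference/genome.fa",
          genome          = "hg38",
          cpus            = 4)
```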

There might be an X-Y problem here; maybe you can explain what you want to do that you can't? Can you send me the `superFreq()` command you use and what behaviour you expect?

emauryg commented 3 years ago

Oh, that makes sense. In my case I'm trying to run superFreq on Google Cloud through Terra. However, because of localization issues it is difficult to have superFreq point to the Google bucket with the referenceNormals. It seems that the workaround is to localize all of referenceNormals/bam/* and then create a new directory within the running environment. I'm currently testing this scenario. Unfortunately, I think this approach would have to rebuild referenceNormals/R every time, since it is not possible to cache.

ChristofferFlensburg commented 3 years ago

Right, ok. This package is not designed with cloud in mind, but hopefully it can still run. Let me see if I understand. You need to explicitly list the files you want access to in your runs, which prevents you from sharing the cached files, as they are created dynamically during the runs. Is that correct?

If so, maybe you could solve or work around this by doing a pilot run (which I guess is good practice for testing anyway) that will set up the cache with reference normal read counts and whatever variants are in the pilot. You can then localise referenceNormals/R/*.Rdata from the pilot, which should give you most of the benefits of the cache: the read counts won't need to be repeated, and a lot of noise and germline variants are shared across many samples and individuals and will likely already have been checked in the pilot.
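The staging could be as simple as this (untested sketch, placeholder paths):

```r
# Copy the pilot's cached read counts and variants into the run's localized
# referenceNormals directory, so they don't have to be regenerated.
cache = list.files("/pilot/referenceNormals/R", pattern = "\\.Rdata$", full.names = TRUE)
dir.create("/run/referenceNormals/R", recursive = TRUE, showWarnings = FALSE)
file.copy(cache, "/run/referenceNormals/R")
```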

If you submit lots of runs at the same time against the reference normals, they might clog up reading from the reference normals, and the pilot run should mitigate that. If you just have a few patients, this is not a big issue, and doing what you suggest is likely good enough.

Also, if you get it running and set something up that you think others might find useful, it'd be great if you can share somewhere and I can link it from here.

Let me know how it goes.

emauryg commented 3 years ago

Sounds good. Is there a way to input the referenceNormals/R/*.Rdata to the superFreq command? Localizing the normal bams and then creating the folder in the working directory seems to work thus far, but it does increase the overhead compute time and memory, since we have to relocate 10 bam files each time. I'll share the Docker and WDL files to run on Google Cloud along with instructions once I'm done with the run and analysis! I'll reply to this thread with the github repository once it is finished.

ChristofferFlensburg commented 3 years ago

You give the containing referenceNormals directory as input, and superFreq finds the .Rdata itself. It also needs the BAMs to look up any variants in the current sample that are not in the cache. I realise that means a lot of data to move in a cloud setting, for what seems like a minor improvement if many variants are already in the cache.

If you are running on exomes, you can likely get away with fewer than 10 reference normals. 5 is enough in most cases, and if the samples are high quality you can run with 3 and still be ok. We tested performance with lower numbers of reference normals in supplementary figure 11, using 10 normals as truth: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007603. We only did one sample, so N=1, but it should give you an idea of what you are giving up. Runtime decreased quite a lot, and if you also have to add in copying the files, then I guess the difference will be larger.

Bear in mind that if you change the set of reference normal BAMs, then you need to regenerate the counts matrix (R/fCexonNormal or something like that), or it'll get confused and likely crash when the cached matrix doesn't match the bams present. The variants are cached on a by-bam basis, so I think those can be kept even if you subset. The files with variants from the removed samples just won't be used.
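Something like this should do it (a sketch only: the exact filename of the count matrix cache may differ, so adjust the pattern to whatever you actually see in referenceNormals/R):

```r
# Invalidate the cached count matrix after changing the reference normal BAMs,
# so superFreq regenerates it on the next run. The pattern is a guess at the
# filename; check what's actually in the directory before deleting.
stale = list.files("/run/referenceNormals/R", pattern = "^fC.*\\.Rdata$", full.names = TRUE)
file.remove(stale)
# The by-bam variant caches can stay; files from removed samples are ignored.
```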

ChristofferFlensburg commented 3 years ago

And great if you can share a container! There's been increasing demand for that, but I haven't found the time to set it up, and I lack the expertise, so you would be solving a big problem for me!

emauryg commented 3 years ago

For RNA, is there much improvement when using 10 reference normals?

ChristofferFlensburg commented 3 years ago

I have not tested that for RNA-Seq. It should benefit more than exomes as they are more variable, and the cost is less as RNA-Seq BAMs typically are smaller, so I think the balance is more towards a larger number of reference normals. I'd guess that the order-of-magnitude difference in precision and recall is comparable to exomes, but I really don't know for sure, sorry.

You may have to test it if you want to know for sure what it does to your data. For that, it's probably easiest to set up separate reference normal directories, e.g. `referenceNormals_10samples` and `referenceNormals_5samples`, so that the count matrices in the cache don't confuse each other.

emauryg commented 3 years ago

So right now the pipeline is able to run to completion and produce outputs per sample. I was wondering, out of all the outputs that the program spits out, which ones have you used the most? On our end we want to keep a table with the CNV calls so that we can do downstream analyses. I know there is a section in your manual about the outputs, but I was wondering which ones you think are the highest yield to keep, to reduce the storage cost, since I'm trying to scale this up to >1K samples.

ChristofferFlensburg commented 3 years ago

Running to completion, hooray! 🎉

If the final CNAs are all you want, then the CNAsegments .tsv files in plots/data are really all you need. Maybe the CNAbyGene files as well, for "raw" by-gene data, if you think something was missed and you want to go back and look closer for things like focal amplifications over a known oncogene. So the plots/data directory is probably enough for your purposes.

If you want to look at the calls for somatic point mutations, then plots/somaticVariants.csv is probably the best machine-readable file to save. Even if you are running other methods for point mutations, it can be nice to have a second opinion for consensus calling.

There are a fair few visualisation and diagnostic plots as well that are useful if you only have a few patients, but you're not going to look at any by-patient plots from 1k samples, so not worth saving I guess. Probably look at a few as a pilot to make sure they seem ok.

Then all the analysis results are saved in .Rdata format in the R directory as well, but I'm not sure that is something you'll use. There are superFreq methods for cohort analysis that produce figures like this one: Sup_Figure13.pdf (over ~100 and ~400 RNA samples respectively). Those require some of the .Rdata to be in place, I think clusters.Rdata and allVariants.Rdata, so saving those two might be worth it. That analysis also looks at genes hit by point mutations.

But probably you'll do your own cohort level analysis, and then you can probably delete all the .Rdata as well. The code for cohort level analysis might be clunky to run on your output; it kind of assumes everything was run in the same directories, so I'm not sure how that'd work when run in the cloud.
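If it helps, a sketch of harvesting just those compact files per sample before deleting the rest (untested, placeholder paths; the .Rdata names are the ones guessed above):

```r
# Keep only the compact, machine-readable outputs discussed above.
plotDir = "/run/plots"; Rdir = "/run/R"; keepDir = "/archive/sample1"
dir.create(keepDir, recursive = TRUE, showWarnings = FALSE)
file.copy(list.files(file.path(plotDir, "data"), full.names = TRUE), keepDir)  # CNAsegments*.tsv, CNAbyGene*
file.copy(file.path(plotDir, "somaticVariants.csv"), keepDir)                  # somatic point mutations
file.copy(file.path(Rdir, c("clusters.Rdata", "allVariants.Rdata")), keepDir)  # for cohort analysis, if wanted
```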

Seems like an exciting data set, good luck! :D

emauryg commented 3 years ago

I see. So, looking at the CNAsegments*.tsv file, I was wondering if you describe anywhere what the columns mean, and if they provide any sort of indication of whether they are gains or losses (i.e. LFC). Also, is there any particular stat that you look at to filter out false positives? Sorry for all the questions. Thank you for developing this software.

ChristofferFlensburg commented 3 years ago

Only described in a closed issue from February. 😬 This should answer your question though. https://github.com/ChristofferFlensburg/superFreq/issues/71

I then concluded that it's relevant information that should go into the manual, but obviously that hasn't happened yet. One day it'll happen.

For false positives, the size of the segment is probably the best indicator. Have a look at panels d and e in Figure 1 of the preprint, showing precision and recall: https://www.biorxiv.org/content/10.1101/2020.05.31.126888v1.full If you only keep segments containing more than 100 genes (corresponding to 12Mbp on average), then you should have quite a low false positive rate, as you see from the precision plot. You risk removing relevant focal events though, so I think a lower cut is better: at 30 genes (3.6Mbp) you still have more than 50% precision, but below 10 genes (1.2Mbp) you have mostly noise. It depends on how you want to balance sensitivity vs accuracy of course, which depends on your research questions.

The genes in the segment are listed in the last column, so you can count them there.
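In R, that filter could look like this (a sketch; it assumes the gene list in the last column is comma-separated, so check the delimiter in your own files):

```r
# Filter CNA segments by the number of genes they contain.
segs   = read.table("CNAsegments_sample1.tsv", header = TRUE, sep = "\t", stringsAsFactors = FALSE)
nGenes = lengths(strsplit(segs[[ncol(segs)]], ","))  # count genes in the last column
segs   = segs[nGenes >= 30, ]  # ~3.6Mbp on average; tune to your precision/recall needs
```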

And I'm excited about your progress on the cloud, and on running on a larger cohort than we've ever done, so go ahead and ask if there are issues. I'm getting a bit curious about what data you will run on, but I don't want to invade your scientific privacy. :P

emauryg commented 3 years ago

So the runs are going relatively smoothly (a bit expensive, but probably the generation of the plots and sub-analyses that superFreq makes by default contributes to that). After going through ~800 samples, I'm seeing calls like "A", "AAABB", "AAB", "CL". Based on the issue you linked, AAB is a gain and CL is a complete loss. Does "A" just mean deletion, and what does AAABB mean? There are also calls with many AAAAA..., so I'm not sure how to interpret those. Any suggestions?

ChristofferFlensburg commented 3 years ago

Hey!

Happy to hear that it's running! Let me know when you have a container or anything to share that I can link. :) Hmm, yeah, maybe it'd make sense for cohort cloud users to have a mode that stops after the CNV calling, if that's all you're after? Not sure how common this use case is, and not sure how much resource it'd actually save. You need the variant calling for the BAFs anyway... 🤔 I'm sure it'd be possible to make a version that is cheaper, but it might require a fair bit of work to optimise. It would probably need stages where the "check reference normals" is a single separate run, and each individual sample would be split up into two runs, before (with the bams) and after (with only .Rdata) the reference normal check. I don't expect to find the time for that anytime soon though. :/

Those calls are counting A and B alleles, so a normal AB is one A allele and one B allele. So, as you say, A is a loss of the B allele, and similarly AAABB is 5 copies (three A alleles and two B alleles), while AAAAA is 5 copies with LOH. The latter two seem pretty exotic (depending on what type of cancer you analyse, I guess), and I'd guess most of them are quite small segments and might be false calls?

They can also be larger calls with less exotic CNAs, like a gain AAB, where the LFC/BAF readouts from the data are slightly off, which can look more like a subclonal AAABB or AAAAA. Look at the maypole plot for that, supplementary figure 3a: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007603. You see that a gain AAB can look like a lower clonality AAABB or AAAAA if the BAF readout is a little too high or too low and the data fits better to one of the neighbouring trajectories. So if it's a larger segment with a lower clonality than other CNAs in the sample, then that might be it. Depending on the application, it might make sense to classify all of these as "gain" based on the LFC readout. Sometimes we group CNAs into complete loss, loss, CNN LOH, gain and amplification when doing cohort analysis, but it depends on the research questions.
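If you want to automate that grouping, a rough sketch (the class boundaries are illustrative, not a superFreq convention, and where you split gain from amplification is up to you):

```r
# Group allele calls like "A", "AAB", "AAABB", "AAAAA", "CL" into broad
# CNA classes by counting A and B alleles.
classifyCall = function(call) {
  core = gsub("\\?", "", call)            # drop any trailing '?' characters
  nA = nchar(gsub("[^A]", "", core))
  nB = nchar(gsub("[^B]", "", core))
  cn = nA + nB
  class =
    if (core == "CL")            "complete loss"
    else if (cn < 2)             "loss"
    else if (cn == 2 && nB == 0) "CNN LOH"
    else if (cn == 2)            "normal"
    else                         "gain or amplification"
  data.frame(call = call, copies = cn, class = class)
}

do.call(rbind, lapply(c("A", "AB", "AAB", "AAABB", "AAAAA", "CL"), classifyCall))
```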

emauryg commented 3 years ago

Thanks! What about `AAAAA?`?

What does the question mark mean?

ChristofferFlensburg commented 3 years ago

The question mark means that while AAAAA is the copy number that best fits the data, it's not a good fit to the data. There is also a double question mark, AAAA??, which means it's an even worse fit to the data. It's measured through a p-value of the LFCs and BAFs of all the genes and variants in the segment, with the called CNA as the null hypothesis.

In RNA, this is typically due to noise, like false SNPs giving rise to an inconsistent BAF readout or extreme differential expression, or it can be systematic things like expression compensation making the LFC readout not quite reflect the DNA copy number.

You can interpret it as a kind of quality score if you want, where one or two question marks indicate more uncertain calls.

emauryg commented 3 years ago

Oh ok, so would it be advisable to remove those calls from the final call set?

ChristofferFlensburg commented 3 years ago

Ultimately it depends on how you want to balance precision vs recall, and on your final research question. But I'd keep them in for most applications, especially if you already filter on size. A large (10Mbp or even chromosome arm sized) segment with a CNA call usually indicates that SOMETHING is going on in the region, even with a "??" after the call. It may or may not be the correct call, but chances are that it's not a normal AB genotype. Looking at the maypole plot shows which calls may be mistaken for each other, especially all the trajectories close to each other on the right side of the plot.

If you want to add more filters, if anything, I'd look at the clonality of the call (i.e. sample cell fraction), where low clonality calls are more likely to be false positives, although I think large false segments are fairly rare even at low clonality. It also depends how much you care about subclonal events. If you want to find what drives the cancer, then subclonal events are typically less relevant. So if you are looking for early drivers, and most samples have, say, 60%+ purity, then you could consider filtering out events below 30%-40% clonality or so. But yeah, it depends on the research question.
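As a sketch (assuming the CNAsegments table has a clonality column, check the column descriptions in issue #71 against your files; the cutoff is illustrative, not a recommendation):

```r
# Optionally filter out low-clonality segments when looking for early drivers.
segs = read.table("CNAsegments_sample1.tsv", header = TRUE, sep = "\t", stringsAsFactors = FALSE)
segs = segs[segs$clonality >= 0.3, ]  # keep events in >= 30% of cells; tune per research question
```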

emauryg commented 3 years ago

Great, thank you for your input. We have a more finalized call set now, and I was wondering if we could set up a meeting via Zoom to discuss some of the calls, to make sure the calls we have are likely to be real.

ChristofferFlensburg commented 3 years ago

Sure, happy to chat and look at your data! Email me at flensburg.c at wehi.edu.au and we can set up a time.

ChristofferFlensburg commented 2 years ago

This is published now, and your container is linked from the readme, so I think I can close this. :)

emauryg commented 2 years ago

Haha yes, glad that we got to work together on this "issue".