cruk-mi / mesa

mesa package for Methylation Enrichment Sequencing Analysis

Updating vignettes and adding examples #14

Open SPPearce opened 6 months ago

SPPearce commented 6 months ago

Added examples to virtually all functions in the codespace, and wrote several vignettes. This has required a few changes to some functions as I spotted potential issues or parameters to expose. I ended up making four functions internal only, because they need more significant work:

I have also removed the currently internal functions getAnnotationDataFrame and getAnnotationDataFrameIndividual, as they are superseded by getAnnotation and the shift to tidy evaluation via sampleAnnotation in the plotting functions.

lbeltrame commented 4 months ago

Gave this a spin with R 4.3.3 and Bioconductor 3.18, and it looks like it breaks somewhere (EDIT: it looks like some upstream API has changed... sigh):

r$> devtools::install_github("cruk-mi/mesa", ref = "vignettes", build_vignettes = TRUE, dependencies = FALSE)
Downloading GitHub repo cruk-mi/mesa@vignettes
── R CMD build ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
✔  checking for file ‘/tmp/RtmpapzMh0/remotes88510e9514c/cruk-mi-mesa-38fc261/DESCRIPTION’ ... OK
─  preparing ‘mesa’:
✔  checking DESCRIPTION meta-information ... OK
─  installing the package to build vignettes
E  creating vignettes (3m 16.8s)
   --- re-building ‘data-and-qc.Rmd’ using rmarkdown
   --- finished re-building ‘data-and-qc.Rmd’

   --- re-building ‘differentially-methylated-regions.Rmd’ using rmarkdown
   --- finished re-building ‘differentially-methylated-regions.Rmd’

   --- re-building ‘generation.Rmd’ using rmarkdown

   Quitting from lines 69-76 [unnamed-chunk-3] (generation.Rmd)
   Error: processing vignette 'generation.Rmd' failed with diagnostics:
   wrong args for environment subassignment
   --- failed re-building ‘generation.Rmd’

   --- re-building ‘introduction.Rmd’ using rmarkdown
   --- finished re-building ‘introduction.Rmd’

   --- re-building ‘pca.Rmd’ using rmarkdown
   --- finished re-building ‘pca.Rmd’

   SUMMARY: processing the following file failed:
     ‘generation.Rmd’

   Error: Vignette re-building failed.
   Execution halted

I'm not very familiar with Rmd so I'm not sure where to debug this.

SPPearce commented 4 months ago

@lbeltrame, I've made a change to that vignette, can you try again please?

lbeltrame commented 4 months ago

Works, thanks.

SPPearce commented 4 months ago

Great. Please give any feedback on anything that isn't clear or that you'd like expanded, or on any typos or formatting issues.

SPPearce commented 4 months ago

@lbeltrame, I have made some additional comments on the bits that you asked questions about. Have you been using the package? Do you have any further questions?

lbeltrame commented 4 months ago

I started to. The docs were pretty good to test with the examples provided. I'm waiting for some actual data to come from the lab to do some real-world tests but that will take a bit more time.

lbeltrame commented 2 months ago

Hello, Simon. At last the wet side has generated some sequences, so I've been reading the docs when testing the samples we have. A couple of reviews will follow.

lbeltrame commented 2 months ago

As an addendum, I do wonder if it might make sense to provide a list of general recommendations for BAM files to be used with mesa:

  • Minimum recommended number of reads
  • Number of acceptable duplicates (I've seen wildly different numbers in the supplementary information)
  • Any other useful information

This is to make sure someone knows whether it's "worth" attempting an analysis or it's better to resequence / drop the sample.

SPPearce commented 2 months ago

As an addendum, I do wonder if it might make sense to provide a list of general recommendations for BAM files to be used with mesa:

  • Minimum recommended number of reads
  • Number of acceptable duplicates (I've seen wildly different numbers in the supplementary information)
  • Any other useful information

This is to make sure someone knows whether it's "worth" attempting an analysis or it's better to resequence / drop the sample.

The minimum number of reads depends on the enrichment, which in turn depends on the method. Hence we used the hyperStableFraction as a metric, which combines both. It also depends on what you want to do with the sample; you can apply classifiers that have been designed to work at 1M reads down to that depth. But for discovery of DMRs and similar uses like PCA, I think you want at least 3 million reads at a relH of 4, or 5 million reads at a relH of 3.
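These rules of thumb can be sketched as a simple filter over a per-sample QC table. Note the table, its values, and the column names (valid_fragments, relH) here are illustrative assumptions for this sketch, not a fixed mesa output format:

```r
# Hypothetical per-sample QC table; values are made up for illustration.
qc <- data.frame(
  sample          = c("A", "B", "C"),
  valid_fragments = c(6.2e6, 3.5e6, 2.1e6),  # deduplicated usable fragments
  relH            = c(2.8, 4.3, 4.9)         # CpG enrichment score
)

# At least 3M fragments when relH >= 4, or at least 5M when relH >= 3
qc$usable <- (qc$relH >= 4 & qc$valid_fragments >= 3e6) |
  (qc$relH >= 3 & qc$valid_fragments >= 5e6)

qc$usable  # A fails on enrichment, B passes, C fails on depth
```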

The number of acceptable duplicates is not really possible to define, because if you sequence the same sample to higher depth you will get a higher percentage of duplicates. It is essential to deduplicate though, else you get peaks from just piles of duplicates (and I have seen MEDIP-seq papers that have done that...)

SPPearce commented 2 months ago

I'm really hoping that if I bug @Steven-M-Hill enough he might actually approve this PR...

lbeltrame commented 2 months ago

The minimum number of reads depends on the enrichment, which in turn depends on the method. Hence we used the hyperStableFraction as a metric, which combines both. It also depends on what you want to do with the sample; you can apply classifiers that have been designed to work at 1M reads down to that depth. But for discovery of DMRs and similar uses like PCA, I think you want at least 3 million reads at a relH of 4, or 5 million reads at a relH of 3.

My current tests were done with a 120M read sample (it wasn't meant to be this deep, but other samples in the run underperformed) and a 32M read sample. I used the Nature Cancer paper as a reference (so aiming at 30-50M). Although this was a test run, the actual samples we'll be using this on will be "tough", so I opted for a higher depth. Just to make sure: by "million reads", do you mean the raw reads or the usable fragments that mesa reports?

And since we were talking about RAM usage elsewhere: for a 120M read sample, without other samples in parallel, I measured 79.8G RAM used (I used our compute cluster's accounting report, so fairly accurate).

The number of acceptable duplicates is not really possible to define, because if you sequence the same sample to higher depth you will get a higher percentage of duplicates. It is essential to deduplicate though, else you get peaks from just piles of duplicates (and I have seen MEDIP-seq papers that have done that...)

Oh, rest assured I did deduplicate. ;) I asked because the numbers were fairly high (~60%).

And to keep in scope with the discussion on the PR: I think the vignette doesn't mention the "presets" for makeQSet, that is fragmentType (I found out what it did by looking at the code), and I wonder whether fragmentType and its presets are also used for the size selection, like the min* and max* parameters are.

I spotted also something else which I'll point out shortly.

SPPearce commented 1 month ago

My current tests were done with a 120M read sample (it wasn't meant to be this deep, but other samples in the run underperformed) and a 32M read sample. I used the Nature Cancer paper as a reference (so aiming at 30-50M). Although this was a test run, the actual samples we'll be using this on will be "tough", so I opted for a higher depth. Just to make sure: by "million reads", do you mean the raw reads or the usable fragments that mesa reports?

I do mean valid_fragments, yes.

And since we were talking about RAM usage elsewhere: for a 120M read sample, without other samples in parallel, I measured 79.8G RAM used (I used our compute cluster's accounting report, so fairly accurate).

Eww, that is pretty high (not entirely surprising, I really didn't optimise that bit of the code). Do you have metrics for the lower depths?

The number of acceptable duplicates is not really possible to define, because if you sequence the same sample to higher depth you will get a higher percentage of duplicates. It is essential to deduplicate though, else you get peaks from just piles of duplicates (and I have seen MEDIP-seq papers that have done that...)

Oh, rest assured I did deduplicate. ;) I asked because the numbers were fairly high (~60%).

Yes, if you have a limited amount of starting material you expect to see many duplicates if you sequence deeply enough (in theory all the way to every unique molecule of DNA that you have captured), so on its own I don't think that is an issue. You can always downsample the fastq or bam files to see what effect it has; that would let you see the effect on the hyperStable metric if you are using it.
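The point that duplicate rate rises with depth can be illustrated with a toy saturation model: drawing N reads with replacement from a library of C unique molecules yields an expected duplicate fraction of 1 - C(1 - (1 - 1/C)^N)/N. The complexity value below is an arbitrary assumption chosen only to show the trend, not to model any real library:

```r
# Expected duplicate fraction when drawing `reads` fragments uniformly at
# random (with replacement) from a library of `complexity` unique molecules.
dup_rate <- function(reads, complexity) {
  expected_unique <- complexity * (1 - (1 - 1 / complexity)^reads)
  1 - expected_unique / reads
}

# The same library, sequenced deeper, shows a steadily rising duplicate rate.
depths <- c(10e6, 30e6, 60e6, 120e6)
round(dup_rate(depths, complexity = 50e6), 2)  # ~0.09 0.25 0.42 0.62
```

Under this simplified model, a high duplicate percentage at high depth says little by itself; it is the expected behaviour of a finite library, which is why a fixed "acceptable duplicates" cutoff is hard to define.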

And to keep in scope with the discussion on the PR: I think the vignette doesn't mention the "presets" for makeQSet, that is fragmentType (I found out what it did by looking at the code), and I wonder whether fragmentType and its presets are also used for the size selection, like the min* and max* parameters are.

I have added a description of fragmentType, along with a confirmation that this doesn't affect the coverage calculation.

lbeltrame commented 1 month ago

The HMMcopy parameters (GC and mappability) are unclear, in particular if one does not use the NCBI variant of BSgenome (the code seemingly hardcodes it). There should at least be a short note, if possible, informing users where to obtain the required data.

SPPearce commented 1 month ago

The HMMcopy parameters (GC and mappability) are unclear, in particular if one does not use the NCBI variant of BSgenome (the code seemingly hardcodes it). There should at least be a short note, if possible, informing users where to obtain the required data.

Agreed, the data files for hg38 are included but nothing else. I've opened an issue.

SPPearce commented 3 weeks ago

The HMMcopy parameters (GC and mappability) are unclear, in particular if one does not use the NCBI variant of BSgenome (the code seemingly hardcodes it). There should at least be a short note, if possible, informing users where to obtain the required data.

@lbeltrame , I added another paragraph to describe these in the vignette, does it make sense? Admittedly I haven't checked how it looks when rendered.

lbeltrame commented 3 weeks ago

Looks fine to me, it's very clear.

On that topic, @Steven-M-Hill, what's left for this to get merged? I'm keeping a hodgepodge of patches on top of the latest release, and having this merged would simplify things a lot.

SPPearce commented 3 weeks ago

Looks fine to me, it's very clear.

On that topic, @Steven-M-Hill, what's left for this to get merged? I'm keeping a hodgepodge of patches on top of the latest release, and having this merged would simplify things a lot.

Yeah, it'd be good to get these various PRs in. Unfortunately I can no longer go over to Steven's desk and hassle him in person 🙁

lbeltrame commented 1 week ago

Noticed something else: the code refers to qseaInput as a CNV method, but this is not mentioned anywhere in the documentation, API docs, or vignettes.