charles-plessy / CAGEr

Mirror of Bioconductor's CAGEr package repository
https://bioconductor.org/packages/CAGEr

Implement G correction for CAGEexp class #28

Open da-bar opened 4 years ago

charles-plessy commented 4 years ago

Hi Damir, I am exploring the idea of keeping a count of how many extra Gs were removed at each CTSS. Do you by chance have some good test data that the whole CAGEr package could use for examples and regression tests in all the functions related to G correction? (And maybe we could even replace the example in the vignette with one using that data…)
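To make the bookkeeping idea concrete, here is a minimal sketch in plain Python (not CAGEr's R code and not its API). The function name `tally_removed_gs` and the input layout, one `(chrom, strand, pos, n_removed_g)` tuple per read, are assumptions made purely for illustration. It only shows how a per-CTSS tally of removed Gs could be kept, not how the G correction itself is decided.

```python
from collections import Counter

def tally_removed_gs(reads):
    """Tally, for each CTSS, how many reads had 0, 1, 2, ... extra Gs removed.

    `reads` is a hypothetical iterable of (chrom, strand, pos, n_removed_g)
    tuples, where `pos` is the CTSS position after correction.
    Returns a dict mapping (chrom, strand, pos) to a Counter keyed by the
    number of removed Gs.
    """
    counts = {}
    for chrom, strand, pos, n_removed in reads:
        ctss = (chrom, strand, pos)
        # One Counter per CTSS; the key is the number of Gs removed from a read.
        counts.setdefault(ctss, Counter())[n_removed] += 1
    return counts

# Toy input: three reads at one CTSS, one read at another (made-up values).
reads = [
    ("chr1", "+", 100, 0),
    ("chr1", "+", 100, 1),
    ("chr1", "+", 100, 1),
    ("chr1", "-", 250, 2),
]
tally = tally_removed_gs(reads)
```

Summing a CTSS's counter gives its total tag count, so a tally of this shape could sit alongside the usual per-CTSS counts without changing them.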

sga91 commented 4 years ago

What would be good test data? I am currently analyzing the John Lys procap data with CAGEr, and the quality is really good.

charles-plessy commented 4 years ago

Thanks for offering your help! Ideal test data should be:

Bonus points if the dataset has been clearly dedicated to the public domain (CC0 licence for instance) by its laboratory.

Please let me know if I missed something important in the wish list :)

sga91 commented 4 years ago

I feel the procap data from Drosophila S2 cells would be a good choice then. The only issue I see is the number of samples: there are two replicates, but only in control conditions.

charles-plessy commented 4 years ago

@snikumbh, @da-bar, what do you think about this?

da-bar commented 4 years ago

Hi Charles,

sorry I haven't replied sooner. I think keeping track of the number of removed Gs is a great idea. We are currently really looking into the G removal, and how it impacts the downstream analysis, etc.

Regarding the datasets, I can list a couple off the top of my head, and I am positive that if we don't manage to identify an adequate one, I will be able to dig out others.

Regarding the points you made:

We have the existing dataset (extended by some more developmental stages using nAnTi-CAGE) mapped to the newer versions of the zebrafish genome assembly (danRer10 and danRer11). If we think we can continue using zebrafish, I can either subsample the BAM files or use a single chromosome (which is not preferred, as you stated). The benefit of continuing to use zebrafish is that the downstream part of the vignette intuitively works well with zebrafish embryonic development. I am not so familiar with Drosophila or C. elegans samples, which have small genomes (@snikumbh is probably more familiar with those than I am). Also, good candidates that meet all the points on your list are certainly the yeast samples from the SLIC-CAGE paper.

Please let me know what you think, and we can proceed then from there.

Best, Damir

sga91 commented 4 years ago

I forgot that the purpose of the analysis was to include it in the vignette, and in that case I agree that it makes sense to continue with zebrafish, which I'm not familiar with. The yeast samples may be a good idea. There is also a recent paper from Margaret Fuller in which they perform CAGE-seq in the Drosophila male germline, covering two stages, from spermatogonia to spermatocytes, and at least two conditions. Also note that the Drosophila genome is overall more GC-rich than those of yeast and zebrafish.