Closed fulaibaowang closed 8 months ago
Hi, I think there are two main reasons for this:
Data Science Issues in Studying Protein–RNA Interactions with CLIP Technologies
from Chakrabarti et al https://doi.org/10.1146/annurev-biodatasci-080917-013525 and iCLIP data analysis: A complete pipeline from sequencing reads to RBP binding sites from Busch et al doi: 10.1016/j.ymeth.2019.11.008)In principle, it is a nice idea to test this though. Maybe you could use the aligned reads in the bam files of racoon_clip instead of the crosslink files in bed format?
Thanks a lot for the explanation! I will read more and do one more test round.
Hi again,
I think the difference is actually because encode is doing peak calling with CLIPper and racoon does not doing any peak calling. May I ask why it is skipped?
My above question might be a bit naive because of course different approach use different straigies and racoon gives all the cross-linking site. I did a another test, when I did other way around, meaning checking if the sites in bed file from encode output are overlapped in cross-linking site (-10 to +10) region from racoon. It shows that almost 100% are overlapped. Thinking of that, I am wondering if there could be many false positives in racoon output.
Hi, so to "I think the difference is actually because encode is doing peak calling with CLIPper and racoon does not doing any peak calling. May I ask why it is skipped?"
So fare racoon_clip only includes the preprocessing. I would recommend doing peakcalling with pureclip PureCLIP (https://pureclip.readthedocs.io/en/latest/) and then binding site definition with the BindingSiteFInder from Bioconductor (https://bioconductor.org/packages/release/bioc/html/BindingSiteFinder.html) after racoon_clip preprocessing. I am planning to include these steps into racoon_clip, but focused first on obtaining the single nucleotide crosslinks. The racoon_clip+PureCLIP+BindingSiteFinder combination is what works best for our group so far, of course, there might be other options and at some point a benchmark might be needed :)
"I did a another test, when I did other way around, meaning checking if the sites in bed file from encode output are overlapped in cross-linking site (-10 to +10) region from racoon. It shows that almost 100% are overlapped. Thinking of that, I am wondering if there could be many false positives in racoon output."
I think its partially the same answer not every racoon_clip crosslink belongs to a binding site, there are some random background crosslinks. Therefore, I would always do peakcalling after racoon_clip. racoon_clip is only a more advanced preprocessing pipeline and not a peakcaller (yet :) (If you are interested in an example use case of the racoon_clip + PureCLIP + BindnigSiteFinder combination here is our current manuscript, where we used it for miR-eCLIP data: https://www.biorxiv.org/content/10.1101/2023.09.08.556730v1)
Many thanks!
Hi,
I have a question here.
I am running racoon with some public encode eclip data and comparing the outputs from racoon to those on the encode. Encode project provides 1) bed files of the alignments of the mapped reads. What I was doing is to check 2) the bed files by racoon that give the cross linking site , and see if how much 1) and 2) are overlapped, or to be more specific, if 2) is in the region of 1) using
bedtools intersect
I see over 90% the linking sites from racoon are not in the region of 1). Can you speculate on in which step can cause the difference in racoon and encode pipeline?
-- I am running this dataset
my yaml file looks like this:
Many thanks, y