ZarnackGroup / racoon_clip

racoon_clip processes your iCLIP and eCLIP data from raw files to single-nucleotide crosslinks in a single step.
https://racoon-clip.readthedocs.io/en/latest/index.html
1 stars 0 forks source link

testing with encode project data #3

Closed fulaibaowang closed 8 months ago

fulaibaowang commented 9 months ago

Hi,

I have a question here.

I am running racoon with some public encode eclip data and comparing the outputs from racoon to those on the encode. Encode project provides 1) bed files of the alignments of the mapped reads. What I was doing is to check 2) the bed files by racoon that give the cross linking site , and see if how much 1) and 2) are overlapped, or to be more specific, if 2) is in the region of 1) using bedtools intersect

I see over 90% the linking sites from racoon are not in the region of 1). Can you speculate on in which step can cause the difference in racoon and encode pipeline?

-- I am running this dataset

my yaml file looks like this:

wdir: testing_real_data/output
infiles: testing_real_data/raw_data/ENCFF149LRA.fastq testing_real_data/raw_data/ENCFF669MUY.fastq
samples: ENCFF149LRA ENCFF669MUY
experiment_type: eCLIP_ENCODE_10ntUMI
quality_filter_barcodes: False
adapter_cycles: 2
gtf: annotation/gencode.v44.annotation.gtf
genome_fasta: annotation/GRCh38.p14.genome.fa
read_length: 45

Many thanks, y

MelinaKlostermann commented 9 months ago

Hi, I think there are two main reasons for this:

  1. The main difference of racoon_clip to encode is that it provides the crosslinked nucleotide, which is 1nt before the read start (the reverse transcription stops before the crosslink) and will therefore not be included in the bed files of the reads provided by ENCODE. However, the crosslinked nucleotide is the closest you can get to the real binding position (for the rational behind please check out Data Science Issues in Studying Protein–RNA Interactions with CLIP Technologies from Chakrabarti et al https://doi.org/10.1146/annurev-biodatasci-080917-013525 and iCLIP data analysis: A complete pipeline from sequencing reads to RBP binding sites from Busch et al doi: 10.1016/j.ymeth.2019.11.008)
  2. To preserve the 5' end of the read (and the crosslink position next to it) racoon_clip allows soft clipping during alignment only at the 3' end. However, soft clipping at both ends is allowed by ENCODE. Therefore, the ENCODE aligned read might actually be even further away from the crosslink just than 1 nt upstream.

In principle, it is a nice idea to test this though. Maybe you could use the aligned reads in the bam files of racoon_clip instead of the crosslink files in bed format?

fulaibaowang commented 9 months ago

Thanks a lot for the explanation! I will read more and do one more test round.

fulaibaowang commented 9 months ago

Hi again,

I think the difference is actually because encode is doing peak calling with CLIPper and racoon does not doing any peak calling. May I ask why it is skipped?

fulaibaowang commented 9 months ago

My above question might be a bit naive because of course different approach use different straigies and racoon gives all the cross-linking site. I did a another test, when I did other way around, meaning checking if the sites in bed file from encode output are overlapped in cross-linking site (-10 to +10) region from racoon. It shows that almost 100% are overlapped. Thinking of that, I am wondering if there could be many false positives in racoon output.

MelinaKlostermann commented 8 months ago

Hi, so to "I think the difference is actually because encode is doing peak calling with CLIPper and racoon does not doing any peak calling. May I ask why it is skipped?"

So fare racoon_clip only includes the preprocessing. I would recommend doing peakcalling with pureclip PureCLIP (https://pureclip.readthedocs.io/en/latest/) and then binding site definition with the BindingSiteFInder from Bioconductor (https://bioconductor.org/packages/release/bioc/html/BindingSiteFinder.html) after racoon_clip preprocessing. I am planning to include these steps into racoon_clip, but focused first on obtaining the single nucleotide crosslinks. The racoon_clip+PureCLIP+BindingSiteFinder combination is what works best for our group so far, of course, there might be other options and at some point a benchmark might be needed :)

MelinaKlostermann commented 8 months ago

"I did a another test, when I did other way around, meaning checking if the sites in bed file from encode output are overlapped in cross-linking site (-10 to +10) region from racoon. It shows that almost 100% are overlapped. Thinking of that, I am wondering if there could be many false positives in racoon output."

I think its partially the same answer not every racoon_clip crosslink belongs to a binding site, there are some random background crosslinks. Therefore, I would always do peakcalling after racoon_clip. racoon_clip is only a more advanced preprocessing pipeline and not a peakcaller (yet :) (If you are interested in an example use case of the racoon_clip + PureCLIP + BindnigSiteFinder combination here is our current manuscript, where we used it for miR-eCLIP data: https://www.biorxiv.org/content/10.1101/2023.09.08.556730v1)

fulaibaowang commented 8 months ago

Many thanks!