YeoLab / skipper

Skip the peaks and expose RNA-binding in CLIP data

significant difference in read depth between the IP and input data, but some peaks missed #34

Open zhenyu7500 opened 5 days ago

zhenyu7500 commented 5 days ago

Hi, skipper developer,

Thank you for developing skipper!

I have plotted two figures showing the peaks in the region of interest and the read depth of the CLIP data. In the area highlighted in Figure 1, I believe there are additional detectable peaks. Furthermore, the absence of peaks across a large region in Figure 2 has left me quite perplexed.

[Figure 1] [Figure 2]
  1. Do you have any suggestions on how I can adjust the script to better align the results with my expectations?
  2. If I modify the maximum window size for peak detection from 100 to a larger or smaller range, could that help identify the previously missed peaks?
  3. When analyzing consecutive peaks, should I consider merging them?

Thanks for your kind help!

Best regards,

zhenyu

augustboyle commented 5 days ago

Hi Zhenyu,

  1. Do these overlap annotated transcripts? Skipper only calls peaks in annotated transcripts, so if there are additional loci you care about, they would have to be added to the GFF or, later, to the feature file. It's highly likely that those Figure 2 peaks are not annotated.

You can generate the count_table output to find all counts in all windows and confirm how many counts there are in overlapping windows.

Other possibilities include extreme GC content and bias or degenerate nucleotides at those sites, very low counts in the input sample making it difficult to estimate enrichment, and high background/broad coverage in the IP samples such that separating enriched peaks is difficult.

  2. If there are too few counts in the input, increasing the window size could help. If the input counts are overwhelmingly 0 where you have signal in the IP, that would suggest this is the problem. For all the other problems I don't think it would make a difference.
  3. Window annotations outside of introns are usually on the order of 100 nt, so if you want to inspect the annotations, merging is a little tricky or unhelpful. That said, for summarizing what loci are enriched and what their size is, I think merging is helpful.
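
For the merging mentioned in the last point, here is a minimal sketch of collapsing consecutive enriched windows into loci. It is not part of skipper: the function name `merge_windows`, the `max_gap` parameter, and the `(chrom, start, end, strand)` tuple layout are assumptions made for illustration, with coordinates treated as BED-style half-open intervals.

```python
from typing import List, Tuple

# Hypothetical in-memory representation of an enriched window:
# (chrom, start, end, strand), BED-style half-open coordinates.
Window = Tuple[str, int, int, str]


def merge_windows(windows: List[Window], max_gap: int = 0) -> List[Window]:
    """Merge same-chromosome, same-strand windows that overlap, touch,
    or lie within max_gap nucleotides of each other."""
    merged: List[Window] = []
    # Sort so windows sharing a chromosome and strand are adjacent and ordered by start.
    ordered = sorted(windows, key=lambda w: (w[0], w[3], w[1], w[2]))
    for chrom, start, end, strand in ordered:
        if merged:
            m_chrom, m_start, m_end, m_strand = merged[-1]
            if chrom == m_chrom and strand == m_strand and start <= m_end + max_gap:
                # Extend the current locus instead of starting a new one.
                merged[-1] = (m_chrom, m_start, max(m_end, end), m_strand)
                continue
        merged.append((chrom, start, end, strand))
    return merged


# Made-up example windows, not real data.
enriched = [
    ("chr1", 1000, 1100, "+"),
    ("chr1", 1100, 1200, "+"),
    ("chr1", 1200, 1275, "+"),
    ("chr1", 5000, 5100, "+"),
    ("chr1", 1150, 1250, "-"),
]

loci = merge_windows(enriched)
for chrom, start, end, strand in loci:
    # Report each merged locus with its total size in nucleotides.
    print(chrom, start, end, strand, end - start)
```

Merging this way discards the per-window feature annotation, which is why it suits summarizing enriched loci and their sizes rather than inspecting annotations.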

zhenyu7500 commented 3 days ago

Thank you for your help! I have confirmed that the reason for the missing peaks is the lack of corresponding annotations.

I now have the annotation information for the missing intervals, and I would like to ask how to generate the following files: partition.bed, partition.features.tsv, partition.nuc, and accession_type_ranking.txt.

It seems that these files were generated automatically by the software when I first ran it.

Thank you very much for your help!

zhenyu

zhenyu7500 commented 3 days ago

It seems that the files I need (partition.bed, partition.features.tsv, and partition.nuc) can be obtained using the parse_gff.R script with the GTF file and accession_type_ranking.txt as inputs. Can I use the accession_type_ranking.txt file provided on your GitHub without any changes?

What is the meaning of the ranking in the accession_type_ranking.txt file? If I am most interested in lncRNA, do I need to move its order to the first position?

Thanks for your help!

Best regards,

Zhenyu

augustboyle commented 2 days ago

Hello,

Overlapping annotations in the GFF are resolved by assigning a top feature_type to the window. The accession type ranking file is the way the top feature_type is determined. All of the feature_types in the GFF must be present in the accession type ranking file.

If you add a new feature_type to the GFF (or directly to any downstream files) then you will need to place it in the accession type ranking file.

You can reorder the rankings in any way you like. Small RNAs are almost universally more abundant than mRNAs, which are almost universally more abundant than lncRNAs in eCLIP libraries, but of course there are always exceptional loci and RNA-binding proteins with very particular types of signal.
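
To make the ranking logic concrete, here is a rough sketch of the two checks described above: that every feature_type in the GFF appears in the ranking, and how a ranking picks the top feature_type for a window with overlapping annotations. It is not skipper's code. The function names, the example feature_type labels, and the convention that an earlier position means higher priority are all assumptions for illustration; the actual file layout and parsing in `parse_gff.R` may differ.

```python
from typing import Iterable, List, Set


def missing_feature_types(gff_types: Iterable[str], ranking: List[str]) -> Set[str]:
    """Return feature types seen in the GFF that the ranking does not list."""
    return set(gff_types) - set(ranking)


def top_feature_type(overlapping: Iterable[str], ranking: List[str]) -> str:
    """Pick the highest-priority feature type among overlapping annotations.
    Assumption: earlier in the ranking means higher priority."""
    position = {feature_type: i for i, feature_type in enumerate(ranking)}
    return min(overlapping, key=lambda feature_type: position[feature_type])


# Hypothetical ranking and GFF feature types, for illustration only.
ranking = ["miRNA", "snoRNA", "CDS", "UTR3", "lncRNA", "intron"]
gff_types = {"CDS", "lncRNA", "intron", "my_new_type"}

# A newly added feature type must be placed in the ranking before use.
missing = missing_feature_types(gff_types, ranking)
print("missing from ranking:", sorted(missing))

full_ranking = ranking + sorted(missing)

# A window overlapping both a CDS and a lncRNA annotation.
default_top = top_feature_type({"CDS", "lncRNA"}, full_ranking)

# Moving lncRNA to the first position changes which type wins the overlap.
lnc_first = ["lncRNA"] + [ft for ft in full_ranking if ft != "lncRNA"]
lnc_top = top_feature_type({"CDS", "lncRNA"}, lnc_first)

print(default_top, lnc_top)
```

Under these assumptions, reordering only changes the label assigned where annotations overlap; windows covered by a single feature_type keep that type regardless of its rank.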