biocore-ntnu / epic2

Ultraperformant reimplementation of SICER
https://doi.org/10.1093/bioinformatics/btz232
MIT License

Keeping a low number of duplicates #76

Open phoebe460 opened 6 months ago

phoebe460 commented 6 months ago

Hi EPIC2 Developers,

First off, thank you for creating a great peak-calling tool. I am planning to use epic2 for my own ChIP-seq analysis. Before I start, I am wondering whether --keep-duplicates can be set to a small integer, for instance 1, rather than just to True, so that the majority of PCR duplicates are removed but a small number of duplicates is still kept at each position.

A similar thing can be done in MACS3 with its --keep-dup flag:

--keep-dup

It controls how MACS3 treats duplicate tags at the exact same location, i.e. the same coordinates and the same strand. You can set it to auto, all, or an integer. The auto option makes MACS3 calculate the maximum number of tags allowed at one location from a binomial distribution with a p-value cutoff of 1e-5; the all option keeps every tag. If an integer is given, at most that many tags are kept at the same location. Default: 1 (keep one tag per location).
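In other words, the behaviour I am after amounts to capping the number of tags that share the same chromosome, coordinate, and strand. A minimal sketch of that idea (my own illustration, not MACS3's actual code):

```python
from collections import defaultdict

def cap_duplicates(tags, max_dup=1):
    """Keep at most `max_dup` tags per identical (chrom, position, strand).

    `tags` is an iterable of (chrom, position, strand) tuples; this is only a
    toy illustration of the --keep-dup semantics described above.
    """
    seen = defaultdict(int)
    kept = []
    for tag in tags:
        if seen[tag] < max_dup:
            kept.append(tag)
            seen[tag] += 1
    return kept
```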

If you could clarify this before I start using the program, that would be greatly appreciated.

Thank you, Phoebe

endrebak commented 6 months ago

This is something I could consider. I think it makes sense. It should not be hard to allow keeping some duplicates, even though it will make the runtime a bit longer.

phoebe460 commented 6 months ago

Hi @endrebak,

Thank you for your reply. If there is any way you could add this kind of option to your tool, that would be awesome. It would definitely help with my analysis, since the ChIP-seq dataset I am currently working with needs some, but not all, duplicates retained.

Keep me posted, Phoebe

endrebak commented 6 months ago

I will not have the time to do this anytime soon. In the meantime, you can cap the duplicates yourself in a preprocessing step and then run epic2 with --keep-duplicates so the duplicates you chose to keep are not removed again.
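Something along these lines should work as a preprocessing step (just a rough sketch with made-up file names, applying a simple per-position cap to a BED file):

```python
#!/usr/bin/env python
"""Keep at most MAX_DUP reads per (chrom, start, end, strand) in a BED file.

Rough sketch, not part of epic2; file names are hypothetical. Afterwards run
epic2 on the filtered file with --keep-duplicates, e.g.:

    epic2 --treatment chip.capped.bed --control input.bed --keep-duplicates

Adjust the key below if you prefer to deduplicate on the 5' position only.
"""
from collections import defaultdict

MAX_DUP = 1  # number of reads to keep per identical position

counts = defaultdict(int)
with open("chip.bed") as infile, open("chip.capped.bed", "w") as outfile:
    for line in infile:
        fields = line.rstrip("\n").split("\t")
        strand = fields[5] if len(fields) > 5 else "."
        key = (fields[0], fields[1], fields[2], strand)  # chrom, start, end, strand
        counts[key] += 1
        if counts[key] <= MAX_DUP:
            outfile.write(line)
```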