biocore-ntnu / epic2

Ultraperformant reimplementation of SICER
https://doi.org/10.1093/bioinformatics/btz232
MIT License
55 stars 9 forks

Epic2 with no input #44

Open nservant opened 3 years ago

nservant commented 3 years ago

Hi, I tried to run Epic2 without control/input data.
It runs, but the output file is different, and I would like to confirm with you that this is expected, because I did not find any information about it in the docs.

When running Epic2 with an input control, I get:

#Chromosome Start   End PValue  Score   Strand  ChIPCount   InputCount  FDR log2FoldChange

Now, without a control, I get:

#Chromosome Start   End ChIPCount   Score   Strand

In practice, it would be great if the output format could be the same in both cases, even if some columns are set to NA, for instance. Many thanks.

endrebak commented 3 years ago

I guess I am missing docs for the case where there is no background :/

Why do you want to include always NA columns in the files? Perhaps we can find a workaround :)

nservant commented 3 years ago

This is just because I included Epic2 in a Nextflow pipeline that we use to analyze any kind of data (with or without replicates). Therefore, it is easier to manage one type of output file regardless of the options you are using. But that's fine, I just updated my code a bit and it's OK.
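For anyone else wrapping epic2 in a pipeline, a workaround along these lines can be sketched in a few lines of Python. This is only an illustration, not the code used in the pipeline mentioned above: it assumes the epic2 output is tab-separated with the header on the first line, the function name `pad_to_full_layout` is hypothetical, and the data row is invented for the example. The column names are copied from the two headers quoted earlier in this thread.

```python
import csv
import io

# Column layout epic2 writes when a control is supplied
# (copied from the header quoted earlier in this thread).
FULL_COLUMNS = [
    "#Chromosome", "Start", "End", "PValue", "Score", "Strand",
    "ChIPCount", "InputCount", "FDR", "log2FoldChange",
]


def pad_to_full_layout(in_handle, out_handle, missing="NA"):
    """Rewrite an epic2 table so it always has the with-control columns.

    Assumes tab-separated input whose first line is the header.
    Columns absent from the input are filled with `missing`.
    """
    reader = csv.reader(in_handle, delimiter="\t")
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    header = next(reader)
    writer.writerow(FULL_COLUMNS)
    for row in reader:
        if not row:
            continue  # skip blank lines
        record = dict(zip(header, row))
        writer.writerow([record.get(col, missing) for col in FULL_COLUMNS])


# Invented single-row example in the no-control layout.
no_input = (
    "#Chromosome\tStart\tEnd\tChIPCount\tScore\tStrand\n"
    "chr1\t1000\t1999\t42\t118.5\t+\n"
)
out = io.StringIO()
pad_to_full_layout(io.StringIO(no_input), out)
print(out.getvalue())
```

Because the columns are matched by name rather than by position, the same function leaves a with-control file unchanged apart from rewriting the header.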

millerh1 commented 3 years ago

I thought I would drop this here, since the conversation seemed relevant... But I think that EPIC2 may have a very high false positive rate without an input control. Here is a screencap from IGV showing the peaks (18,976) found on an input control by EPIC2. The input control has almost 0 peaks when using MACS2 btw. I was wondering, do you have any suggestions for how to go about setting parameters to control the false positive rate without the input control?

Here is an example (screen cap of IGV), all of Chr3 in hg38. I've included the S9.6 sample (this is DRIP-Seq, a kind of diffuse domain sequencing) -- and the corresponding Input control. In both the S9.6 and Input samples, I did not use a control -- I put both in as the treatment in separate EPIC2 runs.

epic2 -t "bam/MSC_1_S1/MSC_1_S1.hg38.bam" -gn hg38 -o "peaks/MSC_1.epicpeaks.bed"
epic2 -t "bam/MSC_input_S15/MSC_input_S15.hg38.bam" -gn hg38 -o "peaks/MSC_Input.epicpeaks.bed"

Also, the empirical FDR for the input control was reported as ~0.053, and the empirical FDR for the S9.6 sample was reported as ~0.021.

image

millerh1 commented 3 years ago

With the e-value set to 100, I saw 1/2x the number of peaks and an empirical FDR of ~0.01.

I then tried dropping the e-value to 10, which led to 1/4x the number of peaks and an empirical FDR of ~0.002.

I then tried an e-value of 1 (lowest possible), which led to 1/8x the number of peaks and an empirical FDR of ~0.0004.

It seems like dropping the evalue didn't have as much effect on the treatment sample (S9.6). Anyways, here's what these all look like in a region of Chr19 (below). Hope this helps!

image
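A sweep like the one above can be scripted roughly as follows. This is a minimal sketch under stated assumptions: the `-t`, `-gn` and `-o` flags are taken from the commands earlier in this thread, while `--e-value` is assumed to be the long form of the e-value option and should be checked against `epic2 --help`. The helper names `sweep_commands` and `count_islands` and the output naming scheme are hypothetical.

```python
import shlex


def sweep_commands(treatment_bam, genome, out_prefix, e_values):
    """Build one epic2 command line (as an argument list) per e-value.

    `--e-value` is an assumption about the flag name; verify it
    with `epic2 --help` before running anything.
    """
    commands = []
    for e in e_values:
        out_path = f"{out_prefix}.e{e}.bed"  # hypothetical naming scheme
        commands.append([
            "epic2", "-t", treatment_bam, "-gn", genome,
            "--e-value", str(e), "-o", out_path,
        ])
    return commands


def count_islands(lines):
    """Count called regions: non-empty lines that are not '#' headers."""
    return sum(1 for line in lines if line.strip() and not line.startswith("#"))


# Print the commands instead of running them, so they can be inspected first.
for cmd in sweep_commands(
    "bam/MSC_input_S15/MSC_input_S15.hg38.bam", "hg38", "peaks/MSC_Input", [100, 10, 1]
):
    print(shlex.join(cmd))
```

Each argument list can be passed to `subprocess.run(cmd, check=True)` once the flags are confirmed, and `count_islands` can then be applied to each opened output file to compare the number of regions called per e-value.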

endrebak commented 3 years ago

Yes! ChIP-seq data is 95% noise so input should always be used :)

millerh1 commented 3 years ago

Right @endrebak, I think the issue was that epic2 was calling so many peaks from input samples, which are, by definition, not IP'ed.

endrebak commented 3 years ago

I do not have any suggestions really. I think epic2 should be used with input. I just added the option to not use input just to be feature-equivalent with SICER.

millerh1 commented 3 years ago

That makes sense! @endrebak Maybe it would be worthwhile to include a disclaimer about using epic2 without an input control, seeing that it can call so many peaks from pure noise. It took me a lot of experimenting to realize this limitation.