calico / basenji

Sequential regulatory activity predictions with deep convolutional neural networks.
Apache License 2.0

Suggestion for tutorial notebooks (preprocessing) #129

Open a1ultima opened 1 year ago

a1ultima commented 1 year ago

Hi Dave,

Great work.

We are trying to run Basenji on a plant genome (A. thaliana), but the tutorials seem to assume we already have alignments produced with bowtie2.

Could you please explain how to get from downloading the data (DNase-seq, ChIP-seq, CAGE) to alignments?

That is, how would we produce the BAM files required by bam_cov.py?

Many thanks in advance!

davek44 commented 1 year ago

I actually prefer to use BWA or STAR these days. In general, you can align the data in whatever way you think is best. My bam_cov.py script tries to do smart things with multi-mappers, so you'll just want to set the alignment program to output them (usually up to a max of 10-20).

In addition, I've determined that the log fold change track generated by the MACS2 peak caller (which is made available from the ENCODE site) also works fine.

If that doesn't answer your question, reply with more details about the current state of your training data and your research objective, and I'll try to provide more guidance.
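For readers starting from raw reads, here is a minimal sketch of the alignment step with multi-mappers reported, as described above. It is not taken from the Basenji tutorials: it assumes single-end reads, uses bowtie2 with `-k` to report up to a fixed number of alignments per read, and sorts and indexes the result with samtools. The function name and the file names (`tair10`, `dnase_rep1.fastq.gz`) are hypothetical placeholders, and BWA or STAR would need their own equivalent settings.

```python
import shlex


def build_alignment_commands(index_prefix, fastq, out_prefix,
                             max_multi=10, threads=4):
    """Return the shell commands (as argument lists) for one single-end sample.

    max_multi is passed to bowtie2 -k, which reports up to that many
    alignments per read instead of only the best one.
    """
    sam = f"{out_prefix}.sam"
    bam = f"{out_prefix}.bam"
    return [
        # Align, reporting up to max_multi alignments per read.
        ["bowtie2", "-p", str(threads), "-k", str(max_multi),
         "-x", index_prefix, "-U", fastq, "-S", sam],
        # Coordinate-sort the alignments into a BAM file.
        ["samtools", "sort", "-@", str(threads), "-o", bam, sam],
        # Index the sorted BAM.
        ["samtools", "index", bam],
    ]


if __name__ == "__main__":
    # Hypothetical index prefix and read file; substitute your own.
    for cmd in build_alignment_commands("tair10", "dnase_rep1.fastq.gz",
                                        "dnase_rep1"):
        print(shlex.join(cmd))
```

The script only prints the commands so they can be inspected before running; passing each list to `subprocess.run(cmd, check=True)` would execute them. The resulting sorted BAM is the kind of file bam_cov.py takes as input.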

liqingbioinfo commented 1 year ago

Hello Dave,

You mentioned that the "log fold change track generated by the MACS2 peak caller" also works. However, these peak caller files contain very sparse peak values. Can CNNs learn a useful signal from such sparse, genome-wide peaks? I would really appreciate it if you could elaborate a bit more on this point.

Yours sincerely Leah

davek44 commented 1 year ago

The fewer peaks there are in the data, the harder it will be for any ML algorithm to learn to predict it well. But for nearly every dataset I've seen that has reasonable signal:noise ratio, CNNs can learn to predict the peaks, even if they are sparse.
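One rough way to put a number on "sparse" before training is to measure what fraction of the genome the called peaks cover. The sketch below is only an illustration and not part of Basenji: the function name is hypothetical, it assumes peaks are given as half-open `(start, end)` intervals on a single sequence, and it merges overlapping intervals so shared bases are counted once.

```python
def peak_coverage_fraction(peaks, sequence_length):
    """Fraction of a sequence covered by at least one peak.

    peaks: iterable of (start, end) half-open intervals on one sequence.
    """
    covered = 0
    current_end = 0  # right edge of the region counted so far
    for start, end in sorted(peaks):
        # Skip the part of this peak that was already counted.
        start = max(start, current_end)
        if end > start:
            covered += end - start
            current_end = end
    return covered / sequence_length


if __name__ == "__main__":
    # Toy intervals, not real data: two overlapping peaks and one isolated peak.
    toy_peaks = [(0, 10), (5, 15), (100, 110)]
    print(peak_coverage_fraction(toy_peaks, 1000))
```

A very small fraction does not by itself mean the data are unusable; as noted above, the signal-to-noise ratio matters more than the peak count alone.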

liqingbioinfo commented 1 year ago

Morning Dave

I really appreciate your prompt responses! You mentioned a reasonable signal:noise ratio. Do you think the ChIP-seq peaks from the MACS2 peak caller would work? I mean the original called peaks, not the log fold change track generated by MACS2. Many thanks for answering my questions.

Yours sincerely Leah

davek44 commented 1 year ago

Yes, I think the MACS2 peak caller generally works well.