calico / basenji

Sequential regulatory activity predictions with deep convolutional neural networks.
Apache License 2.0

cooler file preprocessing for Akita #143

Closed pjlaw closed 1 year ago

pjlaw commented 1 year ago

Hi

I have Micro-C data from some cell lines that I'd like to process in the same way as in the Akita manuscript. Did you do any processing of the cooler files beyond running distiller_nf and matrix balancing (iterative correction)?

From the tutorial it doesn't look like it, but in the Akita manuscript:

"To focus on locus-specific patterns and mitigate the impact of sparse sampling present in even the currently highest-resolution Hi-C maps, we adaptively coarse-grain, normalize for the distance-dependent decrease in contact frequency, take a natural log, clip to (−2,2), linearly interpolate missing bins and convolve with a small 2D Gaussian filter (sigma, 1 and width, 5). The first to third steps use cooltools functions."

I'm guessing you used the adaptive_coarsegrain function in cooltools for the first step, but I'm unsure how the distance-dependent normalisation, interpolation, and convolution were implemented. Were those steps specific to a particular case in the manuscript, or are they generally necessary?

Also, I saw you'd rerun the analysis with the genome split into multiple folds. Do you have any advice or code for implementing this?

Thanks Philip

gfudenberg commented 1 year ago

Hi Philip,

You can see the preprocessing in the akita_data_read.py file, e.g. for the adaptive_coarsegrain: https://github.com/calico/basenji/blob/8d1bfd6df195ffa9b4f644c6f76ca2da02c961b3/bin/akita_data_read.py#L178. akita_data.py should handle the splitting into folds; see options.folds.
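For orientation, here is a rough NumPy-only sketch of the later steps quoted from the manuscript (distance normalization, natural log, clip, fill missing bins, Gaussian smoothing). It is not the code in akita_data_read.py: it skips adaptive coarse-graining, all function names are hypothetical, the expected value is a plain per-diagonal mean, and missing bins are filled by interpolating along each diagonal, which is a simplification chosen to keep the example dependency-free. Treat the linked file as the reference.

```python
import numpy as np

def obs_over_exp(mat):
    """Divide each diagonal by its mean over finite entries (assumed, simple expected)."""
    n = mat.shape[0]
    out = np.full(mat.shape, np.nan)
    for d in range(n):
        diag = np.diagonal(mat, d)
        finite = np.isfinite(diag)
        if not finite.any():
            continue
        mean = diag[finite].mean()
        if mean <= 0:
            continue
        idx = np.arange(n - d)
        out[idx, idx + d] = diag / mean
        out[idx + d, idx] = diag / mean
    return out

def interp_diagonals(mat):
    """Fill NaNs by 1D linear interpolation along each diagonal (a simplification)."""
    n = mat.shape[0]
    out = mat.copy()
    for d in range(n):
        diag = np.diagonal(mat, d).copy()
        finite = np.isfinite(diag)
        if finite.all():
            continue
        idx = np.arange(n - d)
        if finite.any():
            diag[~finite] = np.interp(idx[~finite], idx[finite], diag[finite])
        else:
            diag[:] = 0.0  # nothing to interpolate from on this diagonal
        out[idx, idx + d] = diag
        out[idx + d, idx] = diag
    return out

def gaussian_kernel(sigma=1.0, width=5):
    """Normalized 2D Gaussian kernel (sigma 1, width 5 as in the quoted text)."""
    half = width // 2
    x = np.arange(-half, half + 1)
    k1 = np.exp(-(x ** 2) / (2 * sigma ** 2))
    k2 = np.outer(k1, k1)
    return k2 / k2.sum()

def smooth(mat, kernel):
    """2D convolution with edge padding; the kernel is symmetric, so no flip is needed."""
    half = kernel.shape[0] // 2
    padded = np.pad(mat, half, mode="edge")
    n, m = mat.shape
    out = np.zeros((n, m))
    for i in range(kernel.shape[0]):
        for j in range(kernel.shape[1]):
            out += kernel[i, j] * padded[i:i + n, j:j + m]
    return out

def preprocess(mat, clip=2.0):
    oe = obs_over_exp(mat)
    with np.errstate(divide="ignore", invalid="ignore"):
        log_oe = np.log(oe)                 # zeros become -inf, NaNs stay NaN
    log_oe = np.clip(log_oe, -clip, clip)   # -inf is clipped to -clip
    filled = interp_diagonals(log_oe)
    return smooth(filled, gaussian_kernel())

# Synthetic symmetric "contact map" with distance decay and one masked bin.
rng = np.random.default_rng(0)
n = 32
dist = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
noise = rng.lognormal(sigma=0.3, size=(n, n))
mat = (noise + noise.T) / 2 / (dist + 1)
mat[10, :] = np.nan
mat[:, 10] = np.nan

result = preprocess(mat)
```

The output is a dense, symmetric matrix with values in [-2, 2], which is the kind of target the quoted passage describes.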

Hope that helps! Best, Geoff
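
On the fold question, the general idea of splitting the genome into folds can be sketched as below. This is only an illustration of one possible approach (greedy balancing of contig lengths across folds), not the logic in akita_data.py; the function name, the contig tuples, and the seed are all hypothetical.

```python
import random

def assign_folds(contigs, num_folds, seed=0):
    """Assign (chrom, start, end) contigs to folds, balancing total length.

    Assumed strategy: shuffle for random tie-breaking, sort longest first,
    then always place the next contig in the currently smallest fold.
    """
    rng = random.Random(seed)
    ordered = list(contigs)
    rng.shuffle(ordered)
    ordered.sort(key=lambda c: c[2] - c[1], reverse=True)  # stable sort keeps shuffled ties

    folds = [[] for _ in range(num_folds)]
    sizes = [0] * num_folds
    for contig in ordered:
        i = sizes.index(min(sizes))
        folds[i].append(contig)
        sizes[i] += contig[2] - contig[1]
    return folds

# Hypothetical contigs, for illustration only.
contigs = [
    ("chr1", 0, 5_000_000),
    ("chr1", 6_000_000, 9_000_000),
    ("chr2", 0, 4_000_000),
    ("chr2", 4_500_000, 6_500_000),
    ("chr3", 0, 3_000_000),
    ("chr3", 3_200_000, 4_200_000),
    ("chr4", 0, 2_500_000),
    ("chr4", 3_000_000, 3_500_000),
]
folds = assign_folds(contigs, num_folds=4)
```

For cross-validation, each fold would then take a turn as the held-out test set while the remaining folds are used for training and validation.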