calico / basenji

Sequential regulatory activity predictions with deep convolutional neural networks.
Apache License 2.0

Basenji Features used for Whole Blood Fine-Mapping #153

Open aga-relation opened 1 year ago

aga-relation commented 1 year ago

Hi,

In your Whole Blood Fine-Mapping paper, it's mentioned that "5313 Basenji features corresponding to functional activity predictors" were used.

Would it be possible to have some more information as to how these features were computed? What (if any) augmentations were used, and how were the sequences processed (how many bins, how were they aggregated)?

Thank you very much!

davek44 commented 1 year ago

Hi, the features are described here: https://github.com/calico/basenji/blob/master/manuscripts/cross2020/targets_human.txt

The score is the alternative allele prediction minus the reference allele prediction, summed across the sequence length. The sequence length with predicted values is 131,072 bp, and the predictions are made in 128 bp bins.
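To make the arithmetic concrete, here is a minimal sketch of that score in NumPy. It is not the code used for the paper: the function name `variant_score` and the `(num_bins, num_targets)` array layout are assumptions for illustration. Taking the numbers above at face value gives 131072 / 128 = 1024 bins; the actual model output may have fewer bins (for example, if the sequence edges are cropped).

```python
import numpy as np

# Numbers quoted in the comment above; the bin count is simple arithmetic
# and may not match the real model output shape.
SEQ_LENGTH = 131072
BIN_SIZE = 128
NUM_BINS = SEQ_LENGTH // BIN_SIZE  # 1024


def variant_score(ref_preds, alt_preds):
    """Sum (alt - ref) over the bin axis, giving one score per target.

    ref_preds, alt_preds: arrays of shape (num_bins, num_targets)
    (hypothetical layout, chosen for this example).
    """
    ref_preds = np.asarray(ref_preds, dtype=np.float64)
    alt_preds = np.asarray(alt_preds, dtype=np.float64)
    if ref_preds.shape != alt_preds.shape:
        raise ValueError("ref and alt predictions must have the same shape")
    # Difference per bin and target, then summed across the sequence length.
    return (alt_preds - ref_preds).sum(axis=0)
```

With 5313 targets, `variant_score` would return a vector of 5313 values per variant, one per feature.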

I don't remember exactly which augmentations were used. I always do reverse complement in addition to the forward strand and take the average. I may or may not have done shifts as well. If it's really important to you, I can try to track this down.
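A rough sketch of the forward/reverse-complement averaging, for readers who want to reproduce it. Everything here is an assumption rather than the Basenji implementation: `predict_fn` is a hypothetical callable that maps a one-hot sequence of shape `(length, 4)` (columns ordered A, C, G, T) to predictions of shape `(num_bins, num_targets)`, and strand-specific targets are not handled.

```python
import numpy as np


def reverse_complement(one_hot):
    """Reverse complement a one-hot sequence with columns ordered A, C, G, T.

    Reversing the rows reverses the sequence; reversing the columns
    swaps A<->T and C<->G.
    """
    return one_hot[::-1, ::-1]


def predict_rc_average(predict_fn, one_hot):
    """Average predictions from the forward and reverse-complement strands.

    predict_fn is a hypothetical model wrapper (see the lead-in).
    """
    fwd = predict_fn(one_hot)
    # Flip the bin axis so reverse-strand bins line up with forward bins.
    rev = predict_fn(reverse_complement(one_hot))[::-1]
    return (fwd + rev) / 2.0
```

Shift augmentation, if it was used, would follow the same pattern: predict on shifted copies of the sequence and average the results.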

aga-relation commented 1 year ago

Thank you very much for your reply; this is very helpful.

One thing that puzzles me: by summing across the entire sequence length, you are capturing activity changes in other genes in the window. How does this allow for testing whether a single, specific gene is affected by a variant? Hypothetically, could another gene in the window be affected instead?

davek44 commented 1 year ago

For the Basenji2 model, variants could only influence predictions at positions within 20 kb, so a variant is pretty unlikely to affect multiple genes. For the Enformer model, I think you're correct that moving to gene-specific scores makes sense. Unfortunately, we don't currently have great scripts to do this.

Also, the EMS annotations are variant-specific but not gene-specific. So if the variant affects multiple genes, summing them together seems like a reasonable thing to do.
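For anyone who wants to experiment with the gene-specific scoring discussed above, one possible approach is to sum the alt-minus-ref difference only over bins that overlap a gene interval. This is purely an illustrative sketch, not an existing Basenji or Enformer script: the function name, the half-open genomic coordinates, and the `(num_bins, num_targets)` layout are all assumptions.

```python
import numpy as np


def gene_window_score(ref_preds, alt_preds, seq_start, gene_start, gene_end,
                      bin_size=128):
    """Sum (alt - ref) over bins overlapping [gene_start, gene_end).

    ref_preds, alt_preds: arrays of shape (num_bins, num_targets).
    seq_start: genomic coordinate of the first predicted bin (assumed).
    """
    ref_preds = np.asarray(ref_preds, dtype=np.float64)
    alt_preds = np.asarray(alt_preds, dtype=np.float64)
    num_bins = ref_preds.shape[0]
    # First bin touching the gene, clipped to the prediction window.
    first = max(0, (gene_start - seq_start) // bin_size)
    # One past the last bin touching the gene (ceiling division), clipped.
    last = min(num_bins, -(-(gene_end - seq_start) // bin_size))
    if first >= last:
        raise ValueError("gene interval does not overlap the predicted bins")
    return (alt_preds[first:last] - ref_preds[first:last]).sum(axis=0)
```

Whether to use the whole gene body, a window around the TSS, or something else is a design choice this sketch does not settle.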