Open aga-relation opened 1 year ago
Hi, the features are described here: https://github.com/calico/basenji/blob/master/manuscripts/cross2020/targets_human.txt
The score is the alternative allele prediction minus the reference allele prediction, summed across the sequence length. The sequence length with predicted values is 131,072 bp, and predictions are made in 128 bp bins.
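For readers who want to reproduce this, here is a minimal sketch of the score as described, not the actual Basenji code. It assumes the predictions are already available as NumPy arrays of shape (bins, targets); the function name `sum_diff_score` and the constants are illustrative.

```python
import numpy as np

SEQ_LEN = 131072               # sequence length quoted above
BIN_SIZE = 128                 # prediction bin width quoted above
NUM_BINS = SEQ_LEN // BIN_SIZE # 1024 bins if the whole length is covered


def sum_diff_score(ref_preds, alt_preds):
    """Return one score per target: sum over bins of (alt - ref).

    ref_preds, alt_preds: array-likes of shape (num_bins, num_targets),
    holding model predictions for the reference and alternative alleles.
    """
    ref_preds = np.asarray(ref_preds, dtype=np.float64)
    alt_preds = np.asarray(alt_preds, dtype=np.float64)
    if ref_preds.shape != alt_preds.shape:
        raise ValueError("ref and alt predictions must have the same shape")
    # Subtract per bin and per target, then collapse the bin axis.
    return (alt_preds - ref_preds).sum(axis=0)
```

With 5313 targets, this would yield a vector of 5313 scores per variant.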
I don't remember exactly which augmentations were used. I always do reverse complement in addition to the forward strand and take the average. I may or may not have done shifts as well. If it's really important to you, I can try to track this down.
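The forward/reverse-complement averaging can be sketched as follows. This is an illustration under stated assumptions, not the code that was actually run: `predict_fn` is a hypothetical stand-in for the model, the one-hot columns are assumed to be ordered A, C, G, T, and the targets are assumed to be unstranded (stranded targets would also need their strand pairs swapped).

```python
import numpy as np


def reverse_complement_1hot(seq_1hot):
    """Reverse-complement a (length, 4) one-hot sequence with columns A, C, G, T.

    Reversing the rows reverses the sequence; reversing the columns
    swaps A<->T and C<->G.
    """
    return seq_1hot[::-1, ::-1]


def predict_rc_average(predict_fn, seq_1hot):
    """Average forward and reverse-complement predictions.

    predict_fn: hypothetical callable mapping a (length, 4) one-hot array
    to predictions of shape (num_bins, num_targets).
    """
    fwd = predict_fn(seq_1hot)
    rev = predict_fn(reverse_complement_1hot(seq_1hot))
    # Flip the bin axis so reverse-strand bins line up with forward coordinates.
    return (fwd + rev[::-1]) / 2.0
```

Shift augmentations, if used, would follow the same pattern: predict on shifted inputs and average the aligned outputs.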
Thank you very much for your reply; this is very helpful.
One thing that puzzles me: by summing across the entire sequence length, you are capturing activity changes in other genes in the window. How does this allow for testing whether a single, specific gene is affected by a variant? Hypothetically, could another gene in the window be affected instead?
For the Basenji2 model, variants could only influence predictions at positions within 20 kb, so a variant is pretty unlikely to affect multiple genes. For the Enformer model, I think you're correct that moving to gene-specific scores makes sense. Unfortunately, we don't currently have great scripts to do this.
Also, the EMS annotations are variant-specific but not gene-specific. So if the variant affects multiple genes, summing them together seems like a reasonable thing to do.
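One possible way to approximate the gene-specific score discussed above is to restrict the sum to bins overlapping a gene interval. This is only a sketch, not an existing Basenji or Enformer script: the function name `gene_window_score` and its arguments are hypothetical, and it assumes the prediction bins tile the sequence starting at `seq_start` with no cropping.

```python
import numpy as np


def gene_window_score(ref_preds, alt_preds, seq_start, gene_start, gene_end,
                      bin_size=128):
    """Sum (alt - ref) over bins overlapping [gene_start, gene_end).

    ref_preds, alt_preds: arrays of shape (num_bins, num_targets).
    seq_start: genomic coordinate of the first predicted base (assumed).
    gene_start, gene_end: half-open genomic interval for the gene.
    """
    ref_preds = np.asarray(ref_preds, dtype=np.float64)
    alt_preds = np.asarray(alt_preds, dtype=np.float64)
    num_bins = ref_preds.shape[0]

    # Convert genomic coordinates to bin indices, clipped to the window.
    first_bin = max((gene_start - seq_start) // bin_size, 0)
    last_bin = min((gene_end - 1 - seq_start) // bin_size, num_bins - 1)

    if first_bin > last_bin:
        # The gene does not overlap the predicted window at all.
        return np.zeros(ref_preds.shape[1])

    diff = alt_preds - ref_preds
    return diff[first_bin:last_bin + 1].sum(axis=0)
```

Whether a gene body, a TSS-centered window, or some other interval is the right choice depends on the target type and is left open here.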
Hi,
In your Whole Blood Fine-Mapping paper, it's mentioned that "5313 Basenji features corresponding to functional activity predictors" were used.
Would it be possible to have some more information as to how these features were computed? What (if any) augmentations were used, and how were the sequences processed (how many bins, how were they aggregated)?
Thank you very much!