calico / basenji

Sequential regulatory activity predictions with deep convolutional neural networks.
Apache License 2.0

Blacklist file & unmappable regions #155

Closed by icdh99 1 year ago

icdh99 commented 1 year ago

Hi!

I am working on generating TFRecords that can be used with the Enformer model, and I am currently validating my work in progress by running the basenji_data.py script on 3 tracks that are included in the Basenji2/Enformer dataset.

```
bed_file=data/basenji_preprocess/unmap_macro.bed
output_dir=data/basenji_preprocess/output_tfr
genome=genomes/hg38.ml.fa
txt_file=data/basenji_preprocess/target_march21.txt

basenji/bin/basenji_data.py -g $bed_file -l 131072 --local --restart --crop 8192 -o $output_dir -p 8 -w 128 $genome $txt_file
```

The TFRecords contain the sequence and the target. All sequences are similar to the ones in the TFRecords I retrieved from basenji_barnyard/data/human/tfrecords/. However, there seem to be some divergences in the targets.

I have been reading the manuscript for clues and came across this passage:

"We applied several transformations to these tracks to protect the training procedure from large incorrect values. First, we collected **blacklist regions from ENCODE and added all RepeatMasker satellite repeats** [54], which we found to frequently collect large false positive signal [55]. We further defined **unmappable regions** of >32 bp where 24-mers align to >10 genomic sites using Umap mappability tracks [56]. We set signal values overlapping these regions to the 25th percentile value of each dataset. Finally, we soft clipped high values with the function f(x) = min(x, tc + sqrt(max(0, x − tc))). Above the threshold tc (chosen separately for each experiment and source), this function includes only the square root of the residual x − tc rather than the full difference. We manually chose tc per experiment and source by inspecting the maximum values, aiming to reduce the contribution of rare very large values that one would not expect to generalize to other genomic locations. Via this procedure, we decided to clip all CAGE data with tc = 384, ENCODE with tc = 32, and GEO with tc = 64."
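
The two transformations quoted above can be sketched in NumPy as a rough illustration. This is not the code from basenji_data.py: the function names `fill_masked` and `soft_clip` are hypothetical, and the sketch assumes the 25th percentile is taken over whatever array is passed in, including the masked positions, whereas the manuscript computes it per dataset.

```python
import numpy as np


def fill_masked(signal, mask, pct=25):
    """Set masked positions to the pct-th percentile of the signal.

    Assumption: the percentile is computed over the array passed in,
    including the masked positions themselves.
    """
    signal = np.asarray(signal, dtype=np.float64)
    out = signal.copy()
    out[np.asarray(mask, dtype=bool)] = np.percentile(signal, pct)
    return out


def soft_clip(x, tc):
    """Soft clip: f(x) = min(x, tc + sqrt(max(0, x - tc)))."""
    x = np.asarray(x, dtype=np.float64)
    return np.minimum(x, tc + np.sqrt(np.maximum(0.0, x - tc)))


# Toy coverage track; the last position stands in for a blacklisted bin.
signal = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
mask = np.array([False, False, False, False, True])

# Order follows the quoted passage: fill masked regions first, then soft clip.
processed = soft_clip(fill_masked(signal, mask), tc=32)
```

Values at or below tc pass through unchanged, while a value of 48 with tc = 32 becomes 32 + sqrt(16) = 36, so large outliers are compressed rather than truncated.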

I have set the clip, mean and sum stat values accordingly in the targets.txt file. I am using the file basenji/tutorials/data/unmap_macro.bed for the option -g GAPS_FILE according to basenji_preprocess.ipynb.

I think that the -b option refers to the first bold section (blacklist + satellite repeats), and the -u option to the second bold section (unmappable regions). I couldn't find these files in this repository. Could you please point me to these files if they are available, or otherwise supply some information on how to reproduce these files?

Thank you in advance!

davek44 commented 1 year ago

Yes, those files have been updated since I originally wrote the tutorials. The Basenji2/Enformer data processing used these versions:

- Blacklist: https://storage.googleapis.com/basenji_barnyard2/hg38.blacklist.rep.bed
- Unmappable: https://storage.googleapis.com/basenji_barnyard2/umap_k36_t10_l32_hg38.bed
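
For anyone comparing their own targets against the published ones, the following is a minimal sketch of how intervals from BED files like these could be mapped onto 128 bp target bins (the `-w 128` pooling width in the command above). The helper names `parse_bed` and `bin_mask` are hypothetical, and the rule that any overlap marks a bin is an assumption; basenji_data.py may apply a different criterion.

```python
import numpy as np


def parse_bed(lines):
    """Yield (chrom, start, end) from BED lines (0-based, half-open)."""
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or line.startswith(("#", "track", "browser")):
            continue
        yield fields[0], int(fields[1]), int(fields[2])


def bin_mask(intervals, chrom, win_start, win_end, bin_size=128):
    """Boolean mask over the bins of [win_start, win_end) on chrom.

    Assumption: a bin is marked if any interval overlaps it at all.
    """
    n_bins = (win_end - win_start) // bin_size
    mask = np.zeros(n_bins, dtype=bool)
    for c, start, end in intervals:
        # Skip other chromosomes, empty intervals, and non-overlapping ones.
        if c != chrom or end <= start or end <= win_start or start >= win_end:
            continue
        rel_start = max(start, win_start) - win_start
        rel_end = min(end, win_end) - win_start
        first = rel_start // bin_size
        last = (rel_end - 1) // bin_size  # inclusive index of the last bin hit
        mask[first:last + 1] = True
    return mask


# Toy BED content; real use would iterate over the lines of a downloaded file.
bed_lines = ["chr1\t100\t300", "chr2\t0\t50"]
intervals = list(parse_bed(bed_lines))
mask = bin_mask(intervals, "chr1", 0, 512)
```

The resulting mask has one entry per target bin and could be used to select which target values are replaced by the 25th percentile.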

icdh99 commented 1 year ago

Thank you for your quick response! I did not notice these files in the data repository but found them now :)