calico / basenji

Sequential regulatory activity predictions with deep convolutional neural networks.
Apache License 2.0

Tensorflow records generated with basenji_data.py differ from original #159

Open icdh99 opened 1 year ago

icdh99 commented 1 year ago

Hi!

I have been using the basenji_data.py script to recreate the TensorFlow records used to train the basenji2/enformer model, following the tutorial in the preprocess.ipynb notebook. I tried three tracks from ENCODE (ENCFF833POA, ENCFF828RQS, ENCFF003HJB), and the TFRecords I created differ slightly from the ones in the original dataset. Here are a few plots of how the two sets of targets differ. For each 128-bp bin, I plotted both the difference (my track - enformer track) and the two distributions.

[Plots: seq0_track2_diff, seq0_track2, seq4_track4_diff, seq4_track4]

Track 2 is the DNASE track, track 4 is the CHIP TF track.
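For readers who want numbers alongside the plots, the per-bin comparison described above can be summarized roughly as follows. This is only a sketch: the function name `summarize_bin_differences`, the tolerance, and the toy values are assumptions for illustration, and the two lists stand in for one track of one sequence after decoding the TFRecords.

```python
import math

def summarize_bin_differences(mine, reference, tol=1e-6):
    """Summarize per-bin differences between two equal-length coverage tracks."""
    if len(mine) != len(reference):
        raise ValueError("tracks must have the same number of bins")
    # Same sign convention as the plots: my track - reference track.
    diffs = [m - r for m, r in zip(mine, reference)]
    n = len(diffs)
    mean_m = sum(mine) / n
    mean_r = sum(reference) / n
    cov = sum((m - mean_m) * (r - mean_r) for m, r in zip(mine, reference))
    var_m = sum((m - mean_m) ** 2 for m in mine)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    # Pearson correlation is undefined when either track is constant.
    if var_m > 0 and var_r > 0:
        pearson = cov / math.sqrt(var_m * var_r)
    else:
        pearson = float("nan")
    return {
        "max_abs_diff": max(abs(d) for d in diffs),
        "mean_diff": sum(diffs) / n,
        "frac_bins_changed": sum(abs(d) > tol for d in diffs) / n,
        "pearson": pearson,
    }

# Toy values (hypothetical), one entry per 128-bp bin.
reference = [0.0, 1.0, 4.0, 2.0, 0.5, 0.0, 3.0, 1.0]
mine = [0.0, 1.0, 5.5, 2.0, 0.5, 0.0, 3.0, 0.5]
summary = summarize_bin_differences(mine, reference)
print(summary)
```

A high correlation with only a small fraction of changed bins would suggest localized differences rather than a global scaling or clipping mismatch.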

I used basenji_data.py with the following changes:

--crop 8192 -x 0

(to call data_write.py) uncomment lines 698 and 699:

# add cropped bp
    start = max(0, start-crop_bp)
    end += crop_bp

uncomment line 375 (to call data_read.py):

cmd += ' --crop %d' % options.crop_bp

In basenji_data_write.py, comment out line 157 and uncomment line 156:

seq_1hot = dna_1hot(seq_dna, n_uniform=False, n_sample=False)
# seq_1hot = dna_1hot_index(seq_dna) # more efficient, but fighting inertia

I have used the following files to call the script basenji_data.py:

bed_file=basenji_preprocess/unmap_macro.bed

genome=genomes/hg38.ml.fa

blacklist=hg38.blacklist.rep.bed

unmappable=umap_k24_t10_l32.bed

txt_file=basenji_preprocess/targets.txt # local file

the targets file looks like this:

index   identifier  file    clip    scale   sum_stat    description
2   ENCFF833POA ENCFF833POA.bw  32  2   mean    DNASE:cerebellum male adult (27 years) and male adult (35 years)
3   ENCFF828RQS ENCFF828RQS.bw  32  2   mean    CHIP:H3K9me3:stomach smooth muscle female adult (84 years)
4   ENCFF003HJB ENCFF003HJB.bw  32  2   mean    CHIP:CEBPB:HepG2
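When comparing against the original dataset, it can help to confirm that the clip, scale, and sum_stat values are parsed as intended. The sketch below reads the table above with Python's standard csv module. It assumes the file is tab-separated, and `read_targets` is a hypothetical helper, not part of basenji.

```python
import csv
import io

# The targets table from the post, assumed to be tab-separated.
TARGETS_TEXT = (
    "index\tidentifier\tfile\tclip\tscale\tsum_stat\tdescription\n"
    "2\tENCFF833POA\tENCFF833POA.bw\t32\t2\tmean\t"
    "DNASE:cerebellum male adult (27 years) and male adult (35 years)\n"
    "3\tENCFF828RQS\tENCFF828RQS.bw\t32\t2\tmean\t"
    "CHIP:H3K9me3:stomach smooth muscle female adult (84 years)\n"
    "4\tENCFF003HJB\tENCFF003HJB.bw\t32\t2\tmean\tCHIP:CEBPB:HepG2\n"
)

def read_targets(handle):
    """Parse a tab-separated targets table into a list of dicts (hypothetical helper)."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["index"] = int(row["index"])
        row["clip"] = float(row["clip"])
        row["scale"] = float(row["scale"])
        rows.append(row)
    return rows

targets = read_targets(io.StringIO(TARGETS_TEXT))
for position, t in enumerate(targets):
    # Print the row position next to the index column so the two can be compared.
    print(position, t["index"], t["identifier"], t["clip"], t["scale"], t["sum_stat"])
```

The row position is printed next to the index column because the two differ in this file (0, 1, 2 versus 2, 3, 4). Which of them determines the track order in the written records is not confirmed here and may be worth checking.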

Do you have any idea what could cause the discrepancy between the two target files? Thank you in advance, and let me know if I can provide more information or if anything is unclear!

davek44 commented 1 year ago

I would guess the difference is attributable to my evolving blacklist and perhaps changes to the mappability files over the years. I would take a look at the sites with the largest differences and see if they appear to be weird genomic regions.
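One way to follow this suggestion is to rank bins by absolute difference, convert them to genomic coordinates, and check whether they overlap blacklist or unmappable intervals. The sketch below is an illustration under assumptions: the helper names, the chromosome, the start coordinate, and the flagged interval are all hypothetical, and `seq_start` is taken to be the start of the first target bin after cropping.

```python
def top_difference_bins(mine, reference, seq_chrom, seq_start, bin_size=128, top_n=3):
    """Return (chrom, start, end, diff) for the bins with the largest |mine - reference|."""
    diffs = [m - r for m, r in zip(mine, reference)]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]), reverse=True)
    hits = []
    for i in order[:top_n]:
        start = seq_start + i * bin_size
        hits.append((seq_chrom, start, start + bin_size, diffs[i]))
    return hits

def overlaps_any(chrom, start, end, intervals):
    """True if the half-open interval [start, end) overlaps any BED-style interval."""
    return any(c == chrom and s < end and start < e for c, s, e in intervals)

# Toy values (hypothetical), one entry per 128-bp bin.
reference = [0.0, 1.0, 4.0, 2.0, 0.5, 0.0, 3.0, 1.0]
mine = [0.0, 1.0, 5.5, 2.0, 0.5, 0.0, 3.0, 0.5]
# Hypothetical interval standing in for rows of a blacklist or unmappable BED file.
flagged = [("chr1", 1300, 1400)]

hits = top_difference_bins(mine, reference, "chr1", seq_start=1000, top_n=2)
for chrom, start, end, diff in hits:
    print(chrom, start, end, diff, overlaps_any(chrom, start, end, flagged))
```

If most of the top-ranked bins fall inside flagged intervals, that would be consistent with the blacklist or mappability explanation. If they do not, another cause may be worth looking for.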

Another option is to just proceed with the analysis you hope to do. Despite the differences from the original tracks, both versions are likely fine. I'm just making judgement calls here.