TensorFlow records generated with basenji_data.py differ from original

Hi!

I have been using the basenji_data.py script to recreate the TFRecords used to train the Basenji2/Enformer model, following the tutorial in the preprocess.ipynb notebook. I tried three tracks from ENCODE (ENCFF833POA, ENCFF828RQS, ENCFF003HJB), and the TFRecords I created differ slightly from the ones in the original dataset. Below are a few plots of how the two sets of targets differ. For each 128-bp bin, I plotted both the difference (my track - Enformer track) and the two distributions.

[Figures: seq0_track2_diff, seq4_track4_diff]

Track 2 is the DNASE track; track 4 is the CHIP TF track.
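
For reference, the per-bin comparison behind these plots can be sketched as follows. This is a minimal sketch, assuming the targets of one sequence have already been decoded from each TFRecord into nested lists indexed as [bin][track]. `compare_track` and its arguments are hypothetical names, not part of basenji.

```python
import math


def compare_track(mine, reference, track):
    """Compare one track of two target matrices given as [bin][track] lists.

    Hypothetical helper for illustration; not part of basenji.
    """
    if not mine or len(mine) != len(reference):
        raise ValueError("target matrices must be non-empty and have the same number of bins")

    a = [row[track] for row in mine]
    b = [row[track] for row in reference]

    # Per-bin difference: my track - reference track.
    diff = [x - y for x, y in zip(a, b)]
    n = len(diff)

    # Pearson correlation between the two tracks across bins.
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    denom = math.sqrt(var_a * var_b)
    # The correlation is undefined when either track is constant.
    pearson = cov / denom if denom > 0 else float("nan")

    return {
        "diff": diff,
        "max_abs_diff": max(abs(d) for d in diff),
        "mean_diff": sum(diff) / n,
        "pearson": pearson,
    }
```

A high correlation combined with a non-zero mean difference would point to a scaling or pooling difference rather than a different underlying signal.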

I ran basenji_data.py with --crop 8192 -x 0 and made the following changes to the scripts. For the call to basenji_data_write.py, I uncommented lines 698 and 699:

# add cropped bp
start = max(0, start-crop_bp)
end += crop_bp

For the call to basenji_data_read.py, I uncommented line 375:

cmd += ' --crop %d' % options.crop_bp
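
As a sanity check on the crop value, the number of target bins per sequence can be worked out directly. This is a minimal sketch, assuming a 131072-bp input sequence (the sequence length is my assumption and is not stated above) together with the 128-bp bins and 8192-bp crop used here. `target_bins` is a hypothetical helper, not a basenji function.

```python
def target_bins(seq_length, crop_bp, pool_width):
    """Number of target bins left after cropping crop_bp from both ends.

    Hypothetical helper for illustration; not part of basenji.
    """
    if seq_length % pool_width != 0 or crop_bp % pool_width != 0:
        raise ValueError("seq_length and crop_bp must be multiples of pool_width")
    if 2 * crop_bp >= seq_length:
        raise ValueError("crop removes the whole sequence")
    return (seq_length - 2 * crop_bp) // pool_width


# Assumed sequence length of 131072 bp, cropped by 8192 bp on each side,
# pooled into 128-bp bins.
print(target_bins(131072, 8192, 128))  # 896
```

If the records I generate have a different number of bins per sequence than the original ones, the crop is not being applied the same way.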

In basenji_data_write.py, I commented out line 157 and uncommented line 156:

seq_1hot = dna_1hot(seq_dna, n_uniform=False, n_sample=False)
# seq_1hot = dna_1hot_index(seq_dna) # more efficient, but fighting inertia

I called basenji_data.py with the following files:

bed_file=basenji_preprocess/unmap_macro.bed

genome=genomes/hg38.ml.fa

blacklist=hg38.blacklist.rep.bed

unmappable=umap_k24_t10_l32.bed

txt_file=basenji_preprocess/targets.txt # local file

The targets file looks like this:

index   identifier  file    clip    scale   sum_stat    description
2   ENCFF833POA ENCFF833POA.bw  32  2   mean    DNASE:cerebellum male adult (27 years) and male adult (35 years)
3   ENCFF828RQS ENCFF828RQS.bw  32  2   mean    CHIP:H3K9me3:stomach smooth muscle female adult (84 years)
4   ENCFF003HJB ENCFF003HJB.bw  32  2   mean    CHIP:CEBPB:HepG2
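
To make the clip, scale, and sum_stat columns concrete, here is a minimal sketch of how I understand them to be applied to one track: base-pair coverage is pooled into 128-bp bins using sum_stat, clipped at clip, then multiplied by scale. The order of operations and the helper `bin_coverage` are my assumptions for illustration, not basenji's actual code.

```python
def bin_coverage(coverage, pool_width=128, sum_stat="mean", clip=32.0, scale=2.0):
    """Pool per-bp coverage into bins, then clip and scale each bin.

    Hypothetical helper; the pool -> clip -> scale order is an assumption.
    """
    if len(coverage) % pool_width != 0:
        raise ValueError("coverage length must be a multiple of pool_width")

    binned = []
    for start in range(0, len(coverage), pool_width):
        window = coverage[start:start + pool_width]
        if sum_stat == "mean":
            value = sum(window) / pool_width
        elif sum_stat == "sum":
            value = sum(window)
        else:
            raise ValueError("unsupported sum_stat: %s" % sum_stat)
        # Clip symmetrically, then apply the per-track scale factor.
        value = max(-clip, min(clip, value))
        binned.append(scale * value)
    return binned


# Two bins of toy coverage: a low-signal bin and a high-signal bin.
coverage = [1.0] * 128 + [100.0] * 128
print(bin_coverage(coverage, sum_stat="mean"))  # [2.0, 64.0]
print(bin_coverage(coverage, sum_stat="sum"))   # [64.0, 64.0]
```

The same coverage gives different binned values under mean and sum, so the two settings are not interchangeable when comparing against an existing dataset.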

Do you have any idea what could cause the discrepancy between the two target files? Thank you in advance, and let me know if I can provide more information or if anything is unclear!
