Open icdh99 opened 1 year ago
I would guess the difference is attributable to my evolving blacklist and perhaps changes to the mappability files over the years. I would take a look at the sites with the largest differences and see if they appear to be weird genomic regions.
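One way to act on that suggestion is to rank the bins by absolute difference and turn the top ones into genomic intervals that can be checked in a genome browser or against a blacklist. The following is only a minimal sketch: it assumes both tracks have already been extracted into equal-length lists of per-bin values, and the function name `top_diff_bins`, the `chrom`/`seq_start` inputs, and the toy values are all hypothetical rather than part of basenji.

```python
from typing import List, Tuple

BIN_SIZE = 128  # bp per target bin, as stated in the issue


def top_diff_bins(
    mine: List[float],
    reference: List[float],
    chrom: str,
    seq_start: int,
    k: int = 5,
    bin_size: int = BIN_SIZE,
) -> List[Tuple[str, int, int, float]]:
    """Return the k bins with the largest |mine - reference| as BED-like intervals.

    seq_start is the genomic coordinate of the first bin (hypothetical input;
    it must already account for any cropping applied to the targets).
    """
    if len(mine) != len(reference):
        raise ValueError("tracks must have the same number of bins")
    diffs = [m - r for m, r in zip(mine, reference)]
    # Sort bin indices by absolute difference, largest first.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]), reverse=True)
    intervals = []
    for i in order[:k]:
        start = seq_start + i * bin_size
        intervals.append((chrom, start, start + bin_size, diffs[i]))
    return intervals


if __name__ == "__main__":
    # Toy values for illustration only, not real data.
    mine = [0.0, 1.5, 2.0, 9.0, 0.5]
    ref = [0.0, 1.0, 2.0, 3.0, 2.5]
    for chrom, start, end, d in top_diff_bins(mine, ref, "chr1", 1_000_000, k=2):
        print(f"{chrom}\t{start}\t{end}\t{d:+.2f}")
```

The tab-separated output can be saved as a BED file and intersected with a blacklist or mappability track to see whether the largest discrepancies cluster in unusual regions.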
Another option is to just proceed with the analysis you hope to do. Despite the differences from the original tracks, both versions are likely fine. I'm just making judgement calls here.
Hi!
I have been using the basenji_data.py script to recreate the TensorFlow records (TFRecords) used to train the Basenji2/Enformer model, following the tutorial in the preprocess.ipynb notebook. I tried three tracks from ENCODE (ENCFF833POA, ENCFF828RQS, ENCFF003HJB), and the TFRecords I created differ somewhat from the ones in the original dataset. Here are a few plots of how the two sets of targets differ. For each 128-bp bin, I plotted both the difference (my track - enformer track) and the two distributions:
![seq4_track4](https://user-images.githubusercontent.com/77203776/231694900-3a90b24b-dca3-4f42-91c4-57f0bf95d5d7.png)
Track 2 is the DNase track; track 4 is the ChIP TF track.
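Alongside the plots, a few summary numbers per track can make the comparison easier to discuss. This is a rough sketch only: it assumes the two tracks are already available as equal-length sequences of per-bin values, and the helper name `compare_tracks` and the tolerance `tol` are hypothetical choices, not anything defined by basenji or Enformer.

```python
import math
from typing import Dict, Sequence


def compare_tracks(
    mine: Sequence[float],
    reference: Sequence[float],
    tol: float = 1e-3,
) -> Dict[str, float]:
    """Summarise per-bin agreement between a recreated track and a reference track."""
    if len(mine) != len(reference) or len(mine) == 0:
        raise ValueError("tracks must be non-empty and have the same number of bins")
    n = len(mine)
    diffs = [m - r for m, r in zip(mine, reference)]

    # Pearson correlation computed by hand to keep the sketch dependency-free.
    mean_m = sum(mine) / n
    mean_r = sum(reference) / n
    cov = sum((m - mean_m) * (r - mean_r) for m, r in zip(mine, reference))
    var_m = sum((m - mean_m) ** 2 for m in mine)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_m > 0 and var_r > 0:
        pearson = cov / math.sqrt(var_m * var_r)
    else:
        pearson = float("nan")  # undefined when either track is constant

    return {
        "mean_diff": sum(diffs) / n,  # signed: mine - reference
        "max_abs_diff": max(abs(d) for d in diffs),
        "frac_within_tol": sum(abs(d) <= tol for d in diffs) / n,
        "pearson": pearson,
    }


if __name__ == "__main__":
    # Toy values for illustration only, not real data.
    print(compare_tracks([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0]))
```

A high correlation with a small fraction of bins outside the tolerance would suggest the discrepancy is confined to a few regions rather than a systematic scaling or offset difference.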
I ran basenji_data.py with the following changes:
--crop 8192
-x 0
uncomment line 375 (to call data_read.py) and lines 698 + 699 (to call data_write.py):
cmd += ' --crop %d' % options.crop_bp
in file basenji_data_write.py: comment line 157, uncomment line 156:
I have used the following files to call the script basenji_data.py:
the targets file looks like this:
Do you have any idea what could cause the discrepancy between the two target files? Thank you in advance, and let me know if I can provide more information or if anything is unclear!