calico / basenji

Sequential regulatory activity predictions with deep convolutional neural networks.
Apache License 2.0

Reproduce the Enformer's input sequences split #190

Open sararb opened 8 months ago

sararb commented 8 months ago

I would like to regenerate the input sequences for Enformer/Basenji2 using basenji_data.py. I am running the following command:

python basenji_data.py -g hg38.gaps.bed -u umap_k36_t10_l32_hg38.bed -b hg38.blacklist.rep.bed -l 131072 -crop_bp 8192 -break_t 786432 -s 65599 -t .1 -v .1 -w 128 -o data/input_mseqs -p 8 targets.txt

However, the resulting sequences differ from those in the sequences.bed file stored here.

Can you please confirm if I am using the right options to generate the same sequence split?
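
To pin down how the two files differ (different regions, or the same regions assigned to different folds), a minimal sketch along these lines might help. It assumes sequences.bed is a four-column, whitespace-separated file (chromosome, start, end, split label such as train/valid/test); the function names and file paths are hypothetical and not part of basenji.

```python
from collections import Counter


def parse_bed(lines):
    """Parse BED lines into a dict mapping (chrom, start, end) -> split label.

    Assumes at least four whitespace-separated columns per line;
    shorter or blank lines are skipped.
    """
    regions = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue
        chrom, start, end, label = fields[:4]
        regions[(chrom, int(start), int(end))] = label
    return regions


def compare_splits(mine, reference):
    """Summarize how two region -> label mappings differ."""
    shared = mine.keys() & reference.keys()
    return {
        "only_mine": len(mine.keys() - reference.keys()),
        "only_reference": len(reference.keys() - mine.keys()),
        "shared": len(shared),
        # Regions present in both files but assigned to different folds.
        "shared_label_mismatch": sum(mine[r] != reference[r] for r in shared),
        "mine_label_counts": Counter(mine.values()),
        "reference_label_counts": Counter(reference.values()),
    }


# Hypothetical usage (paths are placeholders):
# with open("data/input_mseqs/sequences.bed") as f_mine, \
#         open("reference/sequences.bed") as f_ref:
#     print(compare_splits(parse_bed(f_mine), parse_bed(f_ref)))
```

If most regions are shared and only the labels differ, the discrepancy probably lies in the fold assignment (for example the random seed) rather than in the region-selection options. If few regions are shared, the stride or filtering options are more likely suspects.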

davek44 commented 8 months ago

Hi Sara, can you say a little more about your goal? It'll influence how I can best help. It'd be a little tricky for me to track down the exact parameters, and basenji_data.py has changed over the years. Is it OK if the recipe is equivalent in quality but differs due to minor tweaks and random number seeds?