calico / basenji

Sequential regulatory activity predictions with deep convolutional neural networks.
Apache License 2.0

Issue running basenji_data.py #154

Open petnas opened 1 year ago

petnas commented 1 year ago

Hi! First of all, great work on Basenji!

I tried to run basenji_data.py. The provided example data runs successfully, but when I replace the provided .bw file with my own file, it throws an error.

Example: /data/leuven/345/vsc34527/miniconda3/envs/basenji5/bin/python /vsc-hard-mounts/leuven-data/345/vsc34527/enformer_training/basenji/bin/basenji_data.py -s .1 -g data/unmap_macro.bed -l 131072 --local --restart -o data/heart_l131k -p 8 -t .1 -v .1 -w 128 data/hg19.ml.fa data/heart_wigs.txt

My data: /data/leuven/345/vsc34527/miniconda3/envs/basenji5/bin/python /vsc-hard-mounts/leuven-data/345/vsc34527/enformer_training/basenji/bin/basenji_data.py -s .1 -g data/unmap_macro.bed -l 131072 --local --restart -o data/microglia_output -p 8 -t .1 -v .1 -w 128 data/hg19.ml.fa data/microglia.txt

Error:

basenji_data_write.py -s 1679 -e 1858 --umap_clip 1.000000 -x 0 data/hg19.ml.fa data/microglia_output/sequences.bed data/microglia_output/seqs_cov data/microglia_output/tfrecords/test-0.tfr
Traceback (most recent call last):
  File "/vsc-hard-mounts/leuven-data/345/vsc34527/enformer_training/basenji/bin/basenji_data_write.py", line 240, in <module>
    main()
  File "/vsc-hard-mounts/leuven-data/345/vsc34527/enformer_training/basenji/bin/basenji_data_write.py", line 106, in main
    seq_pool_len = h5py.File(seqs_cov_files[0], 'r')['targets'].shape[1]
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/data/leuven/345/vsc34527/miniconda3/envs/basenji5/lib/python3.8/site-packages/h5py/_hl/group.py", line 328, in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 190, in h5py.h5o.open
KeyError: "Unable to open object (object 'targets' doesn't exist)"

The full error output is much longer; to save space, this is only the first iteration.

The bw file is from GEO and it is based on hg38. Is it the case that basenji_data.py can only use bw files made from bed files using your bam_cov.py?

I was wondering if anyone ever encountered a similar issue or is it just me making a silly mistake somewhere...

Thank you, Petras

icdh99 commented 1 year ago

Hi!

I'm not sure if this will help you, but I'm working on the same thing right now and I managed to use .bw files from ENCODE without using the bam_cov.py script. In my output, the command for the basenji_data_write.py call looks like this:

basenji/bin/basenji_data_write.py -s 20480 -e 20736 --umap_clip 1.000000 -x 0 genomes/hg38.ml.fa data/basenji_preprocess/output_tfr/sequences.bed data/basenji_preprocess/output_tfr/seqs_cov data/basenji_preprocess/output_tfr/tfrecords/train-80.tfr

I did put a print statement to get this line (around line 440 in basenji_data.py) so it might be different due to that. But in your error, it looks like the last 3 options are missing somehow for your call to the write script?

Not sure if it helps & otherwise no worries!

davek44 commented 1 year ago

Hi Petras, most likely the problem is that your BigWig is hg38 and your FASTA is hg19. Try again with hg38 FASTA. You'll want to drop the blacklist and unmappable, or replace with hg38 versions, too.
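
One way to check for the assembly mismatch described above is to compare chromosome lengths between the FASTA index and the BigWig header. The following is a minimal sketch, not part of Basenji: the helper names `parse_fai` and `find_assembly_mismatches` are hypothetical, and it assumes the FASTA has a `.fai` index (as produced by `samtools faidx`) and that the BigWig chromosome sizes are available as a dict, for example from pyBigWig's `chroms()` if that package is installed. The chr1 lengths below are the published hg19 and hg38 values and are used only for illustration.

```python
def parse_fai(fai_text):
    """Parse the text of a samtools .fai index into {chrom: length}."""
    sizes = {}
    for line in fai_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        # Column 1 is the sequence name, column 2 its length in bases.
        sizes[fields[0]] = int(fields[1])
    return sizes


def find_assembly_mismatches(fasta_sizes, bigwig_sizes):
    """Return (chrom, fasta_len, bigwig_len) for every disagreement.

    A length of None means the chromosome is absent from that file.
    """
    mismatches = []
    for chrom in sorted(set(fasta_sizes) | set(bigwig_sizes)):
        fa_len = fasta_sizes.get(chrom)
        bw_len = bigwig_sizes.get(chrom)
        if fa_len != bw_len:
            mismatches.append((chrom, fa_len, bw_len))
    return mismatches


# Illustrative input: chr1 is 249,250,621 bp in hg19 and 248,956,422 bp in hg38.
fai_text = "chr1\t249250621\t6\t50\t51\n"
fasta_sizes = parse_fai(fai_text)

# Assumption: with pyBigWig installed, this dict could instead come from
# pyBigWig.open("your_file.bw").chroms()
bigwig_sizes = {"chr1": 248956422}

for chrom, fa_len, bw_len in find_assembly_mismatches(fasta_sizes, bigwig_sizes):
    print(f"{chrom}: FASTA={fa_len} BigWig={bw_len}")
```

If any chromosome is reported, the two files were most likely built against different assemblies, and the FASTA (plus any blacklist or unmappable BED files) should be swapped for versions that match the BigWig.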