calico / basenji

Sequential regulatory activity predictions with deep convolutional neural networks.
Apache License 2.0

basenji_data.py error #105

Open beoungl opened 2 years ago

beoungl commented 2 years ago

I am running basenji_data.py on Micro-C data in bigWig format, and I ran into the following issue:

```
stride_train 1 converted to 131072.000000
stride_test 1 converted to 131072.000000
I'm confused by this event ordering: gstart - cend
```

Here is the command I used:

```
basenji_data.py -d .1 -g unmap_macro.bed -l 131072 --local -o micro_c -p 8 -t .1 -v .1 -w 128 hg38.ml.fa heart_wigs.txt
```

and here are the contents of the heart_wigs.txt file:

```
index  identifier  file                      clip  sum_stat  description
0      Cancer_1    mapped_cancer1.PT.bigwig  384   sum       Cancer_1
1      Normal_1    mapped_normal1.PT.bigwig  384   sum       Normal_1
2      Cancer_2    mapped_cancer2.PT.bigwig  384   sum       Cancer_2
3      Cancer_3    mapped_cancer3.PT.bigwig  384   sum       Cancer_3
```

From the looks of it, this part of the code seems to be causing the issue:

https://github.com/calico/basenji/blob/master/basenji/genome.py
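
The gap-splitting logic there appears to sweep over sorted (position, event) boundary pairs and exit on orderings it does not expect. Below is a minimal sketch of that idea; the function and names are illustrative, not the actual genome.py code:

```python
import sys

def split_by_gaps(chrom_len, gaps):
    """Split [0, chrom_len) into mappable segments by removing gap intervals.

    Simplified sketch of an event sweep: sort contig and gap boundary
    events by position, then emit segments between compatible event pairs.
    """
    # boundary events: contig start/end plus each gap's start/end
    events = [(0, 'cstart'), (chrom_len, 'cend')]
    for gstart, gend in gaps:
        events.append((gstart, 'gstart'))
        events.append((gend, 'gend'))
    events.sort()

    segments = []
    for (pos1, ev1), (pos2, ev2) in zip(events, events[1:]):
        if (ev1, ev2) in {('cstart', 'cend'), ('cstart', 'gstart'),
                          ('gend', 'gstart'), ('gend', 'cend')}:
            if pos1 < pos2:
                segments.append((pos1, pos2))  # mappable stretch
        elif (ev1, ev2) == ('gstart', 'gend'):
            pass  # inside a gap; nothing to emit
        else:
            # e.g. ('gstart', 'cend'): a gap opens but the chromosome ends
            # before the gap closes, i.e. the gap extends past the
            # chromosome end -- what a BED from another assembly would do.
            sys.exit("I'm confused by this event ordering: %s - %s" % (ev1, ev2))
    return segments

# A gap ending at 250 on a 200 bp chromosome reproduces the error:
# split_by_gaps(200, [(150, 250)])  ->  "gstart - cend"
```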

Is it OK for me to directly change the code to accept the gstart - cend combination, or was there a specific reason you handled it that way?

davek44 commented 2 years ago

The error reports that a contig was observed with an end point that is less than the chromosome's start point. Most likely, it indicates that you have files from different reference genome builds. Are your bigwig files mapped to hg38? Do they use the same chromosome labels as the fasta file you're using? If all of that looks OK, I would use pdb or print statements to figure out exactly what's confusing the program at the point where it returns that error.
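
One quick way to run those checks is a short script like the following sketch, assuming pyBigWig and pysam are installed (file names taken from the command above):

```python
import pyBigWig
import pysam

# chromosome names and lengths from the training fasta
fasta = pysam.FastaFile('hg38.ml.fa')
fasta_lens = dict(zip(fasta.references, fasta.lengths))

# compare against the chromosomes declared in one bigwig
bw = pyBigWig.open('mapped_cancer1.PT.bigwig')
for chrom, bw_len in bw.chroms().items():
    if chrom not in fasta_lens:
        print('missing from fasta:', chrom)
    elif fasta_lens[chrom] != bw_len:
        print('length mismatch:', chrom, bw_len, 'vs', fasta_lens[chrom])
bw.close()
fasta.close()
```

Any missing chromosome or length mismatch points to a build or naming disagreement.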

biginfor commented 5 months ago

I encountered the same error as you. When I removed -g unmap_macro.bed, the error disappeared. Is it possible that the above file is based on hg19 while your fasta file is hg38?
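
A sketch to test that hypothesis before dropping the flag, assuming pysam is installed (file names from the original command):

```python
import pysam

fasta = pysam.FastaFile('hg38.ml.fa')
chrom_lens = dict(zip(fasta.references, fasta.lengths))

# flag any unmappable interval that falls off its chromosome,
# which is what a BED built on a different assembly (e.g. hg19) would do
with open('unmap_macro.bed') as bed:
    for line in bed:
        chrom, start, end = line.split()[:3]
        if chrom not in chrom_lens:
            print('unknown chromosome:', chrom)
        elif int(end) > chrom_lens[chrom]:
            print('interval past chromosome end:', chrom, start, end)
fasta.close()
```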

davek44 commented 5 months ago

Also, all future development will occur here: https://github.com/calico/baskerville. You can create training datasets using the analogous hound_train.py.