calico / basenji

Sequential regulatory activity predictions with deep convolutional neural networks.
Apache License 2.0

basenji_data.py does not produce tfr files and gives no error message #130

Closed: songzeji closed this issue 2 years ago

songzeji commented 2 years ago

Hello Dr. Kelley, I have been trying to use basenji_data.py to preprocess data for training. However, it does not produce any tfr files and gives no error message. As shown below, all other files and directories are created successfully, but the tfrecords directory is empty. Could you please suggest how to solve this? Many thanks!

[Image: Screenshot 2022-07-29 at 10:01:01 PM]

Below is the command I used and the console output:

(basenji) zaksong@Zaks-MBP bin % basenji_data.py -d .1 -l 131072 --local -o data/training -p 8 -t .1 -v .1 -w 128 data/hg38.ml.fa data/wigs.txt  
stride_train 1 converted to 131072.000000  
stride_test 1 converted to 131072.000000  
Contigs divided into  
 Train:  4362 contigs, 2428021387 nt (0.8013)  
 Valid:   536 contigs,  301079678 nt (0.0994)  
 Test:    541 contigs,  301166759 nt (0.0994)  
basenji_data_read.py -w 128 -u mean -c 384.000000 -s 1.000000 data/GSE120063.CHAF1B.U-937.bw data/training/sequences.bed data/training/seqs_cov/0.h5  
basenji_data_write.py -s 0 -e 256 --umap_clip 1.000000 -x 0 data/hg38.ml.fa data/training/sequences.bed data/training/seqs_cov data/training/tfrecords/train-0.tfr  
basenji_data_write.py -s 256 -e 512 --umap_clip 1.000000 -x 0 data/hg38.ml.fa data/training/sequences.bed data/training/seqs_cov data/training/tfrecords/train-1.tfr  
basenji_data_write.py -s 512 -e 768 --umap_clip 1.000000 -x 0 data/hg38.ml.fa data/training/sequences.bed data/training/seqs_cov data/training/tfrecords/train-2.tfr  
basenji_data_write.py -s 768 -e 1024 --umap_clip 1.000000 -x 0 data/hg38.ml.fa data/training/sequences.bed data/training/seqs_cov data/training/tfrecords/train-3.tfr  
basenji_data_write.py -s 1024 -e 1280 --umap_clip 1.000000 -x 0 data/hg38.ml.fa data/training/sequences.bed data/training/seqs_cov data/training/tfrecords/train-4.tfr  
basenji_data_write.py -s 1280 -e 1536 --umap_clip 1.000000 -x 0 data/hg38.ml.fa data/training/sequences.bed data/training/seqs_cov data/training/tfrecords/train-5.tfr  
basenji_data_write.py -s 1536 -e 1792 --umap_clip 1.000000 -x 0 data/hg38.ml.fa data/training/sequences.bed data/training/seqs_cov data/training/tfrecords/train-6.tfr  
basenji_data_write.py -s 1792 -e 1808 --umap_clip 1.000000 -x 0 data/hg38.ml.fa data/training/sequences.bed data/training/seqs_cov data/training/tfrecords/train-7.tfr  
basenji_data_write.py -s 1808 -e 2012 --umap_clip 1.000000 -x 0 data/hg38.ml.fa data/training/sequences.bed data/training/seqs_cov data/training/tfrecords/valid-0.tfr  
basenji_data_write.py -s 2012 -e 2217 --umap_clip 1.000000 -x 0 data/hg38.ml.fa data/training/sequences.bed data/training/seqs_cov data/training/tfrecords/test-0.tfr
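
Because the driver only echoes the commands it launches, one way to narrow this down is to list which of the logged `basenji_data_write.py` outputs are missing and then rerun one of those commands by hand to see its own traceback. The sketch below is only an illustration and is not part of Basenji: `parse_write_jobs` and `missing_outputs` are hypothetical helper names, and the `LOG` string simply repeats two of the lines printed above.

```python
import os
import re

# Matches the write commands echoed in the console output, capturing the
# -s/-e sequence range and the trailing .tfr output path.
WRITE_RE = re.compile(
    r"^basenji_data_write\.py\b.*?-s (\d+) -e (\d+)\b.*?(\S+\.tfr)\s*$"
)


def parse_write_jobs(log_text):
    """Return (start, end, tfr_path) for each logged write command."""
    jobs = []
    for line in log_text.splitlines():
        match = WRITE_RE.match(line.strip())
        if match:
            jobs.append((int(match.group(1)), int(match.group(2)), match.group(3)))
    return jobs


def missing_outputs(jobs):
    """Return the jobs whose .tfr file is absent or empty on disk."""
    return [
        job for job in jobs
        if not os.path.isfile(job[2]) or os.path.getsize(job[2]) == 0
    ]


# Two lines copied from the console output above, used as sample input.
LOG = """\
basenji_data_write.py -s 0 -e 256 --umap_clip 1.000000 -x 0 data/hg38.ml.fa data/training/sequences.bed data/training/seqs_cov data/training/tfrecords/train-0.tfr
basenji_data_write.py -s 1808 -e 2012 --umap_clip 1.000000 -x 0 data/hg38.ml.fa data/training/sequences.bed data/training/seqs_cov data/training/tfrecords/valid-0.tfr
"""

jobs = parse_write_jobs(LOG)
for start, end, path in missing_outputs(jobs):
    print(f"missing: {path} (sequences {start} to {end})")
```

Any command reported as missing can then be pasted into the shell on its own, from the same working directory, so that whatever it prints is visible directly.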

Here is the content of my target file:

index  identifier    file                            clip  sum_stat  description
0      CHAF1B_U-937  data/GSE120063.CHAF1B.U-937.bw  384   mean      U-937
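
As a quick sanity check on the targets table above, the following sketch confirms that the file parses as tab-separated with the columns shown and that every listed coverage file exists relative to the current directory. It is an illustration under assumptions: `check_targets` is a hypothetical helper, not a Basenji function, and the expected column list is taken from the header shown here rather than from Basenji's documentation.

```python
import csv
import io
import os

# Column names as they appear in the targets file shown above.
EXPECTED = ["index", "identifier", "file", "clip", "sum_stat", "description"]


def check_targets(handle):
    """Return a list of problems found in a tab-separated targets table."""
    reader = csv.DictReader(handle, delimiter="\t")
    fieldnames = reader.fieldnames or []
    missing_cols = [col for col in EXPECTED if col not in fieldnames]
    if missing_cols:
        # Typically means the file is space-separated or the header differs.
        return [f"missing columns: {missing_cols}"]
    problems = []
    for row in reader:
        path = row["file"] or ""
        if not os.path.isfile(path):
            problems.append(f"{row['identifier']}: file not found: {path}")
    return problems


# Sample input mirroring the table above, with explicit tab separators.
sample = (
    "index\tidentifier\tfile\tclip\tsum_stat\tdescription\n"
    "0\tCHAF1B_U-937\tdata/GSE120063.CHAF1B.U-937.bw\t384\tmean\tU-937\n"
)
for problem in check_targets(io.StringIO(sample)):
    print(problem)
```

To check the real file, pass an open handle instead, for example `check_targets(open("data/wigs.txt", newline=""))`, run from the directory where `basenji_data.py` was invoked.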

The file was downloaded from ReMap as a BED file and converted to a bigWig file.

Here is the bigWig file: https://drive.google.com/file/d/10Y9klKijy5joKmZFAbeSbCrcCdbi7R3C/view?usp=sharing

Here is the link to the output directory: https://drive.google.com/drive/folders/1nLQ9Dnu9ns0Iz-ufd7r3aOWwWqBfrXpR?usp=sharing