calico / basenji

Sequential regulatory activity predictions with deep convolutional neural networks.
Apache License 2.0

Tensorflow / Traceback error. #193

Closed: jesspeers closed this issue 7 months ago

jesspeers commented 7 months ago

Hi,

I'm hoping to use Basenji on an HPC cluster that uses Slurm, so I have been working through the tutorials to check that my install works correctly and to learn how to run the scripts. (The tutorials are very well explained - thank you for making it so accessible!)

I have successfully run the first tutorial (preprocess) but am having issues with the train_test tutorial.

I submitted the following to a GPU node on our cluster:

python bin/basenji_train.py -o tutorials/models/heart tutorials/models/params_small.json data/heart_l131k_redownload

I tried both the data generated by the preprocessing tutorial and the data downloaded at the start of the train_test tutorial, and had the same issue both times.

I got the following error:

2024-04-22 11:57:47.666785: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING:tensorflow:AutoGraph could not transform <function SeqDataset.generate_parser.<locals>.parse_proto at 0x7f497a35c700> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: invalid syntax (tmp1xy_cyln.py, line 39)
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <function SeqDataset.generate_parser.<locals>.parse_proto at 0x7f49785401f0> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: invalid syntax (tmpig2u959t.py, line 39)
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
2024-04-22 11:57:48.912123: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2024-04-22 11:57:48.924943: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2000000000 Hz
WARNING:tensorflow:AutoGraph could not transform <function shift_sequence at 0x7f4979e6e9d0> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: invalid syntax (tmpb9mvb1lr.py, line 25)
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
Traceback (most recent call last):
  File "/ei/.project-scratch/0/0c7dc7bf-67f2-4ebb-ae07-2d34c4b403df/basenji/bin/basenji_train.py", line 182, in <module>
    main()
  File "/ei/.project-scratch/0/0c7dc7bf-67f2-4ebb-ae07-2d34c4b403df/basenji/bin/basenji_train.py", line 174, in main
    seqnn_trainer.fit_tape(seqnn_model)
  File "/ei/.project-scratch/0/0c7dc7bf-67f2-4ebb-ae07-2d34c4b403df/basenji/basenji/trainer.py", line 543, in fit_tape
    train_r.reset_states()
  File "/opt/software/mamba_basenji/lib/python3.9/site-packages/tensorflow/python/keras/metrics.py", line 253, in reset_states
    K.batch_set_value([(v, 0) for v in self.variables])
  File "/opt/software/mamba_basenji/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "/opt/software/mamba_basenji/lib/python3.9/site-packages/tensorflow/python/keras/backend.py", line 3706, in batch_set_value
    x.assign(np.asarray(value, dtype=dtype(x)))
  File "/opt/software/mamba_basenji/lib/python3.9/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 888, in assign
    raise ValueError(
ValueError: Cannot assign to variable count:0 due to variable shape (3,) and value shape () are incompatible
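
For anyone reading this traceback later: the `reset_states` frame shows the base Keras metric assigning the scalar `0` to every state variable, while the variable `count` has shape `(3,)`, which matches the three targets of the final `dense` layer in the model summary. A metric that keeps per-target state needs a reset that writes zeros of each variable's own shape. Whether that is the actual cause in this install is not confirmed here. The sketch below is illustrative only and is not Basenji's code: the class `PerTargetCount` and its `num_targets` argument are hypothetical names, and it assumes a TensorFlow 2.x install.

```python
import tensorflow as tf


class PerTargetCount(tf.keras.metrics.Metric):
    """Hypothetical metric holding one counter per target (vector state)."""

    def __init__(self, num_targets, name="per_target_count", **kwargs):
        super().__init__(name=name, **kwargs)
        # State variable of shape (num_targets,), like `count:0` in the traceback.
        self._count = self.add_weight(
            name="count", shape=(num_targets,), initializer="zeros"
        )

    def update_state(self, y_true, y_pred, sample_weight=None):
        # Count positions seen per target: sum ones over batch and length axes.
        ones = tf.ones_like(tf.cast(y_true, tf.float32))
        self._count.assign_add(tf.reduce_sum(ones, axis=[0, 1]))

    def result(self):
        return self._count

    def reset_state(self):
        # Assign zeros matching each variable's shape rather than a scalar 0.
        for v in self.variables:
            v.assign(tf.zeros(v.shape, dtype=v.dtype))

    # Older Keras releases call `reset_states`; expose the same shape-aware reset.
    reset_states = reset_state
```

With this override, calling `reset_states()` after an update leaves `count` as a zero vector of shape `(3,)` instead of attempting a scalar assignment.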

This is the output of the job before it failed:

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
sequence (InputLayer)           [(None, 131072, 4)]  0                                            
__________________________________________________________________________________________________
stochastic_reverse_complement ( ((None, 131072, 4),  0           sequence[0][0]                   
__________________________________________________________________________________________________
stochastic_shift (StochasticShi (None, 131072, 4)    0           stochastic_reverse_complement[0][
__________________________________________________________________________________________________
tf.nn.gelu (TFOpLambda)         (None, 131072, 4)    0           stochastic_shift[0][0]           
__________________________________________________________________________________________________
conv1d (Conv1D)                 (None, 131072, 64)   3840        tf.nn.gelu[0][0]                 
__________________________________________________________________________________________________
batch_normalization (BatchNorma (None, 131072, 64)   256         conv1d[0][0]                     
__________________________________________________________________________________________________
max_pooling1d (MaxPooling1D)    (None, 16384, 64)    0           batch_normalization[0][0]        
__________________________________________________________________________________________________
tf.nn.gelu_1 (TFOpLambda)       (None, 16384, 64)    0           max_pooling1d[0][0]              
__________________________________________________________________________________________________
conv1d_1 (Conv1D)               (None, 16384, 64)    20480       tf.nn.gelu_1[0][0]               
__________________________________________________________________________________________________
batch_normalization_1 (BatchNor (None, 16384, 64)    256         conv1d_1[0][0]                   
__________________________________________________________________________________________________
max_pooling1d_1 (MaxPooling1D)  (None, 4096, 64)     0           batch_normalization_1[0][0]      
__________________________________________________________________________________________________
tf.nn.gelu_2 (TFOpLambda)       (None, 4096, 64)     0           max_pooling1d_1[0][0]            
__________________________________________________________________________________________________
conv1d_2 (Conv1D)               (None, 4096, 72)     23040       tf.nn.gelu_2[0][0]               
__________________________________________________________________________________________________
batch_normalization_2 (BatchNor (None, 4096, 72)     288         conv1d_2[0][0]                   
__________________________________________________________________________________________________
max_pooling1d_2 (MaxPooling1D)  (None, 1024, 72)     0           batch_normalization_2[0][0]      
__________________________________________________________________________________________________
tf.nn.gelu_3 (TFOpLambda)       (None, 1024, 72)     0           max_pooling1d_2[0][0]            
__________________________________________________________________________________________________
conv1d_3 (Conv1D)               (None, 1024, 32)     6912        tf.nn.gelu_3[0][0]               
__________________________________________________________________________________________________
batch_normalization_3 (BatchNor (None, 1024, 32)     128         conv1d_3[0][0]                   
__________________________________________________________________________________________________
tf.nn.gelu_4 (TFOpLambda)       (None, 1024, 32)     0           batch_normalization_3[0][0]      
__________________________________________________________________________________________________
conv1d_4 (Conv1D)               (None, 1024, 72)     2304        tf.nn.gelu_4[0][0]               
__________________________________________________________________________________________________
batch_normalization_4 (BatchNor (None, 1024, 72)     288         conv1d_4[0][0]                   
__________________________________________________________________________________________________
dropout (Dropout)               (None, 1024, 72)     0           batch_normalization_4[0][0]      
__________________________________________________________________________________________________
add (Add)                       (None, 1024, 72)     0           max_pooling1d_2[0][0]            
                                                                 dropout[0][0]                    
__________________________________________________________________________________________________
tf.nn.gelu_5 (TFOpLambda)       (None, 1024, 72)     0           add[0][0]                        
__________________________________________________________________________________________________
conv1d_5 (Conv1D)               (None, 1024, 32)     6912        tf.nn.gelu_5[0][0]               
__________________________________________________________________________________________________
batch_normalization_5 (BatchNor (None, 1024, 32)     128         conv1d_5[0][0]                   
__________________________________________________________________________________________________
tf.nn.gelu_6 (TFOpLambda)       (None, 1024, 32)     0           batch_normalization_5[0][0]      
__________________________________________________________________________________________________
conv1d_6 (Conv1D)               (None, 1024, 72)     2304        tf.nn.gelu_6[0][0]               
__________________________________________________________________________________________________
batch_normalization_6 (BatchNor (None, 1024, 72)     288         conv1d_6[0][0]                   
__________________________________________________________________________________________________
dropout_1 (Dropout)             (None, 1024, 72)     0           batch_normalization_6[0][0]      
__________________________________________________________________________________________________
add_1 (Add)                     (None, 1024, 72)     0           add[0][0]                        
                                                                 dropout_1[0][0]                  
__________________________________________________________________________________________________
tf.nn.gelu_7 (TFOpLambda)       (None, 1024, 72)     0           add_1[0][0]                      
__________________________________________________________________________________________________
conv1d_7 (Conv1D)               (None, 1024, 32)     6912        tf.nn.gelu_7[0][0]               
__________________________________________________________________________________________________
batch_normalization_7 (BatchNor (None, 1024, 32)     128         conv1d_7[0][0]                   
__________________________________________________________________________________________________
tf.nn.gelu_8 (TFOpLambda)       (None, 1024, 32)     0           batch_normalization_7[0][0]      
__________________________________________________________________________________________________
conv1d_8 (Conv1D)               (None, 1024, 72)     2304        tf.nn.gelu_8[0][0]               
__________________________________________________________________________________________________
batch_normalization_8 (BatchNor (None, 1024, 72)     288         conv1d_8[0][0]                   
__________________________________________________________________________________________________
dropout_2 (Dropout)             (None, 1024, 72)     0           batch_normalization_8[0][0]      
__________________________________________________________________________________________________
add_2 (Add)                     (None, 1024, 72)     0           add_1[0][0]                      
                                                                 dropout_2[0][0]                  
__________________________________________________________________________________________________
tf.nn.gelu_9 (TFOpLambda)       (None, 1024, 72)     0           add_2[0][0]                      
__________________________________________________________________________________________________
conv1d_9 (Conv1D)               (None, 1024, 32)     6912        tf.nn.gelu_9[0][0]               
__________________________________________________________________________________________________
batch_normalization_9 (BatchNor (None, 1024, 32)     128         conv1d_9[0][0]                   
__________________________________________________________________________________________________
tf.nn.gelu_10 (TFOpLambda)      (None, 1024, 32)     0           batch_normalization_9[0][0]      
__________________________________________________________________________________________________
conv1d_10 (Conv1D)              (None, 1024, 72)     2304        tf.nn.gelu_10[0][0]              
__________________________________________________________________________________________________
batch_normalization_10 (BatchNo (None, 1024, 72)     288         conv1d_10[0][0]                  
__________________________________________________________________________________________________
dropout_3 (Dropout)             (None, 1024, 72)     0           batch_normalization_10[0][0]     
__________________________________________________________________________________________________
add_3 (Add)                     (None, 1024, 72)     0           add_2[0][0]                      
                                                                 dropout_3[0][0]                  
__________________________________________________________________________________________________
tf.nn.gelu_11 (TFOpLambda)      (None, 1024, 72)     0           add_3[0][0]                      
__________________________________________________________________________________________________
conv1d_11 (Conv1D)              (None, 1024, 32)     6912        tf.nn.gelu_11[0][0]              
__________________________________________________________________________________________________
batch_normalization_11 (BatchNo (None, 1024, 32)     128         conv1d_11[0][0]                  
__________________________________________________________________________________________________
tf.nn.gelu_12 (TFOpLambda)      (None, 1024, 32)     0           batch_normalization_11[0][0]     
__________________________________________________________________________________________________
conv1d_12 (Conv1D)              (None, 1024, 72)     2304        tf.nn.gelu_12[0][0]              
__________________________________________________________________________________________________
batch_normalization_12 (BatchNo (None, 1024, 72)     288         conv1d_12[0][0]                  
__________________________________________________________________________________________________
dropout_4 (Dropout)             (None, 1024, 72)     0           batch_normalization_12[0][0]     
__________________________________________________________________________________________________
add_4 (Add)                     (None, 1024, 72)     0           add_3[0][0]                      
                                                                 dropout_4[0][0]                  
__________________________________________________________________________________________________
tf.nn.gelu_13 (TFOpLambda)      (None, 1024, 72)     0           add_4[0][0]                      
__________________________________________________________________________________________________
conv1d_13 (Conv1D)              (None, 1024, 32)     6912        tf.nn.gelu_13[0][0]              
__________________________________________________________________________________________________
batch_normalization_13 (BatchNo (None, 1024, 32)     128         conv1d_13[0][0]                  
__________________________________________________________________________________________________
tf.nn.gelu_14 (TFOpLambda)      (None, 1024, 32)     0           batch_normalization_13[0][0]     
__________________________________________________________________________________________________
conv1d_14 (Conv1D)              (None, 1024, 72)     2304        tf.nn.gelu_14[0][0]              
__________________________________________________________________________________________________
batch_normalization_14 (BatchNo (None, 1024, 72)     288         conv1d_14[0][0]                  
__________________________________________________________________________________________________
dropout_5 (Dropout)             (None, 1024, 72)     0           batch_normalization_14[0][0]     
__________________________________________________________________________________________________
add_5 (Add)                     (None, 1024, 72)     0           add_4[0][0]                      
                                                                 dropout_5[0][0]                  
__________________________________________________________________________________________________
tf.nn.gelu_15 (TFOpLambda)      (None, 1024, 72)     0           add_5[0][0]                      
__________________________________________________________________________________________________
conv1d_15 (Conv1D)              (None, 1024, 64)     4608        tf.nn.gelu_15[0][0]              
__________________________________________________________________________________________________
batch_normalization_15 (BatchNo (None, 1024, 64)     256         conv1d_15[0][0]                  
__________________________________________________________________________________________________
dropout_6 (Dropout)             (None, 1024, 64)     0           batch_normalization_15[0][0]     
__________________________________________________________________________________________________
tf.nn.gelu_16 (TFOpLambda)      (None, 1024, 64)     0           dropout_6[0][0]                  
__________________________________________________________________________________________________
dense (Dense)                   (None, 1024, 3)      195         tf.nn.gelu_16[0][0]              
__________________________________________________________________________________________________
switch_reverse (SwitchReverse)  (None, 1024, 3)      0           dense[0][0]                      
                                                                 stochastic_reverse_complement[0][
==================================================================================================
Total params: 111,011
Trainable params: 109,235
Non-trainable params: 1,776
__________________________________________________________________________________________________
None
model_strides [128]
target_lengths [1024]
target_crops [0]
Checkpoint restored at epoch 4, optimizer iteration 1812.
Successful first step!
Epoch 4 - 570s - train_loss: 0.3677 - train_r: 0.2641 - train_r2: 0.0688 - valid_loss: 0.3551 - valid_r: 0.3145 - valid_r2: 0.0918 - best!
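
The summary above can be sanity-checked with a little arithmetic, which also shows where `model_strides [128]` and `target_lengths [1024]` come from. This is a rough sketch under stated assumptions: the convolutions are taken to be bias-free with kernel widths 15, 5, 5, 3 and 1 (inferred from the printed parameter counts, not stated in the output), each batch-normalization layer is taken to hold 4 values per channel (2 trainable, 2 non-trainable), and the pool sizes 8, 4, 4 are read off the printed output shapes.

```python
def conv1d_params(kernel, c_in, c_out):
    """Weights of a bias-free 1D convolution."""
    return kernel * c_in * c_out


def batchnorm_params(channels):
    """gamma and beta (trainable) plus moving mean and variance (non-trainable)."""
    return 4 * channels


# (kernel width, input channels, output channels); kernel widths are assumptions.
stem = [(15, 4, 64), (5, 64, 64), (5, 64, 72)]
residual_block = [(3, 72, 32), (1, 32, 72)]  # appears 6 times in the summary
final = [(1, 72, 64)]
convs = stem + residual_block * 6 + final

conv_total = sum(conv1d_params(*c) for c in convs)
# One batch-normalization layer follows every convolution.
bn_total = sum(batchnorm_params(c_out) for _, _, c_out in convs)
dense_total = 64 * 3 + 3  # 64 inputs to 3 targets, plus bias

total = conv_total + bn_total + dense_total
non_trainable = bn_total // 2
trainable = total - non_trainable

# Max-pooling reduces 131072 -> 16384 -> 4096 -> 1024, i.e. pool sizes 8, 4, 4.
stride = 1
for pool in (8, 4, 4):
    stride *= pool
target_length = 131072 // stride

print(total, trainable, non_trainable, stride, target_length)
# prints: 111011 109235 1776 128 1024
```

The totals agree with the `Total params`, `Trainable params` and `Non-trainable params` lines, and the final `dense` layer's 3 outputs are the same 3 that appear in the `(3,)` shape of the error message.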

I've spoken to our computing team and they don't think it's an issue with the install. Do you have any insight into what might be causing this error? I am not familiar with TensorFlow, so I'm not sure whether the problem is in the way I'm trying to run Basenji.

I'd really appreciate any help or guidance! Happy to provide any further info if required.

Many thanks, Jess

davek44 commented 7 months ago

Hi Jess, I'm not exactly sure what's going on there. We've moved on to a new codebase here: https://github.com/calico/baskerville, where we're continuing to actively develop and follow better software engineering practices. I'd recommend jumping over and trying your application there. Reach out if you get stuck, and we'll try to help.

jesspeers commented 7 months ago

Thank you! I'll give that a go