NVIDIA-Genomics-Research / AtacWorks

Deep learning based processing of Atac-seq data
https://clara-parabricks.github.io/AtacWorks/

main.py infer: error: argument --gpu: invalid int value: 'None' #153

Closed feefee20 closed 4 years ago

feefee20 commented 4 years ago

Hello,

I am sorry, but I don't know where or whom to ask, so I am leaving some questions here again (it would be great to have contact information such as an email address). I ran into some issues while trying both tutorial 1 and tutorial 2, as described below:

1) Here is the log from running the following command at step 7 of tutorial 1. It finished without any error or warning, but I still don't get the final files that you expect once it is done. Is this still because of an inappropriate memory or GPU setting (please see the following settings)?

export LSF_DOCKER_NETWORK=host
export LSF_DOCKER_IPC=host
export LSF_DOCKER_SHM_SIZE=3g
bsub -G compute-yooa -Is -q general-interactive -gpu "num=4:gmodel=TeslaV100_SXM2_32GB" -R 'rusage[mem=64GB]' -M 64GB -a 'docker(claraomics/atacworks)' /bin/bash
$atacworks/scripts/main.py train --config train_config.yaml --config_mparams model_structure.yaml --files_train Mono.50.2400.train.h5 --val_files Mono.50.2400.val.h5

INFO:2020-05-13 16:37:02,912:AtacWorks-main] Running on GPU: 0
Building model: resnet ...
Finished building.
Saving config file to ./trained_models_2020.05.13_16.37/configs/model_structure.yaml...
Num_batches 500; rank 0, gpu 0
Epoch [ 0/25] -------------------- [  0/500] mse:  20.142 | pearsonloss:   0.986 | total_loss:   1.603 | bce:   0.607
Epoch [ 0/25] ##------------------ [ 50/500] mse:2030.479 | pearsonloss:   0.055 | total_loss:   1.578 | bce:   0.507
Epoch [ 0/25] ####---------------- [100/500] mse:  20.136 | pearsonloss:   0.986 | total_loss:   1.079 | bce:   0.083
Epoch [ 0/25] ######-------------- [150/500] mse:1936.060 | pearsonloss:   0.015 | total_loss:   1.054 | bce:   0.071
Epoch [ 0/25] ########------------ [200/500] mse:  20.369 | pearsonloss:   0.986 | total_loss:   1.109 | bce:   0.112
Epoch [ 0/25] ##########---------- [250/500] mse:  58.329 | pearsonloss:   0.012 | total_loss:   0.100 | bce:   0.058
Epoch [ 0/25] ############-------- [300/500] mse:  20.124 | pearsonloss:   0.983 | total_loss:   1.079 | bce:   0.086
Epoch [ 0/25] ##############------ [350/500] mse:  42.750 | pearsonloss:   0.006 | total_loss:   0.080 | bce:   0.052
Epoch [ 0/25] ################---- [400/500] mse:  20.125 | pearsonloss:   0.983 | total_loss:   1.079 | bce:   0.086
Epoch [ 0/25] ##################-- [450/500] mse: 162.624 | pearsonloss:   0.010 | total_loss:   0.151 | bce:   0.060
Epoch [ 0/25] #################### [499/500] mse:   1.762 | pearsonloss:   0.955 | total_loss:   0.990 | bce:   0.034
Epoch [ 0/25] Time Taken: 956.095s
Total train time: 956.095   For time: 931.616   Back time: 3.983    Print time: 17.534  Remain (data) time: 2.962
Eval for 20 batches
Inference -------------------- [ 0/20] 
Evaluating on 50000 points.
Evaluation result: mse:27.3667 | corrcoef: 0.1628 | bce: 0.1126 | recall: 0.2045 | specificity: 0.9811 | auroc: 0.9592
Evaluation time taken:  26.184s
New best metric found - auroc: 0.9592
Saving model ckpt to ./trained_models_2020.05.13_16.37/epoch0_None...
Saving best model to ./trained_models_2020.05.13_16.37/model_best.pth.tar...
Num_batches 500; rank 0, gpu 0
Epoch [ 1/25] -------------------- [  0/500] mse: 362.712 | pearsonloss:   0.511 | total_loss:   0.864 | bce:   0.172
Epoch [ 1/25] ##------------------ [ 50/500] mse: 127.088 | pearsonloss:   0.572 | total_loss:   0.733 | bce:   0.098
Epoch [ 1/25] ####---------------- [100/500] mse: 212.383 | pearsonloss:   0.205 | total_loss:   0.457 | bce:   0.146
Epoch [ 1/25] ######-------------- [150/500] mse: 111.377 | pearsonloss:   0.711 | total_loss:   0.848 | bce:   0.081
Epoch [ 1/25] ########------------ [200/500] mse: 221.441 | pearsonloss:   0.203 | total_loss:   0.449 | bce:   0.135
Epoch [ 1/25] ##########---------- [250/500] mse: 105.753 | pearsonloss:   0.706 | total_loss:   0.835 | bce:   0.076
Epoch [ 1/25] ############-------- [300/500] mse: 215.121 | pearsonloss:   0.200 | total_loss:   0.423 | bce:   0.116
Epoch [ 1/25] ##############------ [350/500] mse:  98.106 | pearsonloss:   0.682 | total_loss:   0.809 | bce:   0.078
Epoch [ 1/25] ################---- [400/500] mse: 156.059 | pearsonloss:   0.205 | total_loss:   0.390 | bce:   0.106
Epoch [ 1/25] ##################-- [450/500] mse:  93.414 | pearsonloss:   0.682 | total_loss:   0.800 | bce:   0.071
Epoch [ 1/25] #################### [499/500] mse:  38.004 | pearsonloss:   0.494 | total_loss:   0.542 | bce:   0.030
Epoch [ 1/25] Time Taken: 954.789s
Total train time: 954.789   For time: 931.293   Back time: 3.849    Print time: 17.520  Remain (data) time: 2.126
Eval for 20 batches
Inference -------------------- [ 0/20] 
Evaluating on 50000 points.
Evaluation result: mse:76.8793 | corrcoef: 0.5468 | bce: 0.0542 | recall: 0.4850 | specificity: 0.9834 | auroc: 0.9712
Evaluation time taken:  25.658s
New best metric found - auroc: 0.9712
Saving model ckpt to ./trained_models_2020.05.13_16.37/epoch1_None...
Saving best model to ./trained_models_2020.05.13_16.37/model_best.pth.tar...
.   .   . 
Evaluation time taken:  28.959s
Saving model ckpt to ./trained_models_2020.05.13_16.37/epoch23_None...
Num_batches 500; rank 0, gpu 0
Epoch [24/25] -------------------- [  0/500] mse:  18.866 | pearsonloss:   0.999 | total_loss:   1.137 | bce:   0.129
Epoch [24/25] ##------------------ [ 50/500] mse:  58.982 | pearsonloss:   0.216 | total_loss:   0.303 | bce:   0.057
Epoch [24/25] ####---------------- [100/500] mse:  20.399 | pearsonloss:   0.993 | total_loss:   1.102 | bce:   0.099
Epoch [24/25] ######-------------- [150/500] mse:  36.769 | pearsonloss:   0.171 | total_loss:   0.229 | bce:   0.040
Epoch [24/25] ########------------ [200/500] mse:  19.917 | pearsonloss:   0.992 | total_loss:   1.100 | bce:   0.098
Epoch [24/25] ##########---------- [250/500] mse:  32.314 | pearsonloss:   0.167 | total_loss:   0.223 | bce:   0.040
Epoch [24/25] ############-------- [300/500] mse:  19.569 | pearsonloss:   0.991 | total_loss:   1.098 | bce:   0.098
Epoch [24/25] ##############------ [350/500] mse:  31.522 | pearsonloss:   0.163 | total_loss:   0.217 | bce:   0.039
Epoch [24/25] ################---- [400/500] mse:  19.408 | pearsonloss:   0.989 | total_loss:   1.096 | bce:   0.098
Epoch [24/25] ##################-- [450/500] mse:  30.666 | pearsonloss:   0.161 | total_loss:   0.215 | bce:   0.038
Epoch [24/25] #################### [499/500] mse:   8.531 | pearsonloss:   0.861 | total_loss:   0.906 | bce:   0.040
Epoch [24/25] Time Taken: 957.365s
Total train time: 957.365   For time: 932.471   Back time: 4.490    Print time: 17.516  Remain (data) time: 2.888
Eval for 20 batches
Inference -------------------- [ 0/20] 
Evaluating on 50000 points.
Evaluation result: mse:11.2150 | corrcoef: 0.5584 | bce: 0.1608 | recall: 0.2936 | specificity: 0.9997 | auroc: 0.4045
Evaluation time taken:  25.740s
Saving model ckpt to ./trained_models_2020.05.13_16.37/epoch24_None...
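
The `Evaluation result:` lines in a log like the one above can be tabulated to compare epochs. The following is a minimal sketch, not part of AtacWorks: `parse_eval_lines` is a hypothetical helper, and the excerpt holds only the three evaluation lines shown in this post (epochs 0, 1 and 24), so it says nothing about the elided epochs.

```python
# Hypothetical helper (not an AtacWorks function): collect the metrics
# printed on each "Evaluation result:" line of a training log.

LOG_EXCERPT = """\
Evaluation result: mse:27.3667 | corrcoef: 0.1628 | bce: 0.1126 | recall: 0.2045 | specificity: 0.9811 | auroc: 0.9592
Evaluation time taken:  26.184s
Evaluation result: mse:76.8793 | corrcoef: 0.5468 | bce: 0.0542 | recall: 0.4850 | specificity: 0.9834 | auroc: 0.9712
Evaluation time taken:  25.658s
Evaluation result: mse:11.2150 | corrcoef: 0.5584 | bce: 0.1608 | recall: 0.2936 | specificity: 0.9997 | auroc: 0.4045
"""

PREFIX = "Evaluation result:"


def parse_eval_lines(log_text):
    """Return one {metric_name: value} dict per evaluation line."""
    results = []
    for line in log_text.splitlines():
        if not line.startswith(PREFIX):
            continue
        metrics = {}
        # Fields look like "name: value" and are separated by "|".
        for field in line[len(PREFIX):].split("|"):
            name, value = field.split(":")
            metrics[name.strip()] = float(value)
        results.append(metrics)
    return results


if __name__ == "__main__":
    for index, metrics in enumerate(parse_eval_lines(LOG_EXCERPT)):
        print(index, metrics["auroc"], metrics["corrcoef"])
```

Note that the log itself reports `Saving best model to ./trained_models_2020.05.13_16.37/model_best.pth.tar...` whenever a new best auroc is found, so that directory is the place to look for the trained model.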

2) This is another error, from step 7 of tutorial 2. I get the same error regardless of whether I use --file_sizes or --sizes_file. Would you please let me know what is wrong or how to fix it? Thank you so much for your help.

$atacworks/scripts/main.py infer     --files NK.50_cells.h5     --file_sizes $atacworks/data/reference/hg19.auto.sizes     --config configs/infer_config.yaml     --config_mparams configs/model_structure.yaml 
usage: main.py infer [-h] --label LABEL --out_home OUT_HOME --task
                     {regression,classification,both} --print_freq PRINT_FREQ
                     --bs BS --num_workers NUM_WORKERS --pad PAD --transform
                     {log,None} --layers LAYERS --weights_path WEIGHTS_PATH
                     --gpu GPU [--distributed] --dist-url DIST_URL
                     --dist-backend DIST_BACKEND [--debug] [--config CONFIG]
                     --files FILES --intervals_file INTERVALS_FILE
                     --sizes_file SIZES_FILE --infer_threshold INFER_THRESHOLD
                     --reg_rounding REG_ROUNDING --cla_rounding CLA_ROUNDING
                     --batches_per_worker BATCHES_PER_WORKER [--gen_bigwig]
                     --result_fname RESULT_FNAME [--deletebg]
main.py infer: error: argument --gpu: invalid int value: 'None'

ntadimeti commented 4 years ago

@wookyung let's use Issue #175 for all the errors related to using your own custom data.
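
For readers who hit the same message: the wording `argument --gpu: invalid int value: 'None'` is what Python's `argparse` emits when an option declared with `type=int` receives the literal string `None`. The sketch below reproduces that behaviour in isolation. It is not AtacWorks code: the parser is a stand-in with a single `--gpu` option, `int_or_none` is a hypothetical converter, and the idea that the string came from a config value is an assumption this thread does not confirm.

```python
import argparse
import contextlib
import io


def int_or_none(value):
    """Hypothetical converter: map the literal string 'None' to Python None."""
    return None if value == "None" else int(value)


def parse_gpu(argv, gpu_type=int):
    """Parse --gpu with the given converter; return (value, error_text)."""
    # Stand-in parser, assumed for illustration; the real CLI has many more options.
    parser = argparse.ArgumentParser(prog="main.py infer")
    parser.add_argument("--gpu", type=gpu_type, required=True)
    stderr = io.StringIO()
    try:
        # argparse prints usage and the error to stderr, then raises SystemExit.
        with contextlib.redirect_stderr(stderr):
            args = parser.parse_args(argv)
    except SystemExit:
        return None, stderr.getvalue()
    return args.gpu, ""


if __name__ == "__main__":
    print(parse_gpu(["--gpu", "0"]))                  # an integer parses cleanly
    print(parse_gpu(["--gpu", "None"])[1])            # same error wording as in this issue
    print(parse_gpu(["--gpu", "None"], int_or_none))  # tolerant converter accepts it
```

If the value does come from a config file, supplying an explicit integer (for example `--gpu 0` on the command line, a flag the usage text above lists) would sidestep the conversion; whether that resolves this particular report is not established here.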