KennthShang / PhaBOX

Local version of the phage identification and analysis web server (tool set)
https://phage.ee.cityu.edu.hk/
Academic Free License v3.0

RuntimeError: CUDA out of memory issue #11

Open WUD2018 opened 6 months ago

WUD2018 commented 6 months ago

Could you please help me with this issue here?

Here is my code:

```
python /home/disk1/PhaBOX/main.py --contigs test_contigs.fa --threads 20 --rootpth ./tst.PhaBOX --dbdir /home/disk1/PhaBOX/database --parampth /home/disk1/PhaBOX/parameters/ --scriptpth /home/disk1/PhaBOX/scripts
```

Here is the nohup output:

```
Using parallelized prodigal...
Running prodigal...
Running Diamond...
  0%|          | 0/1 [00:00<?, ?ba/s]
100%|██████████| 1/1 [00:00<00:00, 103.75ba/s]
The following columns in the test set don't have a corresponding argument in BertForSequenceClassification.forward and have been ignored: text.
Running Prediction
  Num examples = 2
  Batch size = 32
  0%|          | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/disk1/PhaBOX/main.py", line 382, in <module>
    out = cnn(val_feature)
  File "/home/disk1/Anaconda3/envs/phabox/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/disk1/PhaBOX/models/CAPCNN.py", line 30, in forward
    x = [F.relu(conv(x)).squeeze(3) for conv in self.convs1]
  File "/home/disk1/PhaBOX/models/CAPCNN.py", line 30, in <listcomp>
    x = [F.relu(conv(x)).squeeze(3) for conv in self.convs1]
  File "/home/disk1/Anaconda3/envs/phabox/lib/python3.9/site-packages/torch/nn/functional.py", line 1299, in relu
    result = torch.relu(input)
RuntimeError: CUDA out of memory. Tried to allocate 166.00 MiB (GPU 0; 1.95 GiB total capacity; 732.96 MiB already allocated; 148.94 MiB free; 758.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
100%|██████████| 1/1 [00:00<00:00, 1.35it/s]
```
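
(As an aside, the allocator hint at the end of the traceback can be tried by setting PYTORCH_CUDA_ALLOC_CONF before PyTorch touches CUDA. A minimal sketch follows; the 64 MiB split size is an arbitrary example value, and on a ~2 GiB card it may not be enough to avoid the OOM.)

```python
import os

# Set the allocator hint before the first CUDA call (e.g. at the very top
# of main.py, or export it in the shell before launching the script).
# 64 MiB is an arbitrary example value for illustration.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:64")

import torch  # imported after the env var so the CUDA allocator picks it up

print(torch.cuda.is_available())
```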

KennthShang commented 6 months ago

This is because your GPU device is not suitable (it is out of date) for running the deep-learning model. Please try turning off your GPU and rerunning the code.

The simplest way is to change the device assignment in the code to `device = torch.device("cpu")`, for example:
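
(A minimal sketch of this change; `cnn` and `val_feature` are the names from the traceback, while the `Linear` layer here is only a toy stand-in for the real CAPCNN model.)

```python
import torch

# Force PyTorch onto the CPU instead of auto-selecting CUDA; this bypasses
# the ~2 GiB GPU that ran out of memory.
device = torch.device("cpu")

cnn = torch.nn.Linear(512, 19).to(device)         # toy stand-in for the real model
val_feature = torch.randn(2, 512, device=device)  # inputs must live on the same device
out = cnn(val_feature)                            # now runs entirely on the CPU
```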

Best, Jiayu

WUD2018 commented 6 months ago

> This is because your GPU device is not suitable (it is out of date) for running the deep-learning model. Please try turning off your GPU and rerunning the code.
>
> The simplest way is to change the device assignment in the code to `device = torch.device("cpu")`.
>
> Best, Jiayu

Now my PhaBOX run completes the Diamond step, but it appears to run into another issue while generating the knowledge graph:

```
---------------------------Generating Knowledge graph---------------------------
adj: (1168, 1168)
features: (1168, 512)
y: (1168,) (1168,)
mask: (1168,) (1168,)
/home/disk1/PhaBOX/main.py:706: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  train_mask = torch.from_numpy(train_mask.astype(np.bool)).to(device)
/home/disk1/PhaBOX/main.py:708: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  test_mask = torch.from_numpy(test_mask.astype(np.bool)).to(device)
x : tensor(indices=tensor([[   0,    0,    0,  ..., 1167, 1167, 1167],
                           [ 511,  505,  500,  ...,   36,   17,    3]]),
           values=tensor([0.0010, 0.0080, 0.0148,  ..., 0.0370, 0.0564, 0.0093]),
           size=(1168, 512), nnz=53631, layout=torch.sparse_coo)
sp: tensor(indices=tensor([[   0,    1,    2,  ..., 1165, 1166, 1167],
                           [   0,    0,    0,  ..., 1167, 1167, 1167]]),
           values=tensor([0.0068, 0.0074, 0.0065,  ..., 0.1111, 0.1111, 0.1111]),
           size=(1168, 1168), nnz=71924, layout=torch.sparse_coo)
input dim: 512
output dim: 19
num_features_nonzero: 53631
/home/disk1/PhaBOX/models/PhaGCN.py:23: UserWarning: indexing with dtype torch.uint8 is now deprecated, please use a dtype torch.bool instead. (Triggered internally at /opt/conda/conda-bld/pytorch_1639180487213/work/aten/src/ATen/native/IndexingUtils.h:30.)
  i = i[:, dropout_mask]
/home/disk1/PhaBOX/models/PhaGCN.py:24: UserWarning: indexing with dtype torch.uint8 is now deprecated, please use a dtype torch.bool instead. (Triggered internally at /opt/conda/conda-bld/pytorch_1639180487213/work/aten/src/ATen/native/IndexingUtils.h:30.)
  v = v[dropout_mask]
Traceback (most recent call last):
  File "/home/disk1/PhaBOX/main.py", line 735, in <module>
    loss = masked_loss(out, train_label, train_mask)
  File "/home/disk1/PhaBOX/scripts/ulity.py", line 77, in masked_loss
    loss = F.cross_entropy(out, label, w, reduction='none')
  File "/home/disk1/Anaconda3/envs/phabox/lib/python3.9/site-packages/torch/nn/functional.py", line 2846, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument weight in method wrapper_nll_loss_forward)
```

I've checked ulity.py under scripts/: line 77 reads something related to 'cuda' again. Should I revise it, and if so, how? Thanks.

By the way, is there a way to turn off CUDA in the phabox conda environment?

KennthShang commented 6 months ago

I found the reason: there is a place I did not change in scripts/ulity.py.

You also need to change main.py in the same way.
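
(A minimal sketch of this kind of device-consistency fix. `w` is the class-weight tensor named in the error; the actual masked_loss in scripts/ulity.py takes (out, label, mask) and defines the weights internally, so this toy version that passes `w` explicitly is an assumption for illustration.)

```python
import torch
import torch.nn.functional as F

def masked_loss(out, label, mask, w):
    # Move the class weights onto the same device as the logits; the error
    # arose because `w` was still on cuda:0 while `out` and `label` were on CPU.
    w = w.to(out.device)
    loss = F.cross_entropy(out, label, w, reduction='none')
    mask = mask.float()
    mask = mask / mask.mean()     # rescale so masked-out entries contribute nothing
    return (loss * mask).mean()

# Toy usage with random data:
out = torch.randn(4, 19)
label = torch.randint(0, 19, (4,))
mask = torch.tensor([1., 1., 0., 0.])
w = torch.ones(19)
print(masked_loss(out, label, mask, w))
```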

Sure, you may need to reinstall PyTorch in your environment, though I am not sure whether other kinds of problems will show up while fixing the environment. Try:

```
conda uninstall pytorch
conda install -c conda-forge pytorch-cpu
```
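
(After reinstalling, a quick way to confirm the environment no longer sees CUDA; the exact version string depends on the build channel.)

```python
import torch

print(torch.__version__)          # pip CPU wheels carry a "+cpu" suffix; conda builds may not
print(torch.cuda.is_available())  # should print False once the CPU-only build is active
```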

Best, Jiayu

WUD2018 commented 6 months ago

Thanks a lot. It is now running successfully!

Wangdongyang12 commented 3 months ago

Could you please help me with this issue? Here are my results and log:

########################### result ############################################

Only three result files were produced for the contigs: phagcn_prediction.csv, phamer_prediction.csv, phatyp_prediction.csv.

```
wc -l *.csv
  39215 phagcn_prediction.csv
 114305 phamer_prediction.csv
  39215 phatyp_prediction.csv
 192735 total
```

########################### log #############################################

```
Traceback (most recent call last):
  File "/home/wangdongyang/data/teacher_linyan/virsortv2/05.phabox/PhaBOX-main/main.py", line 1170, in <module>
    node2label[node] = prokaryote_df[prokaryote_df['Accession'] == crispr_pred[node]]['Species'].values[0]
IndexError: index 0 is out of bounds for axis 0 with size 0
100%|██████████| 1226/1226 [40:09:16<00:00, 117.91s/it]
```

#############################################################################

By the way, my other input files generate normal result files, so what could be the reason for this?

```
ls -lhtr
total 39M
633K Mar 31 13:03 phamer_prediction.csv
248K Mar 31 13:03 phatyp_prediction.csv
239K Mar 31 13:03 phagcn_prediction.csv
326K Mar 31 13:03 cherry_prediction.csv
1.8M Mar 31 13:03 phagcn_edge.csv
 91K Mar 31 13:03 phagcn_node.csv
2.8M Mar 31 13:03 cherry_edge.csv
295K Mar 31 13:03 cherry_node.csv
 20M Mar 31 13:03 significant_proteins.fa
 14M Mar 31 13:03 blast_results.tab
```
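
(For what it's worth, the IndexError at main.py line 1170 means the Accession lookup returned an empty frame for some node before `.values[0]` was indexed, i.e. a CRISPR-predicted accession that is missing from prokaryote_df. A minimal, hypothetical guard, with toy stand-ins for the names in the traceback:)

```python
import pandas as pd

# Toy stand-ins for the objects named in the traceback (main.py, line 1170).
prokaryote_df = pd.DataFrame({'Accession': ['NC_000913'],
                              'Species': ['Escherichia coli']})
crispr_pred = {'contig_1': 'NC_000913',   # present in the table
               'contig_2': 'NC_999999'}   # absent: this is what raised the IndexError
node2label = {}

for node in crispr_pred:
    matches = prokaryote_df.loc[prokaryote_df['Accession'] == crispr_pred[node],
                                'Species'].values
    # Guard the empty-lookup case instead of indexing .values[0] unconditionally.
    node2label[node] = matches[0] if len(matches) else 'unknown'

print(node2label)  # {'contig_1': 'Escherichia coli', 'contig_2': 'unknown'}
```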