Closed xichenli00 closed 11 months ago
The array you're trying to create is relatively small (with a size of around 943 KiB), so it's unusual to encounter a memory error for such a size unless your system is heavily loaded or has very limited RAM. Maybe you can:
@SineZHAN Thank you for your patience. I still have some questions to ask you:
File "D:\Anaconda\envs\deepAL\lib\site-packages\torch\nn\functional.py", line 2846, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: "nll_loss_forward_reduce_cuda_kernel_2d_index" not implemented for 'Int'
python-BaseException
When I change the code "loss = F.cross_entropy(out, y)" to "loss = F.cross_entropy(out, y.long())", another error occurs:
RuntimeError: CUDNN_STATUS_EXECUTION_FAILED
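The first error above comes from passing 32-bit integer labels to the loss: `F.cross_entropy` expects class-index targets as 64-bit integers (`torch.long`). The following is a minimal sketch of the cast, run on CPU with made-up tensors; the helper name `classification_loss` is hypothetical and not part of deepALplus.

```python
import torch
import torch.nn.functional as F


def classification_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Cross-entropy with class-index targets cast to int64 (torch.long)."""
    # Class-index targets must be int64; int32 labels are rejected.
    return F.cross_entropy(logits, targets.long())


# Illustrative data only: 4 samples, 10 classes, labels stored as int32.
logits = torch.randn(4, 10)
labels = torch.tensor([1, 0, 3, 9], dtype=torch.int32)
loss = classification_loss(logits, labels)
print(loss.dtype, loss.ndim)
```

Casting once when the labels are built (for example when the label tensor is created in the data-loading function) avoids repeating `.long()` in every training step.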
**2. What should I do if I want to use deepALPlus to train on my own dataset (about 90,000 256x256 images)? Is it possible to avoid loading all the images into memory?**
"RuntimeError: CUDNN_STATUS_EXECUTION_FAILED" typically indicates a problem with execution in CUDA/cuDNN. It can be caused by various issues, including insufficient GPU memory, incompatible CUDA and cuDNN versions, or a mismatch between PyTorch and the installed CUDA/cuDNN versions. Please check: 1) Ensure that your GPU has enough available memory to run your program. You can use the nvidia-smi command to monitor memory usage. If you are running out of memory, try reducing the batch size or simplifying your model. 2) Make sure that the versions of CUDA and cuDNN you have installed are compatible with your PyTorch version. This information can be found on the official PyTorch website. You can reinstall them if needed. (To be honest, it is not recommended to run the code on your own Windows PC, especially on your own dataset with 90,000 samples. You could ask your supervisor to provide a better server, or use a cloud service such as Colab.)
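The version and memory checks described above can also be done from inside Python. This is a small sketch using standard PyTorch calls; the function name `cuda_environment_report` and the dictionary keys are made up for illustration.

```python
import torch


def cuda_environment_report() -> dict:
    """Collect the versions and GPU memory figures relevant to CUDA/cuDNN errors."""
    report = {
        "torch_version": torch.__version__,
        "cuda_build_version": torch.version.cuda,         # None on CPU-only builds
        "cudnn_version": torch.backends.cudnn.version(),  # None if cuDNN is unavailable
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        device = torch.cuda.current_device()
        props = torch.cuda.get_device_properties(device)
        report["gpu_name"] = props.name
        report["gpu_total_mib"] = props.total_memory // (1024 ** 2)
        # Memory currently held by tensors of this process only.
        report["gpu_allocated_mib"] = torch.cuda.memory_allocated(device) // (1024 ** 2)
    return report


if __name__ == "__main__":
    for key, value in cuda_environment_report().items():
        print(f"{key}: {value}")
```

Note that `gpu_allocated_mib` only covers the current process; nvidia-smi remains the better tool for seeing what other processes are using.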
I don't know the details of your dataset or which backbone model you are using. If you can already run the target task, i.e., your whole dataset can be fully trained on your backbone model (e.g., ResNet18) successfully, then you can run the task on deepAL+ with your current computational resources.
I have resolved the issue mentioned earlier, but during training I still run out of memory and PyCharm shuts down. The crash happens in the loop that starts with:
for batch_idx, (x, y, idxs) in enumerate(loader):
Therefore, I would like to inquire if there are alternative methods to load training set images into memory only when needed, rather than loading all images directly into memory.
My training dataset consists of 57,640 images of size 256x256. The validation and test sets each contain 8,905 images. I am using MobileNetV3-Small as the backbone model. I have successfully implemented a binary classification task for lesion regions locally, and the model accuracy is good.
I would like to ask if there is a better approach, rather than loading all the training and validation images into memory, for example, using code like:
X_tr = [np.array(Image.open(data[i][0])) for i in train_idx]
X_te = [np.array(Image.open(data[i][0])) for i in test_idx]
Is it possible to load images batch by batch during training to improve memory efficiency and prevent memory crashes?
Yes, you need to write your own "Dataset" and "DataLoader". You can revise how the items are fetched, for example:
def __getitem__(self, index):
    img_path = self.data[self.idx[index]][0]
    image = Image.open(img_path)
    if self.transform:
        image = self.transform(image)
    return image
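A fuller version of this idea is sketched below: the dataset stores only file paths and decodes one image per `__getitem__` call, so the `DataLoader` holds just the current batch in memory. The class name `LazyImageDataset` is hypothetical; it returns `(x, y, idx)` tuples because the training loop quoted in this thread unpacks batches that way, and the fallback tensor conversion is an assumption for the case where no transform is given.

```python
import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset, DataLoader


class LazyImageDataset(Dataset):
    """Keeps file paths in memory and decodes each image only when it is requested."""

    def __init__(self, paths, labels, transform=None):
        self.paths = list(paths)
        self.labels = list(labels)
        self.transform = transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        # Open and decode a single image; the file handle is closed right after.
        with Image.open(self.paths[index]) as img:
            image = img.convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        else:
            # Fallback: HWC uint8 array -> CHW float tensor scaled to [0, 1].
            image = torch.from_numpy(np.array(image)).permute(2, 0, 1).float() / 255.0
        return image, int(self.labels[index]), index
```

Usage would look like `loader = DataLoader(LazyImageDataset(paths, labels), batch_size=64, shuffle=True)`. Decoding from disk on every access is slower than indexing an in-memory array, so setting `num_workers` on the `DataLoader` may help if loading becomes the bottleneck.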
Can you provide me with some advice? For example, besides the need to modify the code in data.py and net.py, what other parts should I consider modifying? I sincerely appreciate your help!
Hello, I don't know much about your project, so I cannot anticipate what you will need to change apart from the dataset loading and the backbone model design. You can make changes step by step while checking whether the results meet your expectations. Good luck!
OK, thank you very much!
I'm running deepALplus on a server for remote debugging, targeting the BreakHis dataset. When executing the training script with the command "python demo -a EntropySampling -s 1000 -q 5000 -b 64 -d CIFAR10 --seed 42 -t 3 -g 0", the training process crashes around the 5th epoch out of 10, even though the server has 40GB of memory. The server configuration is "CPU: 12 cores/GPU, Xeon(R) Platinum 8255C, 40GB RAM, 10GB GPU memory (GPU 3080)".
The code is:
for epoch in tqdm(range(1, n_epoch + 1), ncols=100):
    for batch_idx, (x, y, idxs) in enumerate(loader):
        x, y = x.to(self.device), y.to(self.device)
        optimizer.zero_grad()
        out, e1 = self.clf(x)  # e1--{Tensor:(128,512)} out--{Tensor:(128,10)}
        loss = F.cross_entropy(out, y.long().cuda())  # modified: y --> y.long().cuda() for BreakHis
        loss.backward()
        optimizer.step()
With this code, the 40 GB of CPU memory fills up and the process crashes!
I suspect there might be a bug. Do the variables x, y, out, e1, and loss need to be retained after their usage in the code? Can memory be released by using
del x, y, out, e1, loss
I haven't made any modifications to the original code. I would like to understand why this is happening and how to address or modify it.
Could you kindly provide details about the server configuration you used when training the BreakHis dataset?
If you have more questions, please contact me by email.
Thank you for your patience.
Hello, @SineZHAN.
def get_BreakHis(handler, args_task):
    # download data from https://www.kaggle.com/datasets/ambarish/breakhis and unzip it in data/BreakHis/
    data_dir = './data/BreakHis/BreaKHis_v1/BreaKHis_v1/histology_slides/breast'
    data = datasets.ImageFolder(root=data_dir, transform=None).imgs
    train_ratio = 0.7
    test_ratio = 0.3
    data_idx = list(range(len(data)))
    random.shuffle(data_idx)
    train_idx = data_idx[:int(len(data) * train_ratio)]
    test_idx = data_idx[int(len(data) * train_ratio):]
    X_tr = [np.array(Image.open(data[i][0])) for i in train_idx]
    Y_tr = [data[i][1] for i in train_idx]
    X_te = [np.array(Image.open(data[i][0])) for i in test_idx]
    Y_te = [data[i][1] for i in test_idx]
    X_tr = np.array(X_tr, dtype=object)
    X_te = np.array(X_te, dtype=object)
    Y_tr = torch.from_numpy(np.array(Y_tr))
    Y_te = torch.from_numpy(np.array(Y_te))
    return Data(X_tr, Y_tr, X_te, Y_te, handler, args_task)
A memory error occurs while running this code!
X_tr stores every decoded image, which fills up the memory. How can I modify it so that it still works with the Data class?
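One possible direction, sketched under assumptions: keep only the file paths in `X_tr` and `X_te` and leave the decoding to the handler. The helper name `split_paths` and its `seed` parameter are hypothetical; `samples` stands for the list of `(path, label)` pairs that `datasets.ImageFolder(...).imgs` provides. Whether the `Data` class and the handler accept an array of paths has not been verified here, so the handler's `__getitem__` would have to be adapted to call `Image.open` on each entry.

```python
import random

import numpy as np
import torch


def split_paths(samples, train_ratio=0.7, seed=0):
    """Split (path, label) pairs into train/test sets without opening any image."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)  # local RNG, so the split is reproducible
    cut = int(len(samples) * train_ratio)
    train_idx, test_idx = idx[:cut], idx[cut:]

    # Arrays of path strings instead of decoded pixel arrays.
    X_tr = np.array([samples[i][0] for i in train_idx])
    X_te = np.array([samples[i][0] for i in test_idx])
    # Labels as int64, the dtype that F.cross_entropy expects for class indices.
    Y_tr = torch.tensor([samples[i][1] for i in train_idx], dtype=torch.long)
    Y_te = torch.tensor([samples[i][1] for i in test_idx], dtype=torch.long)
    return X_tr, Y_tr, X_te, Y_te
```

With this layout the arrays hold a short string per image instead of a full pixel array, and indexing such as `X_tr[idxs]` still works the same way as before.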