SineZHAN / deepALplus

This is a toolbox for Deep Active Learning, extending the earlier DeepAL toolbox (https://github.com/ej0cl6/deep-active-learning).
MIT License

There was a memory error when I loaded the breakhis dataset #8

Closed xichenli00 closed 11 months ago

xichenli00 commented 12 months ago

Hello, @SineZHAN.

    def get_BreakHis(handler, args_task):
        # download data from https://www.kaggle.com/datasets/ambarish/breakhis and unzip it in data/BreakHis/
        data_dir = './data/BreakHis/BreaKHis_v1/BreaKHis_v1/histology_slides/breast'
        data = datasets.ImageFolder(root=data_dir, transform=None).imgs
        train_ratio = 0.7
        test_ratio = 0.3
        data_idx = list(range(len(data)))
        random.shuffle(data_idx)
        train_idx = data_idx[:int(len(data) * train_ratio)]
        test_idx = data_idx[int(len(data) * train_ratio):]
        X_tr = [np.array(Image.open(data[i][0])) for i in train_idx]
        Y_tr = [data[i][1] for i in train_idx]
        X_te = [np.array(Image.open(data[i][0])) for i in test_idx]
        Y_te = [data[i][1] for i in test_idx]
        X_tr = np.array(X_tr, dtype=object)
        X_te = np.array(X_te, dtype=object)
        Y_tr = torch.from_numpy(np.array(Y_tr))
        Y_te = torch.from_numpy(np.array(Y_te))
        return Data(X_tr, Y_tr, X_te, Y_te, handler, args_task)

A memory error occurred while running this code!

MemoryError: Unable to allocate 943. KiB for an array with shape (460, 700, 3) and data type uint8

X_tr keeps accumulating decoded images until memory is full. How can I modify this so that it still works with the Data class?
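
For scale: the failing allocation is a single image, but the list comprehensions keep every decoded image alive at once. A rough estimate, assuming all images share the reported shape (460, 700, 3) and that the dataset holds 7,909 images (an assumed count for BreakHis, not a figure from this thread; the helper name is also hypothetical):

```python
def decoded_bytes(n_images, height, width, channels, bytes_per_value=1):
    """Bytes needed to hold n_images decoded uint8 images in RAM at once."""
    return n_images * height * width * channels * bytes_per_value


per_image = decoded_bytes(1, 460, 700, 3)
print(f"one image: {per_image / 1024:.0f} KiB")  # matches the 943 KiB in the error

n_images = 7909  # assumption: replace with len(data) from ImageFolder
total = decoded_bytes(n_images, 460, 700, 3)
print(f"all images: {total / 1024**3:.1f} GiB")
```

Any temporary copies made while building the arrays come on top of this figure, so the total can exceed free RAM even though each individual array is small.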

SineZHAN commented 12 months ago

The array you're trying to create is relatively small (with a size of around 943 KiB), so it's unusual to encounter a memory error for such a size unless your system is heavily loaded or has very limited RAM. Maybe you can:

  1. Close Other Applications: Make sure to close any non-essential applications that might be consuming a significant amount of memory.
  2. Check System Memory Usage: Check your system's memory usage to see if it's unusually high. On Windows, you can do this using the Task Manager, and on Linux, tools like "htop" can be helpful.
  3. Run Your Code on a Different Machine: If you have access to a machine with more memory, try running your code there to see if the problem is specific to your current machine.

xichenli00 commented 11 months ago

@SineZHAN Thank you for your patience. I still have some questions to ask you:

  1. After closing other software as you suggested, I found that all the images were loaded into memory. Then, when I run the line "loss = F.cross_entropy(out, y)" in the training phase, the console reports:

  File "D:\Anaconda\envs\deepAL\lib\site-packages\torch\nn\functional.py", line 2846, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: "nll_loss_forward_reduce_cuda_kernel_2d_index" not implemented for 'Int'
python-BaseException

When I change the code "loss = F.cross_entropy(out, y)" to "loss = F.cross_entropy(out, y.long())", another error occurs:

RuntimeError: "nll_loss_forward_reduce_cuda_kernel_2d_index" not implemented for 'Int'
RuntimeError: CUDNN_STATUS_EXECUTION_FAILED

  2. What should I do if I want to use deepALplus to train on my own dataset (about 90,000 images of 256x256)? Can I avoid loading all the images into memory?

SineZHAN commented 11 months ago

"RuntimeError: CUDNN_STATUS_EXECUTION_FAILED" typically indicates a problem during execution in CUDA/cuDNN. It can be caused by insufficient GPU memory, by incompatible CUDA and cuDNN versions, or by a mismatch between PyTorch and the installed CUDA/cuDNN versions. Please check:

  1. Ensure that your GPU has enough available memory to run your program. You can use the nvidia-smi command to monitor memory usage. If you are running out of memory, try reducing the batch size or simplifying your model.
  2. Make sure that the CUDA and cuDNN versions you have installed are compatible with your PyTorch version. This information can be found on the official PyTorch website. You can reinstall them if needed.

(To be honest, I do not recommend running the code on your own Windows PC, especially on your own dataset with 90,000 samples. You could ask your supervisor to provide a better server, or use a cloud service such as Colab.)
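
The version and GPU memory checks mentioned above can also be read from inside Python. This is a minimal sketch, not part of deepALplus: the function name cuda_report is hypothetical, and it only reports what PyTorch itself sees, so memory used by other processes still needs nvidia-smi.

```python
def cuda_report():
    """Collect PyTorch/CUDA/cuDNN versions and GPU memory figures in one dict."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    report = {
        "torch_installed": True,
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,         # None for CPU-only builds
        "cudnn": torch.backends.cudnn.version(),  # None if cuDNN is unavailable
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        props = torch.cuda.get_device_properties(0)
        report["gpu"] = props.name
        report["total_gib"] = props.total_memory / 1024**3
        # memory currently held by tensors of this process only
        report["allocated_gib"] = torch.cuda.memory_allocated(0) / 1024**3
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

Comparing "cuda_build" and "cudnn" against the compatibility table on the PyTorch website is the check described in point 2.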

I don't know the details of your dataset or which backbone model you use. If your whole dataset can be trained successfully on your backbone model (e.g., ResNet18) with your current computational resources, you can also run the task in deepAL+.

xichenli00 commented 11 months ago

I have resolved the issue mentioned earlier, but I still encounter memory crashes during training, and PyCharm shuts down. The crash happens in the loop "for batch_idx, (x, y, idxs) in enumerate(loader):". Therefore, I would like to ask whether there is a way to load training images into memory only when they are needed, rather than loading all of them up front.

My training dataset consists of a total of 57,640 images with dimensions 256x256. The validation and test sets each contain 8,905 images. I am using MobileNetV3_small as the backbone model. I have successfully implemented a binary classification task for lesion regions locally, and the model accuracy is good.

I would like to ask if there is a better approach than loading all the training and validation images into memory with code like:

    X_tr = [np.array(Image.open(data[i][0])) for i in train_idx]
    X_te = [np.array(Image.open(data[i][0])) for i in test_idx]

Is it possible to load images batch by batch during training to improve memory efficiency and prevent memory crashes?

SineZHAN commented 11 months ago

Yes, you need to write your own "Dataset" and "DataLoader". You can revise how items are fetched, for example:

    def __getitem__(self, index):
        img_path = self.data[self.idx[index]][0]
        image = Image.open(img_path)
        if self.transform:
            image = self.transform(image)
        return image
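
A fuller sketch of that idea, under stated assumptions: the class name LazyImageHandler and the helper pil_loader are hypothetical, and the (image, label, index) return value is chosen only to match the loop "for batch_idx, (x, y, idxs) in enumerate(loader)" quoted earlier in this thread. The class stores (path, label) pairs and decodes one image per access, so nothing is held in RAM up front.

```python
def pil_loader(path):
    """Open an image file and convert it to RGB (requires Pillow)."""
    from PIL import Image  # imported lazily: the class itself has no hard dependency
    with Image.open(path) as img:
        return img.convert("RGB")


class LazyImageHandler:
    """Map-style dataset over (path, label) pairs; images are decoded on access."""

    def __init__(self, samples, transform=None, loader=pil_loader):
        self.samples = list(samples)  # only paths and labels are kept in memory
        self.transform = transform
        self.loader = loader

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        path, label = self.samples[index]
        image = self.loader(path)  # decoded here, one image at a time
        if self.transform is not None:
            image = self.transform(image)
        return image, label, index
```

With this approach, get_BreakHis would keep the (path, label) tuples, e.g. [data[i] for i in train_idx], instead of calling np.array(Image.open(...)) on each one. A torch.utils.data.DataLoader can wrap any object with __len__ and __getitem__, but the transform then has to produce tensors of one common size so that samples can be batched. How this plugs into the Data class of deepALplus is not shown in this thread, so that wiring is left open.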

xichenli00 commented 11 months ago

Can you provide me with some advice? For example, besides the need to modify the code in data.py and net.py, what other parts should I consider modifying? I sincerely appreciate your help!

SineZHAN commented 11 months ago

Hello, I don't know much about your project, so I cannot anticipate what you need to change beyond dataset loading and backbone model design. You can make changes step by step while checking whether the results meet your expectations. Good luck!

xichenli00 commented 11 months ago

OK, thank you very much!

xichenli00 commented 11 months ago

I'm running deepALplus on a server for remote debugging, targeting the BreakHis dataset. When executing the training script with the command "python demo -a EntropySampling -s 1000 -q 5000 -b 64 -d CIFAR10 --seed 42 -t 3 -g 0", the training process crashes around the 5th epoch out of 10, even though the server has 40GB of memory. The server configuration is "CPU: 12 cores/GPU, Xeon(R) Platinum 8255C, 40GB RAM, 10GB GPU memory (GPU 3080)".

The code is:

    for epoch in tqdm(range(1, n_epoch + 1), ncols=100):
        for batch_idx, (x, y, idxs) in enumerate(loader):
            x, y = x.to(self.device), y.to(self.device)
            optimizer.zero_grad()
            out, e1 = self.clf(x)  # e1--{Tensor:(128,512)} out--{Tensor:(128,10)}
            loss = F.cross_entropy(out, y.long().cuda())  # changed: y --> y.long().cuda() for BreakHis
            loss.backward()
            optimizer.step()

Code I added; the process still crashes with 40GB of CPU memory:

    del x, y, out, e1, loss

I suspect there might be a bug. Do the variables x, y, out, e1, and loss need to be retained after they are used? Can memory be released by using

    del x, y, out, e1, loss

I haven't made any modifications to the original code. I would like to understand why this is happening and how to address it. Could you kindly provide details about the server configuration you used when training on the BreakHis dataset?

SineZHAN commented 11 months ago

  1. You only have 10GB of GPU memory, which is small in most situations, but it should be enough if no one else is using the machine.
  2. You can monitor GPU usage with watch -n [time interval] -d nvidia-smi to see whether any GPU memory is left. You may not be the only one using the server.
  3. You didn't provide the error message. A program can crash for many reasons.
  4. I don't think the 40GB of CPU memory is related to how the program runs.
  5. When you first encounter an error, try searching for the error message on Google/Bing before asking me.
  6. The configuration is recorded in our paper; you can read it first.

If you have more questions, please contact me by email.

xichenli00 commented 11 months ago

Thank you for your patience.