BobMcDear / neural-network-cuda

Neural network from scratch in CUDA/C++
GNU General Public License v3.0

Segmentation fault #2

Closed · EricWu23 closed this 2 years ago

EricWu23 commented 2 years ago

Hi Borna @BobMcDear, thanks for the amazing project and detailed documentation. I tried to run your project on my computer. While all the unit tests passed, when I run main.cu in the GPU folder, I get a segmentation fault. If I reduce the network to only a linear layer and a ReLU, it runs without faulting. However, training on the GPU is much slower than the CPU version.

Here is my suspicion. The most confusing part of the code to me is the memory management in the sequential_forward_gpu() function. There, after getting the output size of each layer via sz_out = layer->sz_out;, we call cudaMallocManaged(&curr_out, sz_out*sizeof(float)); to allocate memory. However, I don't see the memory allocated inside that for loop ever being released with cudaFree().
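To make the pattern concrete, here is a simplified sketch of what I believe the loop does (identifiers such as layers, curr_inp, and the forward signature are my reading of the code, not an exact quote):

// Sketch of the forward-pass allocation pattern in question: every call
// allocates a fresh managed buffer per layer and never releases it.
float *curr_inp = inp;
for (size_t i = 0; i < layers.size(); i++) {
    Module *layer = layers[i];
    int sz_out = layer->sz_out;

    float *curr_out;
    cudaMallocManaged(&curr_out, sz_out * sizeof(float)); // allocated on every forward pass
    layer->forward(curr_inp, curr_out);

    curr_inp = curr_out; // the output feeds the next layer; the buffer is never freed
}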

My understanding is that you have to let that allocated memory stay there, because you need to store every layer's outputs during forward propagation; they are later used in backward propagation to compute gradients. However, this raises my concern about a potential memory leak. As we add more layers, more memory is allocated, and memory is also allocated every time we call layer->update() inside sequential_update_gpu(). What makes it worse is that when we loop through 100 epochs, every new iteration allocates fresh memory.

Question 1: Am I misunderstanding something? This is just my suspicion; let me know if you had a different design in mind.

Question 2: Do you know, in general, how PyTorch or similar frameworks manage the temporary memory needed for backward propagation? It seems to me that this could be a nightmare for deep NNs.

Question 3: I understand that cudaMallocManaged lets you allocate a chunk of memory that can be accessed from host and device through the same pointer, and that copying data on/off the device is handled automatically. Is my understanding correct? Thanks.

Some additional notes after debugging: if I change n_hidden to n_in/10, I can get further in terms of epochs. That is, instead of failing in the first epoch, I get the segmentation fault in the 4th or 5th epoch, which is consistent with a leak: smaller per-layer buffers take more epochs to exhaust memory.

BobMcDear commented 2 years ago

Hello Yujiang,

Thank you for the kind words, I am glad to hear you find this project helpful.

Yes, you are right that the code retains the intermediate values of the neural network; however, that is not the problem. The segmentation fault stems from not freeing the inputs & outputs of the layers after every epoch, as you mention, which may be fixed by running the following at the end of each epoch.

for (size_t i = 0; i < seq.layers.size(); i++) {
    Module *layer = seq.layers[i];
    cudaFree(layer->out);
}
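For instance, placed at the end of the training loop (an illustrative sketch; everything besides seq, mse, and layer->out is a placeholder rather than the repo's exact code):

for (int epoch = 0; epoch < n_epochs; epoch++) {
    seq.forward(inp, out);                     // forward pass allocates fresh managed buffers
    mse.forward(seq.layers.back()->out, targ); // loss on the final layer's output
    seq.update();                              // backward pass & parameter update

    // Free every layer's output buffer so the next epoch's
    // cudaMallocManaged calls don't pile up.
    for (size_t i = 0; i < seq.layers.size(); i++)
        cudaFree(seq.layers[i]->out);
}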

I was aware of this issue, but this program's purpose was to teach the basics of CUDA, so no emphasis was placed on speed or memory handling.

This is also how PyTorch functions: The outputs of the modules are kept for performing backpropagation, but they are not retained for the entire duration of training and are replaced at every pass.

You are correct in your understanding of cudaMallocManaged.
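As a minimal standalone illustration of that behavior (a toy example, not code from this repo):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void add_one(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    int n = 8;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float)); // one pointer, valid on host and device
    for (int i = 0; i < n; i++) x[i] = 0.0f;  // host writes, no explicit cudaMemcpy

    add_one<<<1, n>>>(x, n);                  // device reads and writes the same pointer
    cudaDeviceSynchronize();                  // wait for the kernel before the host reads

    printf("%f\n", x[0]);                     // prints 1.000000; pages migrated automatically
    cudaFree(x);
    return 0;
}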

Does that answer your questions?

EricWu23 commented 2 years ago

Thanks for the answers.

After adding the code you suggested, I still encounter the segmentation fault (see screenshot). I also feel there are at least two more places that look off to me (see screenshots). Here, I think we should pass the Sequential_GPU object by reference; otherwise, a new copy is made of the seq that was created in main() of main.cu.

Another thing: I added seq.layers.back()->out = mse.inp; because I assume mse.inp will contain dJ/dy_hat, the partial derivative of the loss with respect to the output. However, I don't know where you link dJ/dy_hat to the out of seq.layers.back(), because it looks to me like the update function of the linear layer assumes out holds the gradient of the loss with respect to the layer's output. Then again, maybe it's unnecessary, since after mse.forward(seq.layers.back()->out, targ);, seq.layers.back()->out and mse.inp technically point to the same location.

But I am still getting the segmentation fault after these changes. Let me know if you disagree with my changes.

EricWu23 commented 2 years ago

I think I roughly know what happened. Following https://github.com/Jittor/jittor/issues/35, I found I was getting CUDA error code=702 (which appears to be cudaErrorLaunchTimeout, i.e. a kernel launch that ran too long and was terminated). I then reduced the batch size, and training runs without the segmentation fault. Probably my machine just isn't powerful enough.
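For anyone else debugging this, checking return codes surfaces the actual error instead of a bare segmentation fault; a minimal pattern (my own habit, not something the repo does):

// cudaDeviceSynchronize also reports errors from earlier asynchronous kernel launches.
cudaError_t err = cudaDeviceSynchronize();
if (err != cudaSuccess)
    fprintf(stderr, "CUDA error %d: %s\n", err, cudaGetErrorString(err));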

But on the other hand, I got a very unintuitive result: the CPU seems to be much faster than the GPU when I set bs=10000, n_in=50, n_epochs=100. CPU training took 16.3755 sec while the GPU took 191.297 sec; the final loss is about the same, 0.00387839 for both. Do you have a guess as to why this could happen? My suspicion is that, due to the modular implementation and the use of cudaMallocManaged, data is moving on/off the device in the middle of training. But I am not sure what could cause that: looking at your code, you are just copying pointers, which should not cause the actual data to move from device to host.
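For reference, I'm measuring wall-clock time roughly like this; the cudaDeviceSynchronize matters because kernel launches are asynchronous (a sketch, not the repo's exact timing code):

#include <chrono>

auto start = std::chrono::steady_clock::now();
// ... training loop ...
cudaDeviceSynchronize(); // make sure all queued GPU work finishes before stopping the clock
auto end = std::chrono::steady_clock::now();
double secs = std::chrono::duration<double>(end - start).count();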
What did your timing results look like? Thanks.

BobMcDear commented 2 years ago

It is strange that you were forced to reduce the batch size; this program occupies less than 300 megabytes of memory. Yes, some of the objects could have been passed by reference instead of by value, but that wouldn't greatly impact memory usage, particularly since the network is small. What are your specs?

Regarding linking the gradients to seq.layers.back(), it is managed in the backward pass of mse, where its input, which is the output of seq.layers.back(), is replaced by gradients. Ergo, there is no need to manually connect them.
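Schematically (restating the aliasing you already noticed, not quoting the repo):

// mse keeps a pointer to its input rather than copying the data, so after
mse.forward(seq.layers.back()->out, targ);
// mse.inp == seq.layers.back()->out, i.e. the same managed buffer.
// When the backward pass of mse overwrites that buffer with dJ/dy_hat,
// seq.layers.back()->out sees the gradients automatically, making a manual
// seq.layers.back()->out = mse.inp; assignment redundant.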

Training with your settings gives me vastly different results - GPU training takes 2.71 seconds, whereas CPU training takes 11.38 seconds. Have you made changes to the code? I am running the original version plus the part for freeing memory after every epoch.

EricWu23 commented 2 years ago

Hi Borna @BobMcDear, thank you for the information! I tried a pure Linux machine on a Vocarem lab. It works just as you described. Also, I was no longer forced to reduce the batch size. With bs=100000, n_in=50, n_epochs=100, this is what I got:

(screenshots of the CPU and GPU training times)

By computing the 100000 training samples in parallel, we got more than a 10x acceleration in training time!

I was using that WSL2 (Windows Subsystem for Linux 2) crap... According to https://forums.developer.nvidia.com/t/cuda-basic-tutorial-segmentation-fault-in-wsl2-ubuntu/161456, WSL2 might not be stable at all at this moment, though I already seem to be on the 21H2 build.

Edition: Windows 10 Home
Version: 21H2
Installed on: 2022/2/23
OS build: 19044.1586
Experience: Windows Feature Experience Pack 120.2212.4170.0

Anyway, I should probably have spun up a Linux machine on AWS in the first place.

Thank you for creating such an amazing project. I enjoyed it and the whole debugging process!