cnr-isti-vclab / TagLab

A CNN based image segmentation tool oriented to marine data analysis
https://taglab.isti.cnr.it/
GNU General Public License v3.0

Training Auto-segmentation #67

Open Chris-Cooney opened 1 year ago

Chris-Cooney commented 1 year ago

Hi,

I can no longer train a network. The error is in the picture below. I apologise for all the issues I'm posting, but I am a PhD student and this program is meant to help me with some significant data analysis. Sometimes the training gets further, sometimes it stops sooner, and sometimes it doesn't work at all. I would really appreciate the help.

Thanks, Chris

[screenshot of the training error]

maxcorsini commented 1 year ago

Hi Chris,

this is a common error caused by CUDA. Let's re-launch the training and check whether the error happens again.

Best

Chris-Cooney commented 1 year ago

My best guess is that the data I’m trying to process is too large.

I have run it through multiple scenarios, limiting the pixel/mm, the area selected for export, and the number of epochs. A final test worked, but the resulting accuracy was poor.

I also ran a test with the small annotated example, exporting a training set and training on it. It ran without issue, with reasonable accuracy.

If the data I'm trying to process is too large for the training to work, then I'm not sure what else I can do.

maxcorsini commented 1 year ago

Hi Chris,

the amount of data influences the time needed to train the network, but not the training procedure itself, so it is quite strange that the training fails with a lot of data. Please send us an email containing the number of tiles in the train, validation, and test folders of your dataset when it works and when it does not.
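
For reference, here is a minimal sketch of how those tile counts could be gathered; the folder names below ("training", "validation", "test", each with an "images" subfolder) are an assumption about how the exported dataset is laid out, so adjust them to your export:

```python
# Hypothetical helper: count the tiles in each split of an exported training dataset.
# The folder layout is assumed; adapt the names to what TagLab actually exported.
from pathlib import Path

dataset = Path("path/to/exported_dataset")
for split in ("training", "validation", "test"):
    n_tiles = len(list((dataset / split / "images").glob("*.png")))
    print(f"{split}: {n_tiles} tiles")
```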

Another question: are you working remotely on a server? Are you sure that the server does not limit the processing time in some way?

Best

Chris-Cooney commented 1 year ago

I have moved all the processing and environments to the local drive (not on a network) to see how it goes. I now receive the error below:

[screenshot of the error]

I'm working on gathering the training datasets

Thanks, Chris

maxcorsini commented 1 year ago

Hi Chris,

this is a bug; the new version of PyTorch changed the name of the interpolation argument. I have just fixed it and made a new release. Update TagLab and everything should work.
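
The exact patch isn't shown here, but as an illustration of this kind of breaking change, recent torchvision releases replaced the old PIL integer constants accepted by the interpolation parameter with the InterpolationMode enum, so older calls have to be updated roughly along these lines (the transform and size are illustrative, not TagLab's actual code):

```python
# Illustrative only: how the torchvision interpolation API changed between versions.
from torchvision import transforms
from torchvision.transforms import InterpolationMode

# Old style (PIL integer constant), deprecated in newer torchvision releases:
#   resize = transforms.Resize((512, 512), interpolation=2)  # 2 == PIL.Image.BILINEAR
# New style:
resize = transforms.Resize((512, 512), interpolation=InterpolationMode.BILINEAR)
```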

Best

Chris-Cooney commented 1 year ago

It appears that any kind of remote connection (profile, login/user) was the cause of my problems. When I run the training directly from the computer's OS drive, it works. Thank you for your help! I am extremely grateful.

Carol-Symbiomics commented 1 year ago

Hi! I'm trying to train TagLab to speed up the segmentation work, but I'm not completely sure I'm on the right track (the tutorial is not very clear to me).

For testing purposes, this is what I've done:

  1. Do the manual segmentation and classification of some of my maps.
  2. Then export my working areas as a training dataset via File > Export > Export New Training DataSet (I adjust the pixel size according to the map scale).
  3. Then, while on my actual project (about 32 maps), I try to run "Train Your Network". At this step, for the "Dataset folder" I selected the folder where I previously exported the training datasets. I used "Nurseries" as the "Network name" and kept all the other parameters at their defaults. Unfortunately, at this stage I get a torch-related error (see below) and the GUI crashes.

[screenshot of the torch-related error]

I've checked the PyTorch version and it is the correct one for CUDA toolkit 11.6.

Could you please provide further guidance on how to overcome this problem, and let me know whether the training protocol I'm following is correct?

Your help will be greatly appreciated

Sincerely,

Carol

maxcorsini commented 1 year ago

The steps you followed are correct.
The problem is that you have CUDA installed, but the PyTorch version installed in your Conda environment is the CPU-only one.

This can happen: Python packages get installed, but not into the Conda environment currently in use.
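
A quick way to verify which build is active (a generic PyTorch check, not something TagLab-specific) is to run the following from the same environment used to launch TagLab:

```python
# Prints the installed PyTorch build and whether it can reach the GPU.
# A version string ending in "+cpu", or cuda.is_available() returning False,
# means the CPU-only build is installed in the active environment.
import torch

print(torch.__version__)          # e.g. "1.12.1+cu116" vs "1.12.1+cpu"
print(torch.version.cuda)         # CUDA toolkit the build targets (None for CPU-only builds)
print(torch.cuda.is_available())  # True only if a usable GPU is visible
```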

If you do not need to work inside Conda, the easiest way to fix this is to re-install TagLab's packages by running install.py outside Conda from a command shell, and then run TagLab outside the Conda environment. Before doing that, uninstall the current PyTorch version.

I hope this helps.

Carol-Symbiomics commented 1 year ago

Hi Massimiliano!

I did try to reinstall the dependencies outside conda to fix the problem with torch, but unfortunately the installation somehow wasn't successful. When I try to run TagLab, I get a numpy error (which has already been reported by another user). I tried to fix it by reinstalling numpy and setuptools, without success.

I remember that, when working inside conda, I was only able to fix the problem by updating it ("conda update all"), but I do not know what else to do when running from the command prompt. Any other ideas?

Carol-Symbiomics commented 1 year ago

I have a last-minute update! I was able to install the GPU version of PyTorch using conda. I tried again to train on the exported dataset, but now the problem is that CUDA runs out of memory. Can you please tell me how I can tackle this problem?

return F.conv2d(input, weight, bias, self.stride,

RuntimeError: CUDA out of memory. Tried to allocate 36.00 MiB (GPU 0; 4.00 GiB total capacity; 3.28 GiB already allocated; 0 bytes free; 3.47 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_C


maxcorsini commented 1 year ago

Hi Carol,

the problem here is that the neural network runs out of GPU memory during training. There are only two options: reduce the batch size before launching Train Your Network, or use a computer equipped with a graphics card that has more GPU RAM (e.g. 8 GB).

Hope this helps!
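
As a side note (a generic PyTorch sketch based on the hint in the error message, not a TagLab setting), the fragmentation suggestion refers to the CUDA caching allocator, which can be configured through an environment variable before TagLab is launched, and the free GPU memory can be queried directly:

```python
# Generic PyTorch sketch, not a TagLab-specific fix: cap the size of cached blocks
# to reduce fragmentation on small (e.g. 4 GB) GPUs, and report free GPU memory.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # set before any CUDA allocation

import torch

if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()  # bytes free / total on the current device
    print(f"GPU memory: {free / 1024**2:.0f} MiB free of {total / 1024**2:.0f} MiB")
```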