RF5 / danbooru-pretrained

Pretrained pytorch models for the Danbooru2018 dataset

Training Issues #1

Closed · Dakini closed this issue 4 years ago

Dakini commented 5 years ago

Hi,

I am currently trying to train a model using Danbooru and your 6000_tags csv file. However, I am getting some random runtime errors: `DataLoader worker (pid 17154) is killed by signal: Segmentation fault.`

Have you had these issues before?

RF5 commented 4 years ago

Hi, sorry for the late reply.

Yes, I ran into this error once or twice, and it mainly came down to the size of the label file. The 6000_tag_labels.csv is over 1GB, so if you aren't careful about how you load it into your training code it can cause problems. For example, if you are using the same fastai method that I used for training, you will likely need quite a lot of RAM (>20GB) to load the data.
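As a rough sanity check before handing the csv to fastai, something like the sketch below (pandas only) can estimate how much RAM the full dataframe will take. The filename and the fact that the first column is a cheap one to count rows with are assumptions; adjust them to match the actual 6000_tag_labels.csv.

```python
# Minimal sketch: estimate the in-memory size of the label csv before loading
# it for training. Filename and column layout are assumptions.
import pandas as pd

# Read a small sample so the estimate itself stays cheap.
sample = pd.read_csv("6000_tag_labels.csv", nrows=50_000)
per_row_bytes = sample.memory_usage(deep=True).sum() / len(sample)

# Count total rows without loading every column at once.
n_rows = sum(len(chunk) for chunk in
             pd.read_csv("6000_tag_labels.csv", usecols=[0], chunksize=500_000))

print(f"~{per_row_bytes * n_rows / 1e9:.1f} GB estimated for the full dataframe "
      f"({n_rows} rows, ~{per_row_bytes:.0f} bytes/row)")
```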

This is because (after many crashes on my side during training) I realized that fastai internally adds all the labels to a Python set (to find the unique labels), and doing this for a 1.1GB csv file with 6000 unique tags appears to be very memory intensive. Luckily, once this initial loading is done, the RAM usage drops quite a bit: my RAM usage spiked very high when first loading the csv into fastai, and then settled lower and stayed stable during training.
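To see what that expensive step amounts to, here is a streaming version of it: build the set of unique tags chunk by chunk instead of materialising the whole csv at once. The column name `tags` and the space delimiter are assumptions, so change them to match the actual label file.

```python
# Sketch: collect the unique tags from the label csv in chunks, keeping peak
# memory bounded. Column name "tags" and the space delimiter are assumptions.
import pandas as pd

unique_tags = set()
for chunk in pd.read_csv("6000_tag_labels.csv", usecols=["tags"], chunksize=200_000):
    for tag_string in chunk["tags"].dropna():
        unique_tags.update(tag_string.split(" "))

print(f"{len(unique_tags)} unique tags found")  # should be around 6000
```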

In short: watch your RAM usage (not GPU memory) as you load the csv into your training program; running out of RAM may be what is causing the random segfault errors, in which case my suggestion is to get more RAM. However, it could be something entirely different -- the RAM issue is just what caused my segfault errors.
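One way to do that watching programmatically is a small helper built on psutil (not part of the original thread), logging process and system RAM around the csv-loading step so you can tell a genuine out-of-memory from some other segfault cause:

```python
# Sketch: log process RSS and available system RAM around the expensive
# csv-loading / databunch-building step. Uses psutil.
import os
import psutil

def log_ram(tag: str) -> None:
    proc_gb = psutil.Process(os.getpid()).memory_info().rss / 1e9
    avail_gb = psutil.virtual_memory().available / 1e9
    print(f"[{tag}] process RSS: {proc_gb:.1f} GB, system available: {avail_gb:.1f} GB")

log_ram("before loading labels")
# ... load the csv / build the training data here ...
log_ram("after loading labels")
```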