RavinSG closed this issue 5 years ago
Hello, I'm currently training the chatbot on about 2M pairs. I was wondering whether anyone has compared the bot's output against how wide or deep the network is. I went through pretty much all the posts on this repo, and almost all of them used only 2 layers. Is that because people have concluded that 2 layers is optimal, or has no one experimented with it yet?
It seems like more layers don't help much, but I didn't run a systematic comparison. In practice it's more about fitting the model into VRAM during training: with more layers you have to use a smaller vocabulary (fewer tokens) or a lower batch size.
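To make the tradeoff concrete, here's a minimal sketch (assuming a GRU-based encoder/decoder like most PyTorch chatbot examples; the vocabulary size, hidden size, and layer counts below are illustrative, not this repo's actual defaults). It just counts parameters, which, together with batch size and activations, is what drives VRAM usage:

```python
import torch.nn as nn

def param_count(vocab_size, hidden_size, n_layers):
    """Rough parameter count for a GRU seq2seq with tied hidden sizes."""
    embedding = nn.Embedding(vocab_size, hidden_size)
    encoder = nn.GRU(hidden_size, hidden_size, n_layers)
    decoder = nn.GRU(hidden_size, hidden_size, n_layers)
    out = nn.Linear(hidden_size, vocab_size)  # projection back to the vocabulary
    modules = [embedding, encoder, decoder, out]
    return sum(p.numel() for m in modules for p in m.parameters())

# Each extra layer pair adds a fixed cost; a larger vocabulary inflates
# the embedding and output projection instead. Depth and vocab compete
# for the same memory budget.
for n_layers in (2, 4, 6):
    n = param_count(vocab_size=20000, hidden_size=512, n_layers=n_layers)
    print(f"{n_layers} layers: {n:,} params")
```

Running this shows why people cap depth at 2: past that, you're usually better off spending the memory on a bigger vocabulary or batch size.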