astramind-ai / Mixture-of-depths

Unofficial implementation for the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models"

Training question on LLaMA-Factory #2

Closed · putizi-super closed this 5 months ago

putizi-super commented 5 months ago

Hello, this is great work. But I keep running into gradient explosions when inserting it into LLaMA-Factory training; do you know why? I modified the code as follows: [image]

These places may need to be modified; otherwise training fails because the attention_mask length no longer matches the routed tokens (see the sketch below).
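
For readers hitting the same mismatch: a minimal sketch of why the lengths can diverge in a Mixture-of-Depths block, assuming a top-k router. `route_tokens` and its signature are hypothetical illustrations, not this repo's API.

```python
import torch

def route_tokens(hidden_states, attention_mask, router_logits, capacity):
    # hidden_states: (batch, seq_len, dim)
    # attention_mask, router_logits: (batch, seq_len)
    # Pick the top-`capacity` tokens per sequence, preserving original order.
    idx = torch.topk(router_logits, k=capacity, dim=1).indices.sort(dim=1).values
    routed_hidden = torch.gather(
        hidden_states, 1, idx.unsqueeze(-1).expand(-1, -1, hidden_states.size(-1))
    )
    # Without this gather, the mask keeps length seq_len while the routed
    # tokens have length `capacity`, which is the mismatch described above.
    routed_mask = torch.gather(attention_mask, 1, idx)
    return routed_hidden, routed_mask, idx
```

Note that the paper also scales the routed block outputs by the router weights so the router receives gradients; omitting that scaling is one plausible source of training instability.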

I also added this to the main training code: [image]. Have you trained successfully? What configuration did you use?

The terminal output is: [image]
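
Separate from the screenshots above, a generic mitigation worth trying for gradient explosions is gradient clipping; a minimal sketch assuming a plain PyTorch loop (with the Hugging Face Trainer, the equivalent knob is `TrainingArguments(max_grad_norm=1.0)`):

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16)                       # stand-in for the MoD-patched model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

loss = model(torch.randn(8, 16)).pow(2).mean()  # dummy loss for illustration
loss.backward()

# Clip the global gradient norm before stepping the optimizer.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```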

putizi-super commented 5 months ago

I solved it, but I am still confused.