Closed by mikelee-dev 3 weeks ago
It appears that this is triggered by `self.encode_image(image)`, which sometimes returns a tensor of NaNs, although `image` itself does not contain NaNs.
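One way to confirm this observation is to check both the input and the encoder output for non-finite values at the call site. The following is a minimal sketch, not code from the repository: `check_encoder_output` and its `encode_fn` argument are hypothetical names, and `encode_fn` stands in for whatever callable wraps `self.encode_image`.

```python
import torch


def check_encoder_output(encode_fn, image):
    """Run encode_fn(image) and raise if NaN/Inf appears on either side.

    encode_fn is a hypothetical stand-in for the model's encode_image method.
    """
    # Rule out a bad input first, so a failure below points at the encoder.
    if not torch.isfinite(image).all():
        raise ValueError("input image already contains NaN/Inf values")

    features = encode_fn(image)

    # Count non-finite entries in the encoder output.
    bad = ~torch.isfinite(features)
    if bad.any():
        raise FloatingPointError(
            f"encoder returned {int(bad.sum())} non-finite values "
            f"out of {features.numel()}"
        )
    return features
```

If the input passes the check and the output does not, the NaNs are produced inside the encoder (for example by weights that have already become NaN) rather than by the data.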
That's strange, I didn't encounter that problem. You could check whether ShareGPT4V was downloaded completely and rerun the training code. If it still occurs, you can wrap the failing call in a `try... except...` block to skip it, since it rarely happens.
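The `try... except...` suggestion could look roughly like the sketch below. It is an illustration under assumptions, not the repository's training loop: `run_steps`, `step_fn`, and `max_skips` are hypothetical names, and `step_fn` stands in for one forward/backward/optimizer step. It catches `RuntimeError`, which to my understanding is the base class of `torch._C._LinAlgError`; note that this is broad and would also catch unrelated runtime errors.

```python
def run_steps(batches, step_fn, max_skips=10):
    """Call step_fn on each batch, skipping batches that raise RuntimeError.

    step_fn is a hypothetical callable performing one training step and
    returning the loss. Skipped batches are recorded so they can be inspected.
    """
    losses = []
    skipped = []
    for index, batch in enumerate(batches):
        try:
            losses.append(step_fn(batch))
        except RuntimeError as err:
            # Record which batch failed and why instead of crashing the run.
            skipped.append((index, str(err)))
            # Too many failures means the problem is not rare: re-raise.
            if len(skipped) > max_skips:
                raise
    return losses, skipped
```

In a real loop you would probably also clear any partially accumulated gradients before moving on to the next batch. Skipping only hides the symptom, so it is worth logging the skipped indices and checking whether the same samples fail repeatedly.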
Hello, I apologize for the interruption. While fine-tuning on my dataset, I encountered the error `torch._C._LinAlgError: linalg.svd: The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated singular values (error code: 63)`. After some debugging, I found that this issue occurred after using the AdamW optimizer, resulting in the `image` tensor becoming NaN. However, switching to the SGD optimizer resolved the issue. I am not sure what caused this problem. If you have any insights, I would greatly appreciate your help!
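To narrow down whether the optimizer update is what introduces the NaNs, one option is to scan the model parameters right after each `optimizer.step()`. This is only a debugging sketch: `first_nonfinite_param` is a hypothetical helper, not part of the repository or of PyTorch.

```python
import torch


def first_nonfinite_param(model):
    """Return the name of the first parameter containing NaN/Inf, or None."""
    for name, param in model.named_parameters():
        if not torch.isfinite(param).all():
            return name
    return None
```

Calling this after every optimizer step and stopping at the first non-`None` result shows which step and which layer go bad first, which is more informative than the later `linalg.svd` failure.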
Does anyone else get a similar issue during training?