thomasf1 closed this issue 4 months ago.
Hi @thomasf1, I just tried the code locally and didn't have any issues. Could you add some additional logging to help debug? For example, does the code successfully return from this first function call?
@MidoAssran Thanks for the pointer. I've added a bunch of debugging and got it to work with a very small dataset. Doing some more work to see where it gets stuck...
What's the recommendation for the validation dataset in terms of split? Also, is the validation part unsupervised too, or does it require class ids in the dataset?
One other observation: the data loading seems to be repeated for each epoch, leaving the GPU idle for quite a while. This might be an area that could be improved considerably.
Did you graph the GPU usage? This might be mitigated in the multi-machine training code.
Hi @thomasf1 yes since you are running the evaluation code (training an attentive probe on top of the frozen encoder), the validation part does need a class_id in the dataset index file, as this is a supervised learning problem.
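To make the index-file requirement concrete, here is a minimal sketch of writing a validation index that pairs each video path with an integer class_id. The space-delimited `path class_id` layout, file name, and paths are assumptions for illustration; check the repo's data-loading code for the exact format it expects.

```python
import csv
import os
import tempfile

# Hypothetical entries: video paths and class_ids are made up.
rows = [
    ("/data/videos/clip_0001.mp4", 0),
    ("/data/videos/clip_0002.mp4", 3),
]

# Write one "path class_id" pair per line (assumed layout).
path = os.path.join(tempfile.mkdtemp(), "val_index.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f, delimiter=" ")
    writer.writerows(rows)

print(open(path).read())
```

For unsupervised pretraining a dummy label (e.g. 0) is enough, but for the attentive-probe evaluation the class_id column must hold the real labels, since the probe is trained with supervision.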
As for the efficiency: yes, it's true that the data loading is done in each epoch. However, since the evals only involve training a small probe and run quickly relative to pretraining, we didn't try optimizing this further. If you want to speed it up, one option would be to compute the embeddings of the videos in your dataset once, and then just train a probe on top of those pre-extracted features.
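The pre-extraction idea above can be sketched as follows. This is a toy illustration, not the repo's eval code: `encoder` stands in for the frozen V-JEPA backbone, and the random tensors stand in for preprocessed clips and their class_ids.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-ins (assumptions): a frozen "encoder" and fake video features/labels.
encoder = nn.Linear(32, 16)           # placeholder for the frozen backbone
encoder.requires_grad_(False)
videos = torch.randn(100, 32)         # placeholder for preprocessed clips
labels = torch.randint(0, 4, (100,))  # class_ids from the index file

# Step 1: compute embeddings once, so decoding/encoding the videos is not
# repeated on every probe-training epoch.
with torch.no_grad():
    features = encoder(videos)

# Step 2: train a small probe on the cached features.
probe = nn.Linear(16, 4)
opt = torch.optim.SGD(probe.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
for epoch in range(10):
    opt.zero_grad()
    loss = loss_fn(probe(features), labels)
    loss.backward()
    opt.step()
```

With the features cached (to disk in a real setup), each probe epoch is just a cheap pass over small tensors, which sidesteps the per-epoch data-loading cost entirely.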
Since you seem to have already gotten the eval code working, I'm going to close this task for now, but feel free to comment if you have any other questions!
Hello @thomasf1, I got the same error. Could you mention how you resolved this? Thanks.
@thomasf1 I have the same error, where the model crashes after the first epoch while fine-tuning the attentive probe on my custom dataset. I am running the task on a single-GPU machine, and the RAM utilisation of the model exceeds 30 GB. Is there a potential solution to this problem?
I'm trying to get jepa to work on Colab, but for some reason it ends/crashes after completing the first epoch. The output folder is basically empty (one empty CSV file).
The pretrained model used is vitl16.pth.tar (https://dl.fbaipublicfiles.com/jepa/vitl16/vitl16.pth.tar).
The dataset used is a bunch of mp4 videos (no class_labels; all set to 0).
Could you give me some pointers on how to possibly debug this?
Environment: Colab Pro, tried it with the A100 and V100.
Start of the training with:
Output: