minhdov opened this issue 2 years ago
Hi @mondrasovic, thanks for your time. I trained with 2 GPUs, a learning rate of 0.0025, and `max_iter` of 50000, but got inferior performance on the MOT17 training set. May I kindly ask: can I reproduce the results with my limited resources? Also, in the configuration `configs/DLA_34_FPN_EMM_MOT17.yaml`, I found that the TRAIN datasets have two entries: `crowdhuman_train_fbox` and `crowdhuman_val_fbox`. Should we also add the MOT17 annotations for training? And what about `VIDEO_CLIPS_PER_BATCH`? Should it be changed accordingly? Thanks
Well, training this architecture with limited resources is pretty problematic. I can say this from experience, and trust me, my experience is not negligible: I did a substantial portion of the experiments for my Ph.D. in deep learning using this architecture.
So, not only did I receive additional validation from other people, but I also had to be confident in my results, since it wasn't just a hobby. Regarding reproducibility, it seems impossible to achieve the very same results. You can get close, but no one I have discussed this issue with would consider the gap to be within an acceptable range.
On top of all this, another researcher and his team from the Netherlands discussed this topic with me, and they couldn't reach the same performance level either. He eventually became an opponent of my dissertation thesis; just a fun fact.
As for the annotations, it is quite straightforward, I believe. Add the annotations you want your model to use. If you want to train solely on MOT17, then use just those. If you want more datasets, such as the mentioned CrowdHuman, then do so.
This is what the configuration looks like in my case:

```yaml
DATASETS:
  ROOT_DIR: "../../datasets"
  TRAIN: ("MOT17",)
```
Have a look here. I provide the important part of the code for clarity below.
```python
dataset_maps['MOT17'] = [
    'MOT17',        # dataset folder
    'anno.json',    # annotation file name
    'splits.json',  # splits file name
    'video',        # modality
]
```
This tells you that the dataset key, `MOT17` in this case, provides all the information you need to load the annotations.
Later in the code, as demonstrated here, you can find that these values are used as follows:

```python
dataset_folder, anno_file, split_file, modality = dataset_maps[dataset_key]
```
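Consequently, if you ever need to register an additional dataset, a minimal sketch would look like this (the key and file names are hypothetical; they must match your actual directory layout under `DATASETS.ROOT_DIR`):

```python
# Hypothetical registration of a custom dataset; the list order is
# (dataset folder, annotation file name, splits file name, modality).
dataset_maps['my_dataset'] = [
    'my_dataset',   # folder under DATASETS.ROOT_DIR
    'anno.json',    # annotation file
    'splits.json',  # train/val split definitions
    'video',        # modality
]
```

After that, referencing `"my_dataset"` in `DATASETS.TRAIN` should be enough for the loader to pick it up.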
It speaks for itself. The value is a $4$-element tuple whose items represent the dataset folder, annotation file name, splits file name, and modality, respectively.
And as far as `VIDEO_CLIPS_PER_BATCH` is concerned, this is up to your hardware capabilities. It tells you how many videos to consider for each batch. For example, if you use
```yaml
SOLVER:
  VIDEO_CLIPS_PER_BATCH: 3
```
then the effective batch size is equal to $6 = 3 \times 2$, because you select $3$ videos and for each video, you need a pair of images.
Batch size is actually the culprit behind the difficulty of reliably reproducing the results. I experimented with gradient accumulation (a sketch of the idea is below), but to no avail, although it did help a little. So, if you are a mere mortal, you will probably have to stick to a single-digit batch size, and that is as good as it gets unless you can get your hands on some powerful hardware.
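For the record, this is roughly what the gradient-accumulation idea looks like; a generic PyTorch sketch, not the repository's actual training loop (`model`, `optimizer`, and `data_loader` are assumed to exist):

```python
import torch

def train_with_accumulation(model, optimizer, data_loader, accum_steps=4):
    """Generic gradient-accumulation loop (a sketch, not the repo's code).

    Emulates an effective batch `accum_steps` times larger than what fits
    in GPU memory by summing scaled gradients before each optimizer step.
    """
    model.train()
    optimizer.zero_grad()
    for step, (frames, targets) in enumerate(data_loader):
        loss = model(frames, targets)    # assumed to return a scalar loss
        (loss / accum_steps).backward()  # scale so gradients average correctly
        if (step + 1) % accum_steps == 0:
            optimizer.step()             # one update per accumulated "batch"
            optimizer.zero_grad()
```

Keep in mind that accumulation only emulates the gradient statistics of a larger batch; batch-dependent layers such as BatchNorm still see the small per-step batch, which may be why it helped only a little.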
Hi @mondrasovic, thank you so much for your reply, I really appreciate it. It is great to have someone to discuss this with. Just some follow-up questions; I am sure your experience will be of great help.
My use case was aimed at vehicle tracking, not people, so it did not bother me that much that my results were slightly inferior. Furthermore, my experiments had to produce notable improvements in relative terms, not absolute. My main objective was actually a completely different dataset, but still, I played with MOT17 a lot, as you can imagine. I had to quantify the effect of my modifications on the underlying SiamMOT model in as many ways as possible.
I reached around MOTA 60 as well as a comparable IDF1, but I am not sure right off the top of my head. And I did not try deeper networks, since the batch size is already quite small even with DLA-34. I can tell that the model was usable; it did not break down or anything. So even if you utilized it in the real world, it wouldn't be a complete failure, but I did not reach those numbers at all.

You have a `*.json` in a specific format, and you need a `*.txt` out of it, once again, in a specific format. So, I implemented a small utility to do exactly that. Here is the source code for it: json2mot.py, uploaded in `*.txt` format since GitHub would not allow `*.py` files. (Note: I might have made some modifications to the script since then, so take it with a grain of salt; I am not sure whether it still works accurately.)
Hi @mondrasovic, thank you so much for your reply. I really appreciate it. Yeah, I am working on tracking pedestrians at present, and vehicle tracking is my future work. I have seen the training instabilities you mentioned in my own experiments. I am not sure if I can make any significant improvement on top of that, but I will still work on it and keep you updated on my results. Reproduction is the first step, and I will try to make some improvements based on this network; hopefully it will work. Thanks for sharing the source code and for the help.
Hi, how are you trying to improve this network?
I have no clue.
I suggest you have a look at these two issues in which the very same question is being discussed: