amazon-science / siam-mot

SiamMOT: Siamese Multi-Object Tracking
Apache License 2.0

How should we change the siam-mot code and configuration to train multiple class tracking? #40

Open minhdov opened 2 years ago

mondrasovic commented 2 years ago

I suggest you have a look at these two issues in which the very same question is being discussed:

Leo63963 commented 2 years ago

Hi @mondrasovic, thanks for your time. I trained with 2 GPUs and a learning rate of 0.0025 with max_iter set to 50000, but got inferior performance on the MOT17 training set. Just kindly asking: can I reproduce the results with my limited resources? Plus, in the configuration configs/DLA_34_FPN_EMM_MOT17.yaml, I found that the TRAIN datasets consist of two entries: crowdhuman_train_fbox and crowdhuman_val_fbox. Should we also add the MOT17 annotations for training? Plus, what about VIDEO_CLIPS_PER_BATCH? Should it be changed accordingly? Thanks

mondrasovic commented 2 years ago

Well, training this architecture with limited resources is pretty problematic. I can say this from my experience, and trust me, my experience is not negligible. I did a substantial portion of my experiments as part of my Ph.D. in deep learning using this architecture.

So, not only did I receive additional validation from other people, but I also had to be more confident in my results, since it wasn't just my hobby. Regarding reproducibility, it seems impossible to achieve the very same results. You can get close, but no one I have discussed this issue with would consider their results to be within an acceptable range of the reported ones.

On top of all this, another researcher from the Netherlands and his team discussed this topic with me, and they couldn't reach the same performance level either. He eventually became an opponent of my dissertation thesis, just a fun fact.

As for the annotations, it is quite straightforward, I believe. Add the annotations you want your model to use. If you want to train solely on MOT17, then use just those. If you want more datasets, such as the aforementioned CrowdHuman, then add them as well.

This is what the configuration looks like in my case:

DATASETS:
  ROOT_DIR: "../../datasets"
  TRAIN: ("MOT17",)

Have a look here. I provide the important part of the code for clarity below.

dataset_maps['MOT17'] = [
    'MOT17',         # dataset folder (under DATASETS.ROOT_DIR)
    'anno.json',     # annotation file name
    'splits.json',   # splits file name
    'video'          # modality
]

This tells you that the dataset key, specifically MOT17 in this case, provides all the information you need to load the annotations.
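By analogy, registering a new dataset amounts to adding another entry with the same structure. A hypothetical example (the key, folder, and file names below are placeholders of my own, not something that exists in the repository):

dataset_maps['MY_DATASET'] = [  # hypothetical key of your choosing
    'my_dataset',    # folder under DATASETS.ROOT_DIR
    'anno.json',     # your annotation file
    'splits.json',   # your splits file
    'video'          # modality; use whatever value the loader expects for your data
]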

Later in the code, as demonstrated here, you can find that these values are used as follows:

dataset_folder, anno_file, split_file, modality = dataset_maps[dataset_key]

It speaks for itself. The value consists of $4$ elements, which represent the dataset folder, the annotation file name, the splits file name, and the modality, respectively.

And as far as VIDEO_CLIPS_PER_BATCH is concerned, this is up to your hardware capabilities. It tells you how many videos to consider for each batch. For example, if you use

SOLVER:
  VIDEO_CLIPS_PER_BATCH: 3

then the effective batch size is equal to $6 = 3 \times 2$, because you select $3$ videos and for each video, you need a pair of images.

Batch size is actually the main culprit behind the difficulty of reliably reproducing the results. I experimented with gradient accumulation; it did help a little, but not enough. So, if you are a mere mortal, you will probably have to stick to a single-digit batch size, and that is as good as it gets unless you try some powerful hardware.
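For illustration, the gradient accumulation I mentioned boils down to the following. This is a minimal, generic PyTorch sketch with a toy model and random data, not the actual SiamMOT training loop:

import torch
from torch import nn

model = nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.0025)
loss_fn = nn.MSELoss()
accum_steps = 4  # emulate a 4x larger effective batch size

optimizer.zero_grad()
for step in range(100):
    x, y = torch.randn(3, 8), torch.randn(3, 1)  # stand-in for a small batch
    loss = loss_fn(model(x), y) / accum_steps    # scale so accumulated gradients average out
    loss.backward()                              # gradients add up across iterations
    if (step + 1) % accum_steps == 0:
        optimizer.step()                         # one update per accum_steps batches
        optimizer.zero_grad()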

Leo63963 commented 2 years ago

Hi @mondrasovic, thank you so much for your reply, I really appreciate it. It is great to have someone to discuss this with. Just some follow-up questions; I am sure your experience will be of great help.

  1. Since you have the same hardware as me (two RTX 2080 Ti GPUs), could you please tell me how close you got on MOT17 with the provided code (as far as I know, the models provided by the authors obtain MOTA=64.1 and IDF1=63.6 on the MOT17 training set under the public-detection setting)? Have you tried deeper networks, such as DLA-169 and DLA-169-DCN?
  2. As for the reproduction: in the corresponding published paper, the authors say "we train our model for 25K and 50K iterations, for Crowdhuman and COCO datasets respectively", and they also provide some pre-trained Faster R-CNN models trained on COCO only here. Therefore, I suppose that if we first train on CrowdHuman for 25k iterations with the crowdhuman_train_fbox and crowdhuman_val_fbox annotations you mentioned (maybe for more iterations given the lower batch size), and then continue training on MOT17 (with the annotations provided by the authors) for 50k iterations, we may obtain better results. Have you tried this?
  3. Just kindly asking: since the code outputs the results in JSON format, how can they be converted to TXT for submission to the MOTChallenge benchmark? Thanks.
mondrasovic commented 2 years ago

My use case was aimed at vehicle tracking, not people, so it did not bother me that much that my results were slightly inferior. Furthermore, my experiments had to produce notable improvements in relative terms, not absolute ones. My main objective actually involved a completely different dataset, but I still played with MOT17 a lot, as you can imagine, since I had to quantify the effect of my modifications on the underlying SiamMOT model in as many ways as possible.

  1. As for the MOT17 challenge, I was around 60 for both MOTA and IDF1, though I cannot say precisely off the top of my head. I did not try deeper networks, since the batch size is already quite small even with DLA-34. I can tell you that the model was usable; it did not break down or anything, so even if you deployed it in the real world, it would not be a complete failure, but I did not reach those numbers at all.
  2. I did play with it, and I do not remember any significant results. However, what I do recall is how unstable training on detection datasets is. You know that tailor-made data augmentation techniques are adopted to turn static images into pairs of images suitable for training the Siamese tracker submodule; I felt that the model realized something was kind of artificial, and training was pretty unstable. Getting the hyperparameters right took me a long time. A good rule of thumb is to be very conservative: since the batch size is quite low, the learning rate has to be adjusted accordingly. Moreover, considering how much my hardware differed from the authors', besides the overt differences in hyperparameter settings, I believe we can no longer rely on the paper in terms of training iterations. You can estimate whether you are somewhere near those numbers, but it is not that simple.
  3. Well, you have a *.json in a specific format, and you need a *.txt out of it, once again in a specific format. So, I implemented a small utility to do exactly that. Here is the source code for it - json2mot.py, uploaded in *.txt format since GitHub would not allow *.py files. (Note: I might have made some modifications to the script since then, so take it with a grain of salt; I am not sure whether it still works accurately.) A rough sketch of the idea is shown below.
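For reference, the gist of such a conversion looks roughly like this. This is only a sketch, not the actual json2mot.py; the input schema (a list of records with frame, track_id, bbox, and score fields) is my assumption, so adapt the field names to whatever your results file contains:

import json

def json_to_mot(json_path, txt_path):
    # Assumed input: [{"frame": int, "track_id": int,
    #                  "bbox": [x, y, w, h], "score": float}, ...]
    with open(json_path) as f:
        results = json.load(f)
    lines = []
    for r in results:
        x, y, w, h = r["bbox"]
        # MOTChallenge line: frame, id, bb_left, bb_top, bb_width,
        # bb_height, conf, x, y, z (the last three are -1 in 2D tracking)
        lines.append(f'{r["frame"]},{r["track_id"]},{x:.2f},{y:.2f},'
                     f'{w:.2f},{h:.2f},{r["score"]:.2f},-1,-1,-1')
    with open(txt_path, 'w') as f:
        f.write('\n'.join(lines) + '\n')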
Leo63963 commented 2 years ago

Hi @mondrasovic Thank you so much for your reply, I really appreciate it. Yeah, I am working on tracking pedestrians at present; vehicle tracking is my future work. I have seen the training instabilities you mentioned in my own experiments, and I am not sure whether I can make any significant improvement on top of that. But I will keep working on it and will keep you updated on my results. Reproduction is the first step, and then I will try to make some improvements based on this network; hopefully it will work. Thanks for sharing the source code and for your help.

noreenanwar commented 2 years ago

Hi, how are you trying to improve this network?

Leo63963 commented 2 years ago

> Hi, how are you trying to improve this network?

I have no clue.