aofrancani / TSformer-VO

Implementation of the paper "Transformer-based model for monocular visual odometry: a video understanding approach".
https://arxiv.org/abs/2305.06121
MIT License

window_size #15

Closed 110sha closed 7 months ago

110sha commented 7 months ago

Hello, thank you very much for your replies, and I apologize for bothering you again. I have a question: should the window_size in the three places shown in the screenshot (in train.py and kitti.py) all be the same? And is my understanding correct that window_size=2 corresponds to VO-1, window_size=3 to VO-2, and window_size=4 to VO-3?

110sha commented 7 months ago

[Screenshot: window_size settings in train.py and kitti.py (QQ截图20240418201511)]

aofrancani commented 7 months ago

Yes, they are the same and you got it correctly.

In "kitti.py" you read the data with "window_size" frames in each iteration (https://github.com/aofrancani/TSformer-VO/blob/main/train.py#L223), and the model outputs "window_size - 1" pose estimations, because there is one pose estimate for each pair of 2 consecutive frames.
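To make that relationship concrete, here is a minimal sketch (the helper and variant mapping are illustrative, taken from this thread, not the repository's actual code):

```python
# Hedged sketch: a clip of `window_size` frames yields one relative pose
# per pair of consecutive frames, i.e. `window_size - 1` estimations.

def num_pose_estimations(window_size: int) -> int:
    """One pose per consecutive frame pair within the clip."""
    return window_size - 1

# Mapping discussed in this thread: window_size -> model variant.
VARIANT = {2: "TSformer-VO-1", 3: "TSformer-VO-2", 4: "TSformer-VO-3"}

for ws, name in VARIANT.items():
    print(f"{name}: window_size={ws} -> {num_pose_estimations(ws)} pose(s)")
```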

The "window_size=3" you see in "kitti.py" is just a default value, used only if you don't pass the parameter explicitly when reading the data...

110sha commented 7 months ago

Thank you very much for your reply. So if it is not passed when reading the data, the "window_size=3" in "kitti.py" is only a default value and we do not need to worry about it; changing window_size to 2, 3, or 4 in train.py gives VO-1, VO-2, and VO-3, respectively. But when I only changed (window_size: 2) to (window_size: 3) in train.py, the resulting error was quite large. So may I ask whether any other parameters need to be modified accordingly when changing the window_size value in train.py?

aofrancani commented 7 months ago

No, the window_size parameter is independent of the others. What I used to do was set the overlap to "window_size - 1", so that the larger the window, the more training data I got (with redundancy across batches, because from one video clip to the next only one frame changes). So the overlap between the windowed data might be the other parameter you are looking for...
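The overlap idea can be sketched as follows (a hypothetical helper assuming stride = window_size - overlap; not the repository's actual implementation):

```python
# Hedged sketch of clip generation. With overlap = window_size - 1
# (stride 1), consecutive clips share all but one frame, which
# multiplies the amount of (redundant) training clips.

def make_clips(num_frames: int, window_size: int, overlap: int):
    """Return lists of frame indices, one list per windowed clip."""
    stride = window_size - overlap
    return [list(range(start, start + window_size))
            for start in range(0, num_frames - window_size + 1, stride)]

# 6 frames, window of 3, overlap 2 -> 4 clips sharing 2 frames each.
print(make_clips(num_frames=6, window_size=3, overlap=2))
# With overlap 0 the same frames give only 2 disjoint clips.
print(make_clips(num_frames=6, window_size=3, overlap=0))
```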

110sha commented 7 months ago

I'm very sorry, I reread the article and still don't quite understand how to implement this in the code. Initially train.py has window_size=2 and kitti.py has window_size=3, which would mean an overlap of 2, representing VO-1? Then I ran two experiments according to my understanding: 1: train.py window_size=3 and kitti.py window_size=4, i.e. an overlap of 3, for VO-2; 2: train.py window_size=4 and kitti.py window_size=5, i.e. an overlap of 4, for VO-3. But the result is still not right. Has my understanding gone wrong again? I hope to receive your guidance again. Thank you.

aofrancani commented 7 months ago

I'm sorry, I didn't get it... What do you mean by "the result is not right": the expected size of your windowed data, or the final evaluation metrics during/after your training?

110sha commented 7 months ago

I can reproduce your code: if no changes are made, the final error is similar to that in your paper. But how should I change it to reproduce TSformer-VO-2 and TSformer-VO-3? I made the changes according to the following idea, and the final error was significant:

1. train.py: window_size=3, kitti.py: window_size=4, i.e. an overlap of 3, for VO-2;
2. train.py: window_size=4, kitti.py: window_size=5, i.e. an overlap of 4, for VO-3.

Simply put, I don't understand how to modify the code. Where do I train VO-2 and VO-3? Thank you.

aofrancani commented 7 months ago

Ok, so you mean the final error after training everything...

So, the only thing you should edit is "train.py"; you don't need to worry about "kitti.py", because when we read the data we pass "args["window_size"]" as input to the dataloader, which overrides the default.
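As a hedged illustration of why only train.py matters (the names `args` and `build_dataset` are assumptions standing in for the repository's config dict and dataset constructor, not the exact code):

```python
# Hedged sketch: the training script passes its configured window_size
# to the dataset, so the default in the dataset module never applies.

args = {
    "window_size": 3,  # 2 -> VO-1, 3 -> VO-2, 4 -> VO-3 (per this thread)
    "overlap": 2,      # commonly window_size - 1, as the author suggests
}

def build_dataset(window_size: int = 3, overlap: int = 0) -> dict:
    """Stand-in for the kitti.py dataset constructor."""
    return {"window_size": window_size, "overlap": overlap}

# The explicit argument overrides the default of 3.
ds = build_dataset(window_size=args["window_size"], overlap=args["overlap"])
print(ds)
```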

I hope this helps!