AAnoosheh / ToDayGAN

http://arxiv.org/abs/1809.09767
BSD 2-Clause "Simplified" License

About the training in the paper #3

Closed shamangary closed 5 years ago

shamangary commented 5 years ago

Hello. Good project. It's just that when I read the paper, some lines were very confusing. I hope you can give me some clarification.

In Section IV-A, there are some weird statements:

"As the problem is formulated below, the same Daytime images are available during both training and inference, so the same images are used for both, meanwhile the night images are independent in all stages."

If some images are used for both training and testing (inference), the experiment itself would be flawed. However, the same paragraph also tells me to look at Table I, which does not show any of these Daytime images.

Does this statement simply mean that synthetic Daytime images are generated from the night images by the network, so that daytime images exist for both training and testing? This is quite confusing.

Thanks

AAnoosheh commented 5 years ago

Hi Tsun-Yi, you are reading it correctly: some images are used for both training and testing.

We state in our problem definition from the very beginning that we are trying to match a night-time traversal to a known daytime traversal. This means we always have access to the daytime images, so we can use them to train. We could just as easily use other similar daytime images, but since we don't need to, why not get better, more direct results using the set of images we wish to match against anyway?

During inference, Daytime images are not fed into the network, since we only convert Night-to-Day (training the network requires the other direction as well, but we discard it afterwards). When I say Daytime is available during inference, I mean it is physically there, and we use those images to match against the synthetic Night-to-Day images.

We do, however, separate the nighttime images used for training and testing. So in short, we have Daytime, Night-Train, and Night-Test sets. We wish to translate Night-Test to be as similar as possible to Daytime (not just any Daytime, but that specific Daytime set). So we train the model on Daytime and Night-Train. We then run inference on Night-Test and compare the results with the raw Daytime images.
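
To make that concrete, here is a minimal, hypothetical sketch of the flow; all names (`load_images`, `train_translation_model`, `G_n2d`) are illustrative, not this repository's actual API:

```python
# Hypothetical sketch of the data split and evaluation flow described above.
from pathlib import Path

def load_images(folder: str) -> list:
    # Stand-in loader: just collects file paths.
    return sorted(Path(folder).glob("*.jpg"))

def train_translation_model(day, night):
    # Placeholder for GAN training. Both translation directions are
    # learned, but only the night->day generator is kept afterwards.
    def generator_n2d(night_image):
        return night_image  # identity stub; the real model translates
    return generator_n2d

day_refs    = load_images("daytime")      # reference traversal, fixed
night_train = load_images("night_train")  # used only to train the GAN
night_test  = load_images("night_test")   # held-out queries

G_n2d = train_translation_model(day_refs, night_train)

# Inference: only night images pass through the network. The raw daytime
# images (never translated) form the database that the synthetic
# night-to-day images are matched against.
fake_day = [G_n2d(img) for img in night_test]
```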

Hopefully this clears things up, Best, Asha

P.S. The paragraph actually refers the reader to Table I for dataset statistics and to Figure 4 for images.

shamangary commented 5 years ago

I see your statement now, but some confusion remains. Considering the paper claims a 250% improvement over the state-of-the-art, I assume this claim comes from Table V, which draws on the dataset paper of Sattler et al.: [CVPR18] Benchmarking 6DOF Outdoor Visual Localization in Changing Conditions.

However, I don't see the same training/testing procedure in that previous paper. The comparison is valid only if the same process is followed for training and testing; otherwise, it is not fair. Please verify this for me.

Besides, why don't you just split a "daytime-train corresponding to night-train" set from a "daytime-test corresponding to night-test" set, for clarity? I believe this would also be more convincing.

Regarding the daytime images: you should need the network's output features in order to retrieve similar images. Why do they not go through the network during inference?

Thanks

AAnoosheh commented 5 years ago

I'm not sure how to respond, because it sounds like you didn't read the paper properly. None of what you said makes sense to me, unfortunately. I'd suggest you simply re-read it, since I'd just be explaining the entire thing all over again. Thanks!

shamangary commented 5 years ago

Sorry you see it that way. I am simply asking a simple question. You compare against [19] in Table V with much better results; are the settings of both papers the same? I ask because I didn't see the same training/testing setting (splitting a night subset) in [19].

[19] [CVPR18] Benchmarking 6DOF Outdoor Visual Localization in Changing Conditions.

AAnoosheh commented 5 years ago

[19] is a benchmark paper. It is a compilation of other people's results, not their own method. The methods it contains don't do anything even similar to this; they use traditional computer-vision matching techniques without deep learning. This is the first paper to use image translation for such a task.

They use the same Daytime and Night-Test sets as I do. I made my own Night-Train set for training purposes. The first author of that paper is also my co-author here :)

shamangary commented 5 years ago

I know the co-author is Sattler and that's why I am asking here.

NetVLAD is a deep learning method, so I am still confused.

A method can only be recognized as better when all the competing methods are evaluated on the same training/testing setting.

It seems the competing methods (e.g. NetVLAD) do not have the same training setting here, since you categorized them as other retrieval methods?

AAnoosheh commented 5 years ago

Oh yes, sorry, I forgot NetVLAD is one of them. That entry uses an off-the-shelf pretrained network from its creators.

It cannot be trained on our dataset because we only have ~10k images, whereas they used millions. Traditional networks cannot be trained with so few images. Image translation, on the other hand, can.

Additionally they had labeled images, i.e. correspondences, and we do not. Traditional networks require labels to train. Image translation does not.

Hence the whole point of my method is to avoid requiring large amounts of labeled data. Also, I'm not making claims about whether my method is better than other general methods/architectures; just that it works better than using the official off-the-shelf NetVLAD.

shamangary commented 5 years ago

Since you accuse me of not reading your paper while it turns out you don't even know that your baselines are CNN-based and trainable, I will be more direct.

When you state that your method is 250% better than the state-of-the-art while the training datasets differ, you are already making a false statement.

If a method cannot be trained or fine-tuned on the same problem definition, you should not claim improved performance over it. Especially since NetVLAD is just a layer concept, not a fixed network structure, you can always retrain it by attaching a NetVLAD layer to a design similar to your method's, as in the sketch below.
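
For illustration, here is a minimal sketch of a NetVLAD layer in PyTorch, following Arandjelović et al. (CVPR 2016); the dimensions and names are my own illustration, not code from either paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    """NetVLAD aggregation layer: soft-assigns local CNN descriptors to
    learned clusters and accumulates the residuals."""
    def __init__(self, num_clusters: int = 64, dim: int = 512):
        super().__init__()
        # 1x1 conv produces per-descriptor soft-assignment scores.
        self.assign = nn.Conv2d(dim, num_clusters, kernel_size=1)
        # Learnable cluster centroids, one per cluster.
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, dim, H, W) feature map from any convolutional backbone.
        n, c = x.shape[:2]
        soft = F.softmax(self.assign(x).flatten(2), dim=1)   # (N, K, H*W)
        feats = x.flatten(2)                                 # (N, C, H*W)
        # Residual of every descriptor to every centroid, weighted by
        # its soft assignment, summed over all spatial locations.
        resid = feats.unsqueeze(1) - self.centroids.view(1, -1, c, 1)
        vlad = (soft.unsqueeze(2) * resid).sum(dim=-1)       # (N, K, C)
        vlad = F.normalize(vlad, dim=2)                      # intra-normalize
        return F.normalize(vlad.flatten(1), dim=1)           # (N, K*C)

# Appended to any backbone (e.g. a truncated VGG/ResNet), this could in
# principle be trained end-to-end on whatever data is available.
```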

I did not want to pick a fight here, but you don't seem to welcome the simple questions I asked. I suggest you rewrite the paper to make it clearer.

AAnoosheh commented 5 years ago

I see your point. It's not that I don't welcome the questions; I legitimately thought you had misunderstood the concept of the paper in your second post (i.e., you asked why day images don't go through the network during inference, which led me to think you were still misunderstanding the basics).

Though with my method, anyone can (and should) quickly train the model on their own dataset with only a few thousand images and no labels. So it's more of a procedure meant to be customized for each particular problem (location/camera/etc.). NetVLAD and the like are also concepts that anyone can attach to their own net, but they require massive labeled datasets, so one cannot expect re-training per use case. I therefore see them as things that should be generalizable from the get-go.

The word "NetVLAD" in the paper is not intended to refer to the concept of NetVLAD, as I now realize it may have caused some confusion. I'm simply saying that doing this - including re-training the network for each custom situation - is better than using the pretrained NetVLAD as is, which many people are actually doing to this day.

So whether someone thinks this is "fair" is a bit subjective, because the training difficulties themselves are not comparable. But if you still think so, I respect your opinion. Thanks for the feedback; I appreciate all forms of criticism. And no offense taken on my part.