Wuziyi616 / multi_part_assembly

Code accompanying the NeurIPS 2022 dataset paper Breaking Bad: a shape assembly benchmark implemented in PyTorch
https://breaking-bad-dataset.github.io/
MIT License

Some clarification questions #3

Closed ttsesm closed 1 year ago

ttsesm commented 1 year ago

Hi @Wuziyi616,

Thanks for sharing a clean version of the Breaking Bad benchmark code; it looks quite nice and well organized. I have some questions, though, that I would like to clarify.

  1. If I understand correctly, this config file for training tries to replicate the NSM work on the everyday subset of the Breaking Bad dataset.
  2. In the NSM work, though, they also use an SDF module, at least as described in the paper, since it is not present anywhere in the available code. As I understand it, you are not making use of it here either.
  3. In the NSM work, when they extract the transformation matrix error, they also consider the point-to-point distance error. Are you considering this somewhere in the code? I was trying to figure out whether and where you might be applying it, but I couldn't find anything clear.
  4. Can you elaborate a bit more on the geometric loss? It is not clear what you mean by saying that we do not need to match the GT for loss computation (maybe this is also related to the previous question). Also, in the code you are creating a loss dictionary, but following the code I am not quite sure how it is structured and how the final loss is computed.
  5. It is also not clear how you define semantic and geometric assembly, so it would be nice if you could elaborate on this as well.

For now I am puzzled by the above, but as I get deeper into the code I will probably come up with some more questions.

Thanks.

Wuziyi616 commented 1 year ago

Hi, thanks for your interest in our work. Here are my answers:

Feel free to ask if you have more questions. Hope this helps!

ttsesm commented 1 year ago

@Wuziyi616 thanks for your time and the feedback. Please find my comments below, inline.

Hi, thanks for your interest in our work. Here are my answers:

* Regarding 1, 2, 3: unfortunately, that config is NOT for NSM, i.e. **we don't implement NSM in this codebase**. This is because NSM applies a Transformer over every point in each part (i.e. the Transformer input is `num_part x N` tokens, N~1000), which requires lots of memory when `num_part` is large (note NSM only uses 2 parts, so the GPU memory is affordable). The `pn_transformer` implemented here applies a Transformer over the PointNet global feature of each part (i.e. the Transformer input is only `num_part` tokens), so the memory cost is much lower.

I see, so in the end you just incorporated the adversarial loss on top of the pn_transformer implementation to check how it performs. Now it is a bit clearer. Interesting, though.

* Though we didn't implement NSM, we did find it works better than the baselines and `pn_transformer` in the everyday subset + 2-part-only setting. I think this is because **NSM learns local surface features** which are important for geometric assembly. On the other hand, all the baselines directly apply PointNet to extract **global features** from each part and perform reasoning over them, which ignores the rich surface features.

What do you mean by "_we did find it works better than the baselines and pn_transformer in the everyday subset + 2-part-only setting_"? As I understand it, you mean the NSM implementation, right? We agree that since NSM learns local surface features, it should be superior to the baselines, at least for geometric assembly, as you point out. Btw, is it possible to share your NSM implementation? The implementation in the NSM repository is not complete, and when we contacted the author he pointed us to this repository, but on the other hand you do not have the NSM implementation here.

* Loss matching: in semantic assembly, say we want to assemble a chair with 4 legs. The 4 legs are usually the same, i.e. they are **geometrically equivalent** in their canonical poses. So when calculating the loss, if leg1 is put in the correct position of leg2, leg2 in the correct position of leg1, etc., the loss should be 0. In order to account for such equivalence, in semantic assembly we need to do loss matching, i.e. match the predicted parts with their ground-truth parts via Hungarian matching (see this [function](https://github.com/Wuziyi616/multi_part_assembly/blob/dcda0aa88e5ddf9933095569932cdfbd34c6ff4e/multi_part_assembly/models/modules/base_model.py#L184)).

* However, in geometric assembly, parts are randomly broken, so there are usually no geometrically equivalent parts. Therefore, we don't need to apply the Hungarian algorithm for loss matching.

Ok, I see what you mean. It is clear now. Thanks.
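A minimal sketch of this matching, assuming `pred_pts` and `gt_pts` are `[P, N, 3]` point clouds of P geometrically equivalent parts (the linked function also handles equivalence groups and batching, which are omitted here):

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match_loss(pred_pts, gt_pts):
    """Match P predicted parts to P geometrically equivalent GT parts
    so that the summed per-part loss is minimal.
    pred_pts, gt_pts: [P, N, 3] transformed part point clouds."""
    P = pred_pts.shape[0]
    # cost[i, j] = loss of assigning prediction i to GT slot j
    cost = torch.zeros(P, P)
    for i in range(P):
        for j in range(P):
            cost[i, j] = ((pred_pts[i] - gt_pts[j]) ** 2).mean()
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    return cost[torch.as_tensor(rows), torch.as_tensor(cols)].sum()
```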

* The final loss is calculated [here](https://github.com/Wuziyi616/multi_part_assembly/blob/dcda0aa88e5ddf9933095569932cdfbd34c6ff4e/multi_part_assembly/models/modules/base_model.py#L417-L422). Basically, we take each `(key, value)` from the loss dict; if the key looks like `xxx_loss`, we multiply it by the `xxx_loss_w` in the `cfg` and accumulate it into the total loss.

I see, seems clear.
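In pseudocode form, that accumulation amounts to something like the following (treating `cfg` as a flat dict here for simplicity; the actual config object differs):

```python
def compute_total_loss(loss_dict, cfg):
    """Weighted sum: every '<name>_loss' entry is scaled by the
    matching '<name>_loss_w' weight from the config."""
    total = 0.0
    for key, value in loss_dict.items():
        if key.endswith('_loss'):
            total = total + value * cfg[key + '_w']
    return total

# e.g. compute_total_loss({'trans_loss': 0.5, 'rot_loss': 2.0},
#                         {'trans_loss_w': 1.0, 'rot_loss_w': 0.2})
```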

* For the difference between geo. and sem. assembly, please see our paper and [Twitter thread explanation](https://twitter.com/ycchen918/status/1586169332685471745) (especially the `4/12` one). So basically, building a chair from legs, arms, seat, and back is considered sem. assembly, because all parts have semantic meanings. On the other hand, building a broken vase is considered geo. assembly, as they are random fractures, and don't have semantic meanings.

Sure, understood. It makes sense; however, someone could say that sem. assembly is a sub-category of geo. assembly, since you could also consider that the chair parts have no semantic meaning and treat them again as random fractures. In any case, I understand your point.

Thanks also for the Twitter link, it looks interesting.

Feel free to ask if you have more questions. Hope this helps!

Wuziyi616 commented 1 year ago

Indeed, I agree that sem. assembly seems to be a sub-category of geo. assembly. But there is much more information in sem. assembly, so it may be worth designing new algorithms there. I fully agree that a unified framework that can handle both tasks would be super interesting.

Regarding the NSM code, indeed I don't implement it here; I just tried a GAN over pn_transformer to see if it helps. Interestingly, the GAN adversarial loss doesn't seem to help. In my preliminary experiments with NSM, I didn't use the SDF loss + adv loss, so the implementation you mentioned should be good to use.

ttsesm commented 1 year ago

I see, and you found that the NSM from the aforementioned repository, considering only the rotation, translation, and point distance losses, performs better than the baselines in the everyday subset with the 2-part-only setting? Because in my case it seems to perform worse.

Wuziyi616 commented 1 year ago

I used the same losses as the baselines, i.e. the geometric loss with trans, rot, chamfer, L2, etc. I think some of these losses are important for geo. assembly.
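For reference, a minimal (unoptimized) sketch of one of those terms, the Chamfer distance between two point clouds:

```python
import torch

def chamfer_l2(x, y):
    """Symmetric Chamfer distance between point clouds
    x: [N, 3] and y: [M, 3]."""
    d = torch.cdist(x, y)  # [N, M] pairwise Euclidean distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```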

ttsesm commented 1 year ago

So you switched the 3 losses from here to the ones you mentioned above? Interesting.

Also, in principle you could use local features instead of global ones in the baseline methods you are benchmarking here. Then, to avoid memory issues, you could test only on a 2-part setting or a limited number of parts, e.g. up to 5.

Can you also explain a bit about the padding in the data? In practice, after some debugging, I did not see it being used... :thinking:

Wuziyi616 commented 1 year ago

I'm not sure how we can use, say, DGL with local features. Currently, given N global features, we can easily build a GNN over them. But if we have N x P per-point features for all parts, how do you build the graph? Do you mean to treat each point as a node? I haven't tried that, but wouldn't it be super slow? From my experiments, DGL over global features is already very slow.

The padding is simply for batch processing in PyTorch. Since different shapes have different numbers of parts, let's say a chair is of shape [3, 1024] and a table is of shape [6, 1024]; then we cannot stack them to form a batch. Also, the PyTorch DataLoader requires all loaded data to have the same shape in order to batch them. Of course, you can write a custom sampler, but I chose not to do that...
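A minimal sketch of such zero-padding (the real dataset code also pads poses and other per-part fields; `max_num_parts` would be the dataset-wide maximum):

```python
import torch

def pad_parts(parts, max_num_parts):
    """Zero-pad a [P, N, 3] part point-cloud tensor to
    [max_num_parts, N, 3] and return a validity mask, so shapes
    with different part counts can be stacked into one batch."""
    P, N, C = parts.shape
    padded = parts.new_zeros(max_num_parts, N, C)
    padded[:P] = parts
    valid = torch.zeros(max_num_parts, dtype=torch.bool)
    valid[:P] = True
    return padded, valid
```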

ttsesm commented 1 year ago

Ok, I got the point of the padding. Actually, since I was playing with examples in the 2-part setting only, it was not really making any difference, and that's why I couldn't see it in practice. Thanks for the elaboration ;-).

Regarding extracting local and global features, I am still a bit puzzled. What I mean is, extracting per-point (local) or per-cloud (global) features is up to the settings you pass to the encoder: depending on whether the `global_feats=` flag in dgcnn or pointnet is set to True or False, you get the corresponding behavior. This means you could activate local feature extraction in your baselines as well, no?

Wuziyi616 commented 1 year ago

That's true, I agree. You can definitely try that, and I'm also interested in how much local features will help. I'm just a bit worried about the GPU memory haha. But since you're playing around with 2 parts, maybe it's fine.
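To illustrate the flag being discussed (a toy module for the sake of the idea, not the repo's actual encoder), the same per-point MLP can return either a pooled global feature or per-point local features:

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Toy PointNet-style encoder: a shared per-point MLP, with a
    global_feats flag toggling between max-pooled global features
    and raw per-point (local) features."""
    def __init__(self, feat_dim=256, global_feats=True):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, feat_dim, 1),
        )
        self.global_feats = global_feats

    def forward(self, pts):                      # pts: [B, N, 3]
        feats = self.mlp(pts.transpose(1, 2))    # [B, C, N]
        if self.global_feats:
            return feats.max(dim=2).values       # [B, C] global
        return feats.transpose(1, 2)             # [B, N, C] local
```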

Wuziyi616 commented 1 year ago

Feel free to reopen it if you have further questions!

Best, Ziyi

ttsesm commented 1 year ago

Sure, thanks for your time ;-)

ttsesm commented 1 year ago

@Wuziyi616 I have three more questions.

  1. Why do the benchmark reports include results of your Transformer implementations for semantic assembly but not for geometric assembly?
  2. Why do you obtain the results by training individually per category on the everyday dataset and then averaging over all categories? Wouldn't it make more sense, and be more scientifically correct, to train the models on all categories at once and test on all categories?
  3. What is the difference between the relative errors and the initially reported errors?

Thanks.

Wuziyi616 commented 1 year ago
  1. This is simply because we didn't find improvement using the Transformer in geo. assembly, so we don't include them. On the other hand, the Transformer does outperform most of the baselines in sem. assembly, so we include them in the sem. report.
  2. We also have results of training on all categories and testing on all; see Table 11 in the Appendix of our paper. Basically, the upper part of the table shows the results of training on each category individually, and the bottom part shows training/testing on all categories together.
  3. You can see the explanation of the relative metrics here (the first bullet point under Experimental). The relative metrics are also designed to handle the symmetry ambiguity. Let's imagine a bottle that is broken into 5 pieces. As long as the relative pose between the pieces is correct, they will form a perfectly assembled bottle. In this case, the relative metrics will always give 0 error. On the other hand, the absolute metrics (e.g. rot_mse) may still give a non-zero error if there is a global transformation of the bottle compared to the GT.
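A minimal sketch of what such a relative rotation metric could look like (assuming per-part rotation matrices; the benchmark's actual metric definitions are in the paper):

```python
import torch

def relative_rot_error(R_pred, R_gt):
    """Compare pairwise relative rotations R_i^T R_j, which are
    invariant to any global rotation of the assembled shape.
    R_pred, R_gt: [P, 3, 3] per-part rotation matrices."""
    rel_pred = torch.einsum('iab,jac->ijbc', R_pred, R_pred)  # R_i^T R_j
    rel_gt = torch.einsum('iab,jac->ijbc', R_gt, R_gt)
    return ((rel_pred - rel_gt) ** 2).mean()
```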
ttsesm commented 1 year ago
  1. Ok, I see.
  2. Sure, but if I look at the reported results in Table 11, you are still considering the results per category, which you average in the last column, or not?

As I understand it, you trained the model on all categories, but you still tested individually per category and then averaged the results.

In my case, I tried to train and test on all categories at the same time, based on the split data for the everyday subset, using the following commands respectively: train:

python scripts/train.py --cfg_file $CFG --fp16 --cudnn

test:

python scripts/test.py --cfg_file $CFG --weight path/to/weight

and I got on-par results with the ones reported here. Thus, to be honest, I do not really see the reason to train individually per category three times and then average per category as well as over the three runs, when you can just report the results over all categories together. In my opinion, it becomes too complicated without reason. Anyway, do not get me wrong, it is welcome that you have done this ablation study :-).

  3. Ok, I see the point.

Btw, I am trying to play with the settings for extracting local features instead of global ones, but it seems it is not something that can be applied directly by just setting the `global_feats=False` flag. Changes to the code are needed, right?

Wuziyi616 commented 1 year ago

Yes, I believe you need to modify some parts of the code to integrate local features

ttsesm commented 1 year ago

Does it make sense to fuse the local features into global ones per piece? For example, let's say I have a tensor like [Batch, Pieces, Points, Local Features], so something like [6, 20, 1024, 256]; would it make sense to transform it to [Batch, Pieces, Fused Features], so something like [6, 20, 256]? I guess this would make sense to be done on the Transformer side, right?

Wuziyi616 commented 1 year ago

This is a good question and something worth looking into. As you can see, currently we're transforming local features into global features simply with a max-pooling operation. You can try better aggregation methods, but I don't know how well they will work. I think another direction is to first let the local features interact with the local features from other parts, then aggregate them into a global one and predict the pose from it.

Also we're now using a PointNet to extract part features. PointNet is definitely not good at local feature extraction. So you can try replacing it with PointNet++ or DGCNN first
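A minimal sketch of the max-pooling fusion discussed above, using the shapes from the question:

```python
import torch

def fuse_local_to_global(local_feats):
    """Aggregate per-point features [B, P, N, C] into one per-part
    feature [B, P, C] via max-pooling over the points dimension."""
    return local_feats.max(dim=2).values

x = torch.randn(6, 20, 1024, 256)      # [Batch, Pieces, Points, C]
print(fuse_local_to_global(x).shape)   # torch.Size([6, 20, 256])
```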

ttsesm commented 1 year ago

Interesting, thanks for the feedback. Actually, I am trying to use your Transformer solution and apply changes based on it. By max-pooling, you mean the poolings in the encoders here and here, right?

So in principle what you are suggesting is to let the features interact in the Transformer and then aggregate them before passing them to the pose predictor.

Can you also elaborate a bit on what you do in the iterative refinement version, since I am not sure I got it right?

Indeed, pointnet is not good at extracting local features. dgcnn seems to be a better alternative, which I am trying at the moment, but I would also like to test KPConv (or SPConv), which seems to be superior to both.

Wuziyi616 commented 1 year ago
ttsesm commented 1 year ago

Thanks ;-)

wzm2256 commented 11 months ago

  • Though we didn't implement NSM, we did find it works better than the baselines and pn_transformer in the everyday subset + 2-part-only setting. I think this is because NSM learns local surface features which are important for geometric assembly. On the other hand, all the baselines directly apply PointNet to extract global features from each part and perform reasoning over them, which ignores the rich surface features.

I have a question here. Since NSM performs better than the baselines (at least in the 2-part setting), why didn't you put it in the benchmark? (I am indeed confused here, because you said you didn't implement it.)

Wuziyi616 commented 11 months ago

Simply because NSM is only designed for the 2-part setting, while our benchmark targets many parts. I understand your point that we could just train/test it on the 2-part data we have, but, as we decided in our project meeting, it is just not a general enough baseline for our dataset.