Closed ttsesm closed 1 year ago
@Wuziyi616 thanks for your time and the feedback. Please find my comments below inline.
Hi, thanks for your interest in our work. Here are my answers:
* Regarding 1,2,3, unfortunately, that config is NOT for NSM, i.e. **we don't implement NSM in this codebase**. This is because NSM applies Transformer over every point in each part (i.e. the Transformer input is `num_part x N` tokens, N~1000), which requires lots of memory when the `num_part` is large (note NSM only uses 2 parts so the GPU memory is affordable). The `pn_transformer` implemented here applies Transformer over the PointNet global feature from each part (i.e. the Transformer input is only `num_part` tokens), so the memory is much lower.
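To make the memory argument concrete, here is a rough token-count sketch (editorial illustration, assuming N ≈ 1000 points per part as mentioned above; the function names are hypothetical, not the repo's API). Self-attention memory grows with the square of the token count, which is why per-point tokens blow up for many parts:

```python
def nsm_tokens(num_part, points_per_part=1000):
    # NSM-style: every point of every part is an attention token.
    return num_part * points_per_part

def pn_transformer_tokens(num_part):
    # pn_transformer: one pooled PointNet feature per part.
    return num_part

def attn_map_size(num_tokens):
    # One self-attention map has num_tokens^2 entries.
    return num_tokens ** 2

# Even at only 2 parts, per-point tokens need ~10^6x more attention memory.
print(attn_map_size(nsm_tokens(2)))             # 4000000
print(attn_map_size(pn_transformer_tokens(2)))  # 4
```

For 20 parts, the NSM-style attention map would have 20,000² = 4×10⁸ entries per head, which quickly exceeds GPU memory.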
I see, so in the end you just incorporated the adversarial loss on top of the `pn_transformer` implementation in order to check how it performs. Now it is a bit clearer. Interesting though.

* Though we didn't implement NSM, we did find it works better than the baselines and `pn_transformer` in the everyday subset + 2 part only setting. I think this is because **NSM learns local surface features**, which are important for geometric assembly. On the other hand, all the baselines directly apply PointNet to extract **global features** from each part and perform reasoning over them, which ignores the rich surface features.
What do you mean by "_we did find it works better than the baselines and `pn_transformer` in the everyday subset + 2 part only setting_"? As I understand it, you mean the NSM implementation, right? We agree that since NSM learns local surface features, it should be superior to the baselines, at least for geometric assembly, as you point out. Btw, is it possible to share the NSM implementation? The implementation in this repository is not complete, and when we contacted the author he pointed us to this repository, but here you do not have the NSM implementation either.

* Loss matching: in semantic assembly, say we want to assemble a chair with 4 legs. The 4 legs are usually the same, i.e. they are **geometrically equivalent** in their canonical poses. So when calculating the loss, if leg1 is put in the correct position of leg2, leg2 is put in the correct position of leg1, etc., the loss should be 0. To account for such equivalence, in semantic assembly we need to do loss matching, i.e. match the predicted parts with their ground-truth parts via Hungarian matching (see this [function](https://github.com/Wuziyi616/multi_part_assembly/blob/dcda0aa88e5ddf9933095569932cdfbd34c6ff4e/multi_part_assembly/models/modules/base_model.py#L184)).
* However, in geometric assembly, parts are randomly broken, so usually there are no geometrically equivalent parts. Therefore, we don't need to apply the Hungarian algorithm for loss matching.
Ok, I see what you mean. It is clear now. Thanks.
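The matching idea above can be shown with a toy sketch (editorial, not the repo's code; the repo uses Hungarian matching, while this tiny example brute-forces all assignments, which gives the same answer for a handful of parts):

```python
# Loss matching for geometrically equivalent parts (e.g. identical chair
# legs): take the minimum total error over all assignments of predicted
# part positions to ground-truth positions.
from itertools import permutations

def match_loss(pred_pos, gt_pos):
    """Min total squared L2 error over all pred -> GT assignments."""
    n = len(pred_pos)
    best = float("inf")
    for perm in permutations(range(n)):
        total = sum(
            sum((p - g) ** 2 for p, g in zip(pred_pos[i], gt_pos[perm[i]]))
            for i in range(n)
        )
        best = min(best, total)
    return best

# Two "legs" predicted at each other's GT positions: matched loss is 0.
gt = [(0.0, 0.0), (1.0, 0.0)]
pred = [(1.0, 0.0), (0.0, 0.0)]  # swapped, but geometrically equivalent
print(match_loss(pred, gt))  # -> 0.0
```

Without the matching (i.e. comparing pred[i] directly against gt[i]), the swapped prediction would be penalized even though the assembled shape is perfect.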
* The final loss is calculated [here](https://github.com/Wuziyi616/multi_part_assembly/blob/dcda0aa88e5ddf9933095569932cdfbd34c6ff4e/multi_part_assembly/models/modules/base_model.py#L417-L422). Basically, we take `(key, value)` from the loss dict, if the key looks like `xxx_loss`, we multiply it with the `xxx_loss_w` in the `cfg` and accumulate it to the total loss.
I see, seems clear.
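The loss bookkeeping described in that bullet can be sketched as follows (a minimal illustration; the dict/config names here are made up for the example, and in the repo `cfg` is a config object rather than a plain dict):

```python
# Accumulate the total loss: every entry whose key ends in "_loss" is
# scaled by the matching "<name>_loss_w" weight from the config and summed.

def total_loss(loss_dict, cfg):
    total = 0.0
    for key, value in loss_dict.items():
        if key.endswith("_loss"):
            weight = cfg.get(key + "_w", 1.0)  # e.g. "trans_loss" -> "trans_loss_w"
            total += weight * value
    return total

loss_dict = {"trans_loss": 2.0, "rot_loss": 0.5, "acc": 0.9}  # "acc" is not a loss
cfg = {"trans_loss_w": 1.0, "rot_loss_w": 0.2}
print(total_loss(loss_dict, cfg))  # 2.0*1.0 + 0.5*0.2 = 2.1
```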
* For the difference between geo. and sem. assembly, please see our paper and [Twitter thread explanation](https://twitter.com/ycchen918/status/1586169332685471745) (especially the `4/12` one). So basically, building a chair from legs, arms, seat, and back is considered sem. assembly, because all parts have semantic meanings. On the other hand, building a broken vase is considered geo. assembly, as they are random fractures, and don't have semantic meanings.
Sure, understood. It makes sense; however, one could say that sem. assembly is a sub-category of geo. assembly, since you could also consider that the chair parts have no semantic meaning and treat them again as random fractures. In any case, I understand your point.
Thanks also for the Twitter link, it looks interesting.
Feel free to ask if you have more questions. Hope this helps!
Indeed, I agree that sem. assembly seems to be a sub-category of geo. assembly. But there is much more information in sem. assembly, so it may be worth designing new algorithms there. I fully agree that a unified framework that can handle both tasks would be super interesting.
Regarding the NSM code, indeed I don't implement it here; I just tried a GAN over `pn_transformer` to see if it helps. Interestingly, the GAN adversarial loss doesn't seem to help. In my preliminary experiment with NSM, I didn't use the SDF loss + adv loss, so the implementation you mentioned should be good to use.
I see, and you found that the NSM from the aforementioned repository, considering only the rotation, translation and point distance losses, performs better than the baselines in the everyday subset with the 2-part-only setting? Because in my case it seems to perform worse.
I used the same losses as the baselines, i.e. the `geometric_loss` with trans, rot, chamfer, l2, etc. I think some losses are important for geo. assembly.
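A toy sketch of two of the terms mentioned (translation MSE and a symmetric Chamfer distance); this is editorial pseudocode for illustration, and the exact terms and weighting in the repo's `geometric_loss` may differ:

```python
# Translation MSE plus symmetric Chamfer distance between the predicted
# and ground-truth point sets of one part.

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def chamfer(pts_a, pts_b):
    """Symmetric Chamfer: mean nearest-neighbour squared distance, both ways."""
    def one_way(src, dst):
        return sum(
            min(sum((s - d) ** 2 for s, d in zip(p, q)) for q in dst)
            for p in src
        ) / len(src)
    return one_way(pts_a, pts_b) + one_way(pts_b, pts_a)

pred_trans, gt_trans = (0.0, 0.0, 1.0), (0.0, 0.0, 0.0)
pred_pts = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]  # part shifted up by 1
gt_pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
loss = mse(pred_trans, gt_trans) + chamfer(pred_pts, gt_pts)
print(loss)
```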
So you switched the 3 losses from here to the ones you mentioned above? Interesting.
Also, in principle you could use local features instead of global ones in the baseline methods you are benchmarking here. Then, to avoid memory issues, you could test only a 2-part setting or a limited number of parts, e.g. up to 5.
Can you also explain a bit about the padding in the data? In practice, after some debugging, I did not see it used... :thinking:
I'm not sure how we can use, say, DGL with local features. Currently, given `N` global features, we can easily build a GNN over them. But if we have `N x P` per-point features for all parts, how do you build the graph? Do you mean to treat each point as a node? I haven't tried that, but wouldn't that be super slow? Because from my experiment, DGL over global features is already very slow.
The padding is simply for batch processing in PyTorch. Since different shapes have different numbers of parts, let's assume a chair is of shape `[3, 1024]` and a table is of shape `[6, 1024]`; then we cannot stack them to form a batch. Also, the PyTorch DataLoader requires all loaded data to have the same shape in order to batch them. Of course, you can write a custom sampler, but I chose not to do that...
Ok, I got the point of the padding. Since I was only playing with examples in the 2-part setting, it was not really making any difference, and that's why I couldn't see it in practice. Thanks for the elaboration ;-).
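The padding described above can be sketched as follows (an editorial illustration; `pad_parts` and `valids` are hypothetical names, not necessarily the repo's):

```python
# Pad a shape's parts to a fixed max_num_part so the default PyTorch
# DataLoader can stack samples into a batch; a `valids` mask records
# which slots hold real parts. Shown with numpy for simplicity.
import numpy as np

def pad_parts(parts, max_num_part):
    """parts: [P, N, 3] -> ([max_num_part, N, 3], [max_num_part] mask)."""
    P, N, C = parts.shape
    padded = np.zeros((max_num_part, N, C), dtype=parts.dtype)
    padded[:P] = parts
    valids = np.zeros(max_num_part, dtype=np.float32)
    valids[:P] = 1.0
    return padded, valids

chair = np.random.rand(3, 1024, 3)  # 3 parts, 1024 points each
padded, valids = pad_parts(chair, max_num_part=20)
print(padded.shape, valids.sum())  # (20, 1024, 3) 3.0
```

In the 2-part-only setting, every sample already has the same part count, which is why the padding appears to do nothing there.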
For extracting local and global features I am still a bit puzzled. What I mean is, extracting per-point (local) or per-cloud (global) features is up to the settings you pass to the encoder, i.e. depending on whether the `global_feats` flag in dgcnn or pointnet is set to `True` or `False` you get the corresponding behavior. Then this means that you could activate local feature extraction in your baselines as well, no?
That's true, I agree. You can definitely try that, and I'm also interested in how much local features will help. I'm just a bit worried about the GPU memory haha. But since you're playing around with 2 parts, maybe it's fine.
Feel free to reopen it if you have further questions!
Best, Ziyi
Sure, thanks for your time ;-)
@Wuziyi616 I have three more questions.
Thanks.
(see the Experimental section). The relative metrics are also designed to handle the symmetry ambiguity. Let's imagine a bottle that is broken into 5 pieces. As long as the relative pose between the pieces is correct, they will form a perfectly assembled bottle. In this case, the relative metrics will always give 0 error. On the other hand, the absolute metrics (e.g. `rot_mse`) may still give a non-zero error if there is a global transformation of the bottle compared to GT.

As I understand it, what you did is train the model on all categories, but still test individually per category and then average the results.
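A toy illustration of the relative vs. absolute metric distinction above (editorial sketch, using pure translations for simplicity rather than full SE(3) poses):

```python
import numpy as np

# Every predicted piece is off by the same global shift, so relative
# (pairwise) errors vanish while the absolute error does not.
gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
pred = gt + np.array([5.0, 0.0, 0.0])  # global transformation of the whole shape

abs_err = np.abs(pred - gt).mean()  # absolute metric: nonzero

# Relative metric: compare the offsets between every pair of pieces.
rel_gt = gt[:, None, :] - gt[None, :, :]
rel_pred = pred[:, None, :] - pred[None, :, :]
rel_err = np.abs(rel_pred - rel_gt).mean()  # exactly zero

print(abs_err, rel_err)
```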
In my case, I've tried to train and test on all categories at the same time based on the split data for the everyday subset, using the following commands respectively:

train: `python scripts/train.py --cfg_file $CFG --fp16 --cudnn`

test: `python scripts/test.py --cfg_file $CFG --weight path/to/weight`

and I've got on-par results with the ones reported here. Thus, to be honest, I do not really see the reason to train individually per category three times and then average per category as well as over the three runs, when you can just give the results over all categories together. In my opinion it becomes too complicated without reason. Anyway, do not get me wrong, it is welcome that you have done this ablation study :-).
Btw, I am trying to play with the settings for extracting local features instead of global ones, but it seems it is not something that can be applied directly by just setting the `global_feats=False` flag. Changes in the code need to be applied, right?
Yes, I believe you need to modify some parts of the code to integrate local features.
Does it make sense to fuse the local features into global ones per piece? For example, let's say I have a tensor like [Batch, Pieces, Points, Local Features], so something like [6, 20, 1024, 256]; would it make sense to transform it to [Batch, Pieces, Fused Features], so something like [6, 20, 256]? I guess this would make sense to be done on the Transformer side, right?
This is a good question and something worth looking into. As you can see, currently we're transforming local features into global features simply with a max-pooling operation. You can try better aggregation methods, but I don't know how much they will help. I think another direction is to first let local features interact with local features from other parts, then aggregate them into a global one, and predict the pose from it.
Also we're now using a PointNet to extract part features. PointNet is definitely not good at local feature extraction. So you can try replacing it with PointNet++ or DGCNN first
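The max-pool fusion discussed above can be sketched in a few lines (editorial illustration matching the [Batch, Pieces, Points, C] example from the question, with the sizes shrunk to keep it light; `fuse_local_to_global` is a hypothetical name):

```python
import numpy as np

def fuse_local_to_global(local_feats):
    """[Batch, Pieces, Points, C] -> [Batch, Pieces, C] via max-pool over points."""
    return local_feats.max(axis=2)

# Shrunk stand-in for the [6, 20, 1024, 256] tensor from the question.
local = np.random.rand(6, 20, 64, 16)
fused = fuse_local_to_global(local)
print(fused.shape)  # (6, 20, 16)
```

The alternative direction mentioned above would insert cross-part interaction (e.g. attention over all Pieces × Points tokens) before this pooling step, at the cost of much more memory.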
Interesting, thanks for the feedback. Actually, I am trying to use your transformer solution and apply changes based on this. By max pooling you mean the poolings in the encoders here and here, right?

So in principle what you are suggesting is to let the features interact in the transformer and then aggregate them before passing them to the pose predictor.
Can you also elaborate a bit on what you do in the iterative refinement version, since I am not sure I got it right?
Indeed, PointNet is not good at extracting local features. DGCNN seems to be a better alternative, which I am trying at the moment, but I would also like to test KPConv (or SPConv), which seems to be superior to both.
Thanks ;-)
I have a question here. Since NSM performs better than the baselines (at least in the 2-part setting), why don't you put it in the benchmark? (I am indeed confused here, because you said you didn't implement it.)
Simply because NSM is only designed for the 2-part setting, while our benchmark targets many parts. I understand your point that we could just train/test it on the 2-part data we have. But as we decided in our project meeting, it is just not a general enough baseline for our dataset.
Hi @Wuziyi616,
Thanks for sharing a clean version of the bb benchmark code; it looks quite nice and well organized. I have some questions, though, which I would like to clarify.
For now I am puzzled by the above, but as I get more into the code I will probably come up with some more questions.
Thanks.