cgtuebingen / Flex-Convolution

Source code for: Flex-Convolution (Million-Scale Point-Cloud Learning Beyond Grid-Worlds), accepted at ACCV 2018
Apache License 2.0

modelnet classifier model spec? #13

Closed jackd closed 5 years ago

jackd commented 5 years ago

Hi, I'm trying to reproduce your modelnet results as accurately as possible and I'm hoping you can provide more details on the architecture used. Specifically

  1. How many resolution blocks did you use?
  2. What was the size of the pre-classification fully connected layer?
  3. Did you use dropout/anything fancy in the global network size of things, or just global pool -> FC(units) -> relu -> FC(40) ?
  4. What batch size/optimizer did you use?

Apologies if this information is somewhere and I'm failing to find it...

PatWie commented 5 years ago
BATCH_SIZE = 16
SUB_BATCH_SIZE = 4
NUM_POINTS = 1024
K = 8
FEATURE_LEN = 128
TREE_DEPTH = 11

class Model(ModelDesc):
    def _get_inputs(self):
        return [InputDesc(tf.float32, (BATCH_SIZE, NUM_POINTS, 3), 'point'),
                InputDesc(tf.float32, (BATCH_SIZE, 2**(TREE_DEPTH) - 1, 3), 'tree'),
                InputDesc(tf.int32, (BATCH_SIZE, ), 'label')]

    def _build_graph(self, inputs):

        _, tree, label = inputs

        level = TREE_DEPTH - 1

        points = tree[:, 2**level - 1:2**(level + 1) - 1, :]
        neighbor_hood = knn(points, k=K, subBatch=SUB_BATCH_SIZE)

        features = tf.ones([BATCH_SIZE, NUM_POINTS, 16])
        features = tf.transpose(features, [0, 2, 1])
        neighbor_hood = tf.transpose(neighbor_hood, [0, 2, 1])
        position = tf.transpose(points, [0, 2, 1])

        features = FlexConv('conv_pre', features, neighbor_hood, position,
                            FEATURE_LEN, nl=ReLU)

        for level in [9, 8, 7, 6, 5, 4, 3]:
            w = 2**(level)

            # sub-sampling step
            features = tf.transpose(features, [0, 2, 1])
            features = NeighborhoodSubsampling('subsampling_%i' % level, features)
            features = tf.transpose(features, [0, 2, 1])

            position = tree[:, 2**level - 1:2**(level + 1) - 1, :]
            # _, neighbor_hood = ops.nano_flann(position, k=min(K, w))
            neighbor_hood = knn(position, k=min(K, w), subBatch=SUB_BATCH_SIZE)

            position = tf.transpose(position, [0, 2, 1])
            neighbor_hood = tf.transpose(neighbor_hood, [0, 2, 1])

            features = FlexConv('conv%i_0' % level, features, neighbor_hood, position,
                                FEATURE_LEN, nl=ReLU)
            features = FlexConv('conv%i_1' % level, features, neighbor_hood, position,
                                FEATURE_LEN, nl=ReLU)
            features = FlexConv('conv%i_2' % level, features, neighbor_hood, position,
                                FEATURE_LEN, nl=ReLU)

            # not enough clusters in current level? --> break
            if w <= K:
                break

        # fully connected
        features_shape = features.get_shape().as_list()
        pointcloud_dim = np.prod(features_shape[1:])
        features = tf.reshape(features, [BATCH_SIZE, pointcloud_dim])
        logits = FullyConnected('fc1', features, 40, nl=tf.identity)

        # vanilla classification loss
        cls_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=label)
        cls_loss = tf.reduce_mean(cls_loss, name="cls_costs")

        accuracy = symbf.accuracy(logits, label, name='accuracy')

        self.cost = tf.identity(cls_loss, name="total_costs")
        summary.add_moving_summary(cls_loss, self.cost, accuracy)
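To make the indexing above concrete: the tree input is a complete binary tree laid out in an array, so level `l` occupies indices `[2**l - 1, 2**(l+1) - 1)` and holds `2**l` points. A small sketch (plain Python, not from the repo):

```python
# Sketch: how the kd-tree array layout above indexes levels.
# A complete binary tree with TREE_DEPTH levels stores 2**TREE_DEPTH - 1
# nodes; level `l` occupies indices [2**l - 1, 2**(l + 1) - 1).
TREE_DEPTH = 11
NUM_POINTS = 1024

def level_slice(level):
    """Start/stop indices of one tree level in the flat array."""
    return 2**level - 1, 2**(level + 1) - 1

# The deepest level holds exactly NUM_POINTS = 2**(TREE_DEPTH - 1) points.
start, stop = level_slice(TREE_DEPTH - 1)
print(start, stop, stop - start)  # 1023 2047 1024
```

This is why the loop in `_build_graph` can read each resolution directly out of the `tree` tensor instead of recomputing a subsampling.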

That was the network we were using. Pretty straightforward. @grohf please chime in if there is another version for which you chose the hyper-parameters :-)

jackd commented 5 years ago

Thanks @PatWie. Sorry to bother you again (I know CVPR is close! :S), but can I just confirm my understanding of a few things?

  1. There are no residual-style skip connections or 1x1 convolutions in the classification architecture.
  2. The number of filters doesn't increase at any stage beyond the first convolution.
  3. No flex-max-pooling is applied at any stage - just subsampling.

All good if that's the case, they just seem like interesting choices to me, so maybe I'm misunderstanding something...

jackd commented 5 years ago

@PatWie Sorry to be a pain, but I'm even more confused now. 2 additional questions:

  1. How do you order the points in your tree? Specifically the first 8 (i.e. those remaining after all pooling)? I had assumed the model either reduced to a single point or there was some permutation-invariant pooling over the cloud size, but I see you have a reshape.
  2. The paper claims your model uses 346k parameters, but I'm getting a very different number. If each of the main-loop flex convs has FEATURE_LEN ** 2 * (Dp + 1) = 128 ** 2 * 4 ≈ 65k parameters (plus change from biases), and there are 7 blocks of 3, doesn't that leave you with 1.4M or so? Not including the initial flex conv or the final dense layer.
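Spelling out the arithmetic behind question 2 (assumption on my part: each FlexConv layer has roughly `C_in * C_out * (Dp + 1)` weights, with Dp = 3 position dimensions, which is what the 65k figure implies):

```python
# Parameter-count estimate for the main loop (my assumption: each FlexConv
# has C_in * C_out * (Dp + 1) weights, Dp = 3 position dims plus a bias term).
FEATURE_LEN = 128
per_conv = FEATURE_LEN**2 * 4   # 65_536, the "65k" above
main_loop = per_conv * 3 * 7    # 7 levels x 3 convs each
print(per_conv, main_loop)      # 65536 1376256, i.e. ~1.4M
```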
PatWie commented 5 years ago

Disclaimer: I should have stated this before: I didn't train the ModelNet40 models for the paper. I just helped to monitor the losses and the logs (basically baby-sat the training). I had trained a few models on ModelNet40 with slightly lower accuracy before. (After all, a dumb classifier gets 84% accuracy.)

  1. Yes, for ModelNet none of my models had skip connections.
  2. Yes, I kept the number of filters constant.
  3. NeighborhoodSubsampling was another name for max-pooling. As stated in the paper, the downsampling factor was always 4.
  4. I personally used random orderings for preliminary experiments, but Fabian (@grohf ) did use IDISS, as explained in the paper, to order the points and get a further improvement. The idea is to order the points such that the lower half of the point cloud can be removed.
  5. To be clear, I had models with #params varying between 51837 and 142269. One of the "tricks" for the best performance was really to get the ordering correct.
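For anyone reading along, here is a rough NumPy sketch of the inverse-density ordering idea. This is an illustration only, not the paper's exact IDISS implementation: score each point by its mean k-NN distance (low local density gets a high score), sort descending, and then slicing off the tail removes points from the densest regions first.

```python
import numpy as np

def inverse_density_order(points, k=8):
    """Illustrative sketch (not the repo's IDISS): order points so that
    truncating the array removes points from dense regions last-first."""
    # Brute-force pairwise distances; fine for small N.
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    d.sort(axis=1)                       # row-wise ascending; column 0 is self
    score = d[:, 1:k + 1].mean(axis=1)   # mean k-NN distance ~ 1 / density
    return points[np.argsort(-score)]    # sparse-region points come first

# Usage: keep only the "upper half" of the ordering.
# ordered = inverse_density_order(cloud)
# kept = ordered[:len(cloud) // 2]
```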

An additional trick I learned: splitting the channels into groups and applying a flex-conv to each group separately helps:

[H,W,128] 
--> 1x1 conv 
--> split
--> different flex-conv [H,W,16] [H,W,16] [H,W,16] [H,W,16]  [H,W,16] [H,W,16] [H,W,16] [H,W,16] 
--> concat
--> 1x1 conv 

This also reduces the number of parameters from 128**2 * 3 to 16**2 * 3 * 8, plus some 1x1 kernels. It also made training much faster.
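The savings from the grouped variant above, as arithmetic (counting only the FlexConv weights with Dp = 3, not the 1x1 convs):

```python
# Parameter comparison for the grouped flex-conv trick above
# (Dp = 3 position dims; the 1x1 convs before/after are not counted).
C, groups = 128, 8
dense = C**2 * 3                          # one full flex-conv: 49_152
grouped = (C // groups)**2 * 3 * groups   # eight 16-channel convs: 6_144
print(dense, grouped, dense // grouped)   # 49152 6144 8
```

So the split buys a groups-fold reduction in flex-conv weights, at the cost of two cheap 1x1 mixing layers.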

Something nobody would write in the paper is that ModelNet40 is not helpful at all. Even with bugs we reached pretty high accuracy.

Are you aware of https://github.com/hkust-vgd/scanobjectnn ?

jackd commented 5 years ago

I totally agree ModelNet40 is... odd. I took solace in finding the supplementary material in this paper and seeing the flower-pot vs. plant quiz. I'll have a look at scanobjectnn - thanks!

On the point of pooling, isn't `position = tree[:, 2**level - 1:2**(level + 1) - 1, :]` removing half of the points per iteration? And what is `w` if not the cloud size? And how do you do 7 quarterings of 1024 points?
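For concreteness, my reading of the slicing gives these per-level cloud sizes, which is why I see halving rather than quartering:

```python
# Cloud size at each level touched by the snippet: the initial conv runs at
# level 10, then the loop visits levels 9..3 - each slice halves the cloud.
sizes = [2**level for level in [10, 9, 8, 7, 6, 5, 4, 3]]
print(sizes)  # [1024, 512, 256, 128, 64, 32, 16, 8]
```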

grohf commented 5 years ago

You are right, I'm under time pressure for CVPR, so sorry for answering late (and short). I cannot stress enough how much I dislike ModelNet40 results, for a lot of reasons... But anyhow, I should add some notes :)

First, the model used in the paper was really a dumb dual-flex-conv on only 2 hierarchies and a convolution out of the center, with a 2-FC-layer head at the end. At that time we used the dataset provided by PointNet++ with 10k points. I was asked to redo the inference with multiple random sets, and I have to admit that it varies quite a bit [~88%-92.5%]...

Later on, in a student project, we re-sampled the original dataset in a way that makes sure we only sample on the outer surfaces of objects. (There are a lot of objects with garbage inside...) A simple ResNeXt-style flex-conv without downsampling, plus simple max-pooling over the features into an FC layer, is already enough to get into the same range (and sometimes even better...). That strengthens my assumption that, for object classification on ModelNet40, global features matter more than relative dependencies.

But to make that clear: in my opinion, results above 90% on ModelNet40 should be taken with a big grain of salt.

I hope you got some insights :)

jackd commented 5 years ago

@grohf I appreciate your time - and I agree, pushing the limits of accuracy on modelnet is not a pursuit I'm interested in. My work is looking at simplifying and downscaling models. My own CVPR submission is looking to reproduce this work as a baseline, so I want to do as fair a job as possible.

Could you elaborate on what you mean by "dumb dual-flex-conv"?

In terms of the rest of the training details, I'll be assuming the following, but please let me know if there's anything I should change.

Thanks again.