capstone2019-neuralsearch / AC297r_2019_NAS

Harvard IACS Data Science Capstone: Neural Architecture Search (NAS) with Google

Hand-designed on Graphene (VGG, ResNet, etc.) #3

Closed. dylanrandle closed this issue 4 years ago.

dylanrandle commented 4 years ago

We want to use famous/well-known architectures (cells) on the datasets.

Make this as independent of the dataset as possible.

JiaweiZhuang commented 4 years ago

Even just a tiny CNN can get an R^2 score of 0.927 on the test set and start to overfit (https://github.com/capstone2019-neuralsearch/AC297r_2019_NAS/commit/b4f45867379dda9f1247ff0f790ce7738dabbaf0). A bigger model like ResNet seems like overkill for this small graphene dataset.

It is trained & tested on the 3x5 coarse grid, though. Not sure why the 30x80 fine grid is necessary, given that the two grids have a one-to-one mapping.

JiaweiZhuang commented 4 years ago

I have successfully trained ResNet-18 (in PyTorch) on the 30x80 fine-grid graphene data (https://github.com/capstone2019-neuralsearch/AC297r_2019_NAS/commit/b826349e01f585aaead783e05fed33b8a988e582). The R^2 is 0.87, much higher than a simple 2-layer CNN trained on the same fine-grid data (R^2 ~0.6). The entire pipeline can be reproduced by this Kaggle GPU kernel. (There is also a simple CNN version for reference.)
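
For reference, the R^2 scores quoted throughout are the standard coefficient of determination. A minimal sketch of how they can be computed (the arrays here are dummy placeholders, not the actual predictions from the notebooks):

import numpy as np
from sklearn.metrics import r2_score

# Dummy arrays standing in for the true and predicted regression targets.
y_true = np.array([0.10, 0.25, 0.40, 0.55])
y_pred = np.array([0.12, 0.22, 0.41, 0.50])

# R^2 = 1 - sum((y_true - y_pred)^2) / sum((y_true - mean(y_true))^2)
print(r2_score(y_true, y_pred))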

I also committed a reference PyTorch ResNet implementation trained on CIFAR-10 (~84% accuracy) (https://github.com/capstone2019-neuralsearch/AC297r_2019_NAS/commit/ed117940a9f43c8be62047ca0b3d2d822e3f990c); it can be reproduced at this Colab link.

It would be great to get some GCP/AWS credits so we don't have to rely so heavily on Colab/Kaggle. The free GPUs are great, but the resources are quite limited (only a single K80; training could go much faster on several V100s). Version control is also a bit annoying.

JiaweiZhuang commented 4 years ago

The fine-grid graphene data is a tricky problem, because we can easily hand-design a tiny NN that works very well:

  1. For the first layer, use MaxPooling/AveragePooling to sub-sample the input fine-grid image back down to the coarse grid.
  2. Then, use a tiny CNN to easily get R^2~0.92 on the coarse grid.

This works because both the training and test sets can be perfectly encoded by the much lower-dimensional coarse-grid version.
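
A minimal sketch of that hand-designed baseline (the layer sizes are illustrative assumptions; the only essential part is the 10x16 pooling window, which maps the 30x80 fine grid exactly onto the 3x5 coarse grid):

import torch
import torch.nn as nn

# Rough sketch of the pooling-then-tiny-CNN baseline described above.
# Channel counts and layer sizes are illustrative, not the exact notebook model.
class PoolThenTinyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Step 1: average-pool the 1x30x80 fine-grid input down to the 3x5 coarse grid
        self.pool = nn.AvgPool2d(kernel_size=(10, 16))
        # Step 2: a tiny CNN + linear head on the 3x5 coarse grid
        self.conv = nn.Conv2d(1, 8, kernel_size=3, padding=1)
        self.head = nn.Linear(8 * 3 * 5, 1)

    def forward(self, x):                 # x: (batch, 1, 30, 80)
        x = self.pool(x)                  # -> (batch, 1, 3, 5)
        x = torch.relu(self.conv(x))      # -> (batch, 8, 3, 5)
        return self.head(x.flatten(1))    # scalar regression output

model = PoolThenTinyCNN()
print(model(torch.randn(4, 1, 30, 80)).shape)  # torch.Size([4, 1])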

JiaweiZhuang commented 4 years ago

I got R^2 ~ 0.92 by training ResNet-18 for more epochs. See this Kaggle notebook. This matches the accuracy reported in their original paper (also R^2 ~ 0.92).

JiaweiZhuang commented 4 years ago

Another interpretation of "hand-designed architecture" is to use a residual block for each cell, replacing DARTS's learned cell as shown in their paper:

(screenshot: the learned-cell figure from the DARTS paper)

As I understand it, a residual block inside the DARTS framework would look like the following (a rough sketch follows the list):

  1. c_{k-2} should not be used at all. The macro-cells are just stacked sequentially.
  2. c_{k-1} should go through a conv (which means Conv2d + BN + ReLU) to node 0, and then another conv to node 1.
  3. node 1 is then the cell output.
  4. node 2 and node 3 are not used. Alternatively, they can repeat what node 0 and node 1 do, and then node 3 will be the output.
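
As a concrete illustration of steps 1-3, here is a minimal PyTorch sketch of the computation such a cell would perform. The channel count and the placement of the skip addition are my own assumptions (the skip could equally come from a skip_connect edge); nothing here is taken from the DARTS code:

import torch
import torch.nn as nn

# Hand-designed residual cell, with conv = Conv2d + BN + ReLU as above.
class ResidualCell(nn.Module):
    def __init__(self, channels=16):  # channel count is an illustrative assumption
        super().__init__()
        self.conv0 = nn.Sequential(   # c_{k-1} -> node 0
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU())
        self.conv1 = nn.Sequential(   # node 0 -> node 1
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU())

    def forward(self, c_prev):
        node0 = self.conv0(c_prev)
        node1 = self.conv1(node0)
        return node1 + c_prev         # node 1 (plus the residual skip) is the cell output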

The diagram looks like this (plotted via https://github.com/capstone2019-neuralsearch/AC297r_2019_NAS/commit/f3e647d1e3097fbbc8c17a4b9dd9406548dcb7ee):

(diagram: residual_blocks.png)

or

(diagram: two_blocks.png)

One implementation problem: the DARTS code always concatenates the four intermediate nodes to form the cell output, but here we want just one intermediate node, without any concatenation.
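
A toy illustration of why that matters for the channel bookkeeping (plain tensors, nothing from the DARTS codebase):

import torch

# DARTS concatenates the four intermediate nodes along the channel dimension,
# so a cell whose nodes each have C channels outputs 4*C channels. A single
# residual node would keep C channels, which changes all downstream channel counts.
C = 16
nodes = [torch.randn(2, C, 8, 8) for _ in range(4)]
darts_style_output = torch.cat(nodes, dim=1)  # shape (2, 4*C, 8, 8)
single_node_output = nodes[-1]                # shape (2, C, 8, 8)
print(darts_style_output.shape, single_node_output.shape)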

Another question is whether this whole idea is worth implementing. The project documentation says "ResNets" instead of "Residual blocks":

Understand and compare state-of-the-art architectures → VGG, GoogLeNet, ResNets, DenseNets, Highway Networks, etc.

dylanrandle commented 4 years ago

I also have something similar, although I am using the c_{k-2} output as well, as my "first" residual connection.

(diagram: the hand-designed "normal" cell)

Important note: the visualization does not accurately reflect the concatenation. As can be seen, e.g., here, the concatenation is only applied over the nodes listed in normal_concat or reduce_concat. For example, the code would look like:

# Hand-designed ResNet Architecture
RESNET = Genotype(
    normal=[
        ('skip_connect', 0),
        ('sep_conv_3x3', 1),
        ('skip_connect', 1),
        ('sep_conv_3x3', 2),
        ('skip_connect', 2),
        ('sep_conv_3x3', 3),
        ('skip_connect', 3),
        ('sep_conv_3x3', 4)],
    normal_concat=[5], # whatever the idx of last box is
    reduce=[
        ('skip_connect', 0),
        ('sep_conv_3x3', 1),
        ('skip_connect', 1),
        ('sep_conv_3x3', 2),
        ('skip_connect', 2),
        ('sep_conv_3x3', 3),
        ('skip_connect', 3),
        ('sep_conv_3x3', 4)],
    reduce_concat=[5]) # whatever the idx of last box is
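
For context, the Genotype container in the DARTS repo is (as far as I remember; treat the exact definition as an assumption) just a namedtuple, so the fields above map onto it directly:

from collections import namedtuple

# Genotype container, as defined (from memory) in DARTS's genotypes.py.
Genotype = namedtuple('Genotype', 'normal normal_concat reduce reduce_concat')
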
JiaweiZhuang commented 4 years ago

In order to match the number of parameters in DARTS, I shrank the standard ResNet-18 into a "ResNet-10" and reduced the number of filters by a factor of 8 (from 64, 128, ... to 8, 16, ...). The model now has only 77k parameters, compared to >10,000k parameters in the standard ResNet-18 (see the sizes of common models). The R^2 is still ~0.92. See this Kaggle notebook to reproduce the result. The reference implementation is torchvision.models.resnet.

The difference between "ResNet-10" and ResNet-18 is basically:

from torchvision.models.resnet import ResNet, BasicBlock

def ResNet10(**kwargs):
    # one BasicBlock (2 convs) per stage instead of two
    return ResNet(BasicBlock, [1, 1, 1, 1], **kwargs)

def ResNet18(**kwargs):
    # standard ResNet-18: two BasicBlocks per stage
    return ResNet(BasicBlock, [2, 2, 2, 2], **kwargs)

From the ResNet paper, [2,2,2,2] is the number of residual blocks in each of the four stages:

(screenshot: the architecture table from the ResNet paper)

Changing it to [1,1,1,1] reduces the layer count from 8*2+2 = 18 to 4*2+2 = 10 (each BasicBlock has 2 conv layers, plus the stem conv and the final FC layer).
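
A quick way to sanity-check the depth comparison is to count parameters directly. This sketch uses torchvision's standard base width of 64 filters, so the printed counts correspond to the un-shrunk ResNet-10/ResNet-18, not the narrowed 77k-parameter model; num_classes=1 is an assumption for the regression setup:

from torchvision.models.resnet import ResNet, BasicBlock

def count_params(model):
    return sum(p.numel() for p in model.parameters())

# Standard widths (64, 128, 256, 512); only the depth differs here.
resnet10 = ResNet(BasicBlock, [1, 1, 1, 1], num_classes=1)
resnet18 = ResNet(BasicBlock, [2, 2, 2, 2], num_classes=1)
print(count_params(resnet10), count_params(resnet18))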