D-X-Y / AutoDL-Projects

Automated deep learning algorithms implemented in PyTorch.

[NAS-Bench-201] Questions about the macro skeleton of the architecture and the API #85

Closed · awecefil closed this issue 3 years ago

awecefil commented 3 years ago

First of all, thank you for your excellent work.

Here I have three questions about NAS-Bench-201.

Q1. The cell structure details: each cell in the skeleton has 4 nodes and 6 edges, because the cell is a densely-connected DAG. My question is whether each edge must have exactly 1 operation. Are cells that place more than 1 operation on an edge excluded from the 15,625 candidates?

Q2. The output of each cell: the paper says "each node represents the sum of all feature maps transformed through the associated operations of the edges pointing to this node". Does this mean that the output of each cell is also a "sum" rather than a "concatenate"? I ask because in DARTS and other one-shot methods, the cell output is usually the concatenation of all node outputs in the cell, so the output channel size is the input channel size * the number of nodes.

Q3. The NAS-Bench-201 API: every time I run the API, it takes some time to initialize. Is this normal?

sbl1996 commented 3 years ago

I am not one of the contributors, but I would like to share some of my thoughts.

Q1:

  1. 5^6 = 15,625, which matches one choice of operation on each of the 6 edges (a quick counting sketch is below).
  2. One operation per edge is common in NAS, especially in the popular NASNet search space.
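
A quick counting sketch (my own illustration; the five operation identifiers are the ones NAS-Bench-201 uses):

```python
from itertools import product

# The 5 candidate operations available on each of the 6 edges of the 4-node cell.
OPS = ('none', 'skip_connect', 'nor_conv_1x1', 'nor_conv_3x3', 'avg_pool_3x3')

# One architecture = one operation per edge, over the ordered edges
# (0->1), (0->2), (1->2), (0->3), (1->3), (2->3).
archs = list(product(OPS, repeat=6))
assert len(archs) == 5 ** 6 == 15625
```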

Q2: It seems that the authors want to include ResNet-like architectures in the search space, and that cannot be achieved with "concatenate".

Q3: I am troubled by the same issue. In addition, the initialization process consumes more than 25 GB of memory, which is too much for me and (I believe) for many researchers to afford.

D-X-Y commented 3 years ago

Thanks for your interest @awecefil, and thanks for your answer @sbl1996.

Q1: Each edge has exactly one operation.

Q2: Yes, we want to include ResNet-like architectures; the output of each cell is a "sum", not a "concat" (see the sketch below).
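
Roughly, the per-node computation looks like the following sketch (an illustration of the paper's equation, not the exact repo code):

```python
import torch.nn as nn

class TinyCell(nn.Module):
    """Illustrative 4-node, NAS-Bench-201-style cell: node j is the SUM of
    op_{i->j}(node_i) over all predecessors i, so the channel count never
    changes; DARTS instead concatenates the intermediate node outputs."""

    def __init__(self, ops):
        # ops: dict mapping each edge (i, j) with i < j < 4 to an nn.Module
        super().__init__()
        self.ops = nn.ModuleDict({f'{i}_{j}': m for (i, j), m in ops.items()})

    def forward(self, x):
        nodes = [x]  # node 0 is the cell input
        for j in range(1, 4):
            nodes.append(sum(self.ops[f'{i}_{j}'](nodes[i]) for i in range(j)))
        return nodes[-1]  # the cell output is node 3: same shape as the input
```

Because every node keeps the input's channel count, a residual (sum) connection is well-defined, which is what makes the ResNet-like candidates expressible.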

Q3: It is because there is a lot of information for all 15,625 architectures, and the initialization procedure loads all of it, which takes minutes. As @sbl1996 mentioned, the memory cost might be too large; I am working on optimizing it and will let you know as soon as I have finished.
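
For reference, the slow part is only the one-time construction of the API object; queries afterwards are in-memory lookups. A minimal usage sketch (the .pth filename below is the v1.0 release; adjust it to the file you downloaded):

```python
from nas_201_api import NASBench201API as API

# One-time, slow step: loads the records of all 15,625 architectures into memory.
api = API('NAS-Bench-201-v1_0-e61699.pth')

# After that, queries are cheap, e.g. looking up an index by architecture string:
index = api.query_index_by_arch(
    '|nor_conv_3x3~0|+|none~0|none~1|+|avg_pool_3x3~0|none~1|none~2|')
```

Keeping the api object alive (e.g., in a long-running process or a notebook) avoids paying the load time on every run.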

awecefil commented 3 years ago

Thanks @D-X-Y @sbl1996 for your answers to my questions. They really help me understand the details of this work.

By the way, I have another question about the residual block used in the macro skeleton. Does the structure of the residual block look like this?

[image: a sketch of the proposed residual block, where the input is the output of the last cell in each stage]

D-X-Y commented 3 years ago

No, the structure is defined here: https://github.com/D-X-Y/AutoDL-Projects/blob/master/lib/models/cell_operations.py#L76. The downsampling is performed by the first 3x3 conv with stride = 2, instead of by an avg-pool right after the input.
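
Paraphrased in PyTorch, that block looks roughly like this (a sketch distilled from the linked file; the file itself is authoritative):

```python
import torch.nn as nn

def relu_conv_bn(c_in, c_out, stride):
    # Each unit is ordered ReLU -> Conv3x3 -> BatchNorm.
    return nn.Sequential(
        nn.ReLU(inplace=False),
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(c_out))

class BasicResBlock(nn.Module):
    """Sketch of the macro skeleton's residual block (stride 2, channels x2):
    downsampling happens inside the FIRST 3x3 conv, not via a pool on the input."""

    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv_a = relu_conv_bn(c_in, c_out, stride=2)  # stride-2 downsample here
        self.conv_b = relu_conv_bn(c_out, c_out, stride=1)
        self.shortcut = nn.Sequential(                     # shortcut matches the shape
            nn.AvgPool2d(kernel_size=2, stride=2, padding=0),
            nn.Conv2d(c_in, c_out, kernel_size=1, stride=1, padding=0, bias=False))

    def forward(self, x):
        return self.shortcut(x) + self.conv_b(self.conv_a(x))
```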

awecefil commented 3 years ago

Thanks for your answer! So this is the correct structure, right (where the output channel count is the input channel count * 2)?

[image: revised sketch of the residual block]

D-X-Y commented 3 years ago

Yes, correct.

awecefil commented 3 years ago

@D-X-Y Many thanks for your answers!

awecefil commented 3 years ago

Sorry, I have another question, about using NAS-Bench-201 for one-shot model search. Please look at the two cell architectures below. These two cells are the same, and they should have equal performance: feature maps can only pass through the 3x3 pool (node 0 to node 3), and the feature maps of node 1 and node 2 cannot pass, because both nodes have a 'none' op on their last edge.

However, when I use NAS-Bench-201 to get these two cell architectures' information, their accuracies are not the same. I think that is normal, because there is some randomness when splitting the mini-batches, which causes the weights of the stem and FC layers to differ; but in a one-shot model they "should" have the same performance due to weight sharing.

[images: the two cell architectures and their queried results]

And I think this small performance difference from training the stand-alone networks causes a problem: we cannot compute the correct rank correlation between the subnetworks and the stand-alone networks.

Do you have any suggestions about this problem?

D-X-Y commented 3 years ago

This is a good question. This problem occurs in all cell-based NAS methods, because some "different" cells are isomorphic.

First of all, the stand-alone performances of two isomorphic models are very similar, so I do not think this hurts the correlation much. Second, my code provides a way to identify such isomorphic cells. You could use it to find these cells and treat them as the same model when computing the correlation.
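
For illustration, here is a standalone sketch of the pruning idea behind such a check, covering the dead-path case above rather than full graph isomorphism (the helper names and the reachability rule are my own, not the repo's API):

```python
def parse(arch):
    """'|op~0|+|op~0|op~1|+|op~0|op~1|op~2|' -> {(i, j): op} over the 6 edges."""
    edges = {}
    for j, node in enumerate(arch.split('+'), start=1):
        for part in node.strip('|').split('|'):
            op, i = part.split('~')
            edges[(int(i), j)] = op
    return edges

def canonical(arch, n_nodes=4):
    """Replace with 'none' every edge that cannot carry features from the cell
    input (node 0) to the output (node 3) through non-'none' operations, then
    re-serialize. Cells that differ only on such dead paths map to one string."""
    edges = parse(arch)
    live = {e for e, op in edges.items() if op != 'none'}
    fwd = {0}                            # nodes reachable from the input
    for j in range(1, n_nodes):
        if any((i, j) in live for i in fwd):
            fwd.add(j)
    bwd = {n_nodes - 1}                  # nodes that can reach the output
    for i in range(n_nodes - 2, -1, -1):
        if any((i, j) in live for j in bwd):
            bwd.add(i)
    def keep(i, j):
        return i in fwd and j in bwd
    return '+'.join(
        '|' + '|'.join(f"{edges[(i, j)] if keep(i, j) else 'none'}~{i}"
                       for i in range(j)) + '|'
        for j in range(1, n_nodes))
```

With this, `canonical(a) == canonical(b)` would flag the two cells above as the same model when computing the rank correlation.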