Closed ultmaster closed 4 years ago
Thank you for your very detailed issue!
Regarding point 1: I agree with you that the wording in our paper is suboptimal. What we meant to say for search space 3 is: 'All available intermediate nodes (up to 5) are used in this search space'. Let me know if you feel like we could still improve the sentence otherwise we will update the paper accordingly.
Regarding point 2: I am looking into this and will get back to you as soon as possible :)
In case a node is never used, what does "no. of parents" of this node mean?
Nevertheless, it's just a literature problem. I'm okay either way.
To have a fixed discretization strategy as in DARTS we select the parents for each intermediate nodes, based e.g. in DARTS on the architectural weights. Therefore, there are no unused nodes, since every node will have parents. However, there are the loose ends as discussed in the paper. Hope this clarifies things. Will get back to you on point 2 of your initial comment.
We are adding the updated version of the search space counting in the new version and are in the process of updating the paper. These are the updated numbers. Thank you again @ultmaster for pointing out the incorrect counting of the graphs w/o loose ends!
First, thanks for this open source repo!
I tried to reproduce your search space 3 before this code is released. I found 445 different topologies with loose ends (allowing input->output, which I will raise another issue), altogether
445*3^5
= 108135 architectures. But your paper and code only reports 234 different topologies (24066 in all !=234*3^5
). Digging deeper, I found two reasons for this:Node 1 is never used (no input, no output). Of course, your code handles this properly, by ignoring not-used nodes in architecture validation. But I don't think it's consistent with "All 5 intermediate nodes are used in this search space" in paper.
To visualize, I fill node 1-5 with conv3x3.
Interestingly, generating without loose ends uses a completely different code from with loose ends. The core of this part is:
The problem here is: backtracking info is not kept when traversing the second node. In the example above, when visiting node 5, all edges connecting to node 4 is lost. Although some graphs are yielded during traverse of node 4, traverse of node 5 is still based on the graph of processing node 6. Therefore, in a word, this code cannot generate DAG with "branches".
Maybe this is done delibrately because some of the downstream optimizers cannot handle branch structure. But as far as I know, this is not mentioned in either your paper or related literature.
I'll really appreciate it if you are able to share more details.