SamsungLabs / zero-cost-nas

Zero-Cost Proxies for Lightweight NAS
Apache License 2.0
140 stars 20 forks

Why does the function find_measures run after train? #4

Closed zhengjian2322 closed 3 years ago

zhengjian2322 commented 3 years ago

Hello, I really enjoyed your paper! I have a question about the code in nasbench2_train.py. Why does the function find_measures run after train? Will that affect the outcome?

mohsaied commented 3 years ago

Thanks! We used nasbench2_train.py to find the econas baselines. We also computed the zero-cost metrics in find_measures at each epoch to see how the metrics change over time. However, these results were not presented in the paper.

To compute the measures found in the paper, use nasbench2_pred.py.
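For later readers, here is a minimal sketch of invoking find_measures directly on a single network. The import path and the signature (in particular the dataload_info tuple and the measure names) follow my reading of foresight/pruners/predictive.py and may differ between versions, so treat them as assumptions:

# A sketch, not the exact repo API: the find_measures signature and the
# ('random', num_batches, num_classes) tuple are assumptions.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from foresight.pruners import predictive

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Any torch network works here; a tiny CNN stands in for a NAS-Bench-201 cell.
net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10)).to(device)

# A dummy CIFAR-like loader; in practice use the real training dataloader.
data = TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,)))
loader = DataLoader(data, batch_size=64)

measures = predictive.find_measures(net, loader,
                                    ('random', 1, 10),  # (dataload type, num batches, num classes)
                                    device,
                                    measure_names=['snip', 'synflow', 'jacob_cov'])
print(measures)  # e.g. {'snip': ..., 'synflow': ..., 'jacob_cov': ...}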

zhengjian2322 commented 3 years ago

Thank you for your answer. I have another question: why do you use abs when calculating the Spearman coefficients for nasbench101?

mohsaied commented 3 years ago

Good observation. You don't need to use abs. It was just convenient when quickly plotting different metrics on the same plot and comparing them, since some are negatively correlated and others are positively correlated.
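As a small illustration of the point above, here is a sketch using scipy (the accuracy and proxy values are made up): taking the absolute value of the Spearman coefficient simply puts positively and negatively correlated metrics on the same scale.

# abs(rho) lets metrics with opposite correlation signs be compared on one
# plot; the accuracy/proxy values below are made up for illustration.
import numpy as np
from scipy.stats import spearmanr

test_acc  = np.array([91.2, 88.5, 93.1, 85.0, 90.4])
proxy_pos = np.array([2e5, 9e4, 5e5, 3e4, 1.5e5])  # positively correlated proxy
proxy_neg = -proxy_pos                             # an anti-correlated proxy

for name, proxy in [('proxy_pos', proxy_pos), ('proxy_neg', proxy_neg)]:
    rho, _ = spearmanr(proxy, test_acc)
    print(name, rho, abs(rho))  # abs() maps +1.0 and -1.0 to the same value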

zhengjian2322 commented 3 years ago

Thank you for your prompt reply!

  1. Are all of the zero-cost proxies (snip, synflow, jacob_cov) mentioned in the paper "the bigger the better"?
  2. I used the code in this repository to reproduce the zero-cost proxy results for ASR (the PyTorch model from the ASR GitHub repo), but I did not achieve the results in the paper. Could you tell me how you did it?

mohsaied commented 3 years ago

Hi again,

  1. Yes, for these three metrics, the bigger the better.
  2. For NAS-Bench-ASR we computed the metrics in the standard way, following the template provided for NAS-Bench-101 and NAS-Bench-201. Nothing is different. Can you please share more information on how your reproduction of these results differs? For example, what correlation coefficient do you get?

I can look into rerunning the computation of the proxies for NAS-Bench-ASR on my side and releasing an additional file showing how we did it. It will probably take a few days for me to get to it, though.
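As a side note for readers: since all three proxies are treated as "bigger is better", selecting candidates from a set of scored architectures is just a sort by raw score. A tiny, made-up illustration:

# With "bigger is better" proxies, the best candidates are simply the ones
# with the largest scores. The scores and names below are made up.
scores = {'arch_a': 1.3e5, 'arch_b': 4.2e4, 'arch_c': 8.9e5, 'arch_d': 2.1e5}
top_k = sorted(scores, key=scores.get, reverse=True)[:2]
print(top_k)  # ['arch_c', 'arch_d']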

zhengjian2322 commented 3 years ago

Thanks. For synflow, I only changed the function get_layer_metric_array in p_utils and built the network from https://github.com/SamsungLabs/nb-asr/blob/main/nasbench_asr/model/torch/model.py with different arch_desc values. I found that the synflow value is much smaller than that for NAS-Bench-201, and that it is also non-zero for configurations that do not constitute a valid network (because of the last linear layer in the ASR macro-architecture).

The modified get_layer_metric_array (with nn.Conv1d added) is here:

import torch.nn as nn

def get_layer_metric_array(net, metric, mode):
    metric_array = []
    for layer in net.modules():
        # Skip layers that are explicitly excluded from channel pruning.
        if mode == 'channel' and hasattr(layer, 'dont_ch_prune'):
            continue
        # nn.Conv1d is added so that the 1-D convolutions in the ASR models
        # are included when collecting per-layer metrics.
        if isinstance(layer, (nn.Conv2d, nn.Linear, nn.Conv1d)):
            metric_array.append(metric(layer))

    return metric_array

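For reference, here is a minimal sketch of how a per-layer synflow-style score is typically computed (following Tanaka et al.'s SynFlow and the general structure of this repo's implementation; details such as the all-ones input shape, double-precision handling, and sign restoration are simplified and should be treated as assumptions):

# Sketch of per-layer SynFlow scores: linearize the parameters (|w|), forward
# an all-ones input, backprop the summed output, then score each layer as
# sum(|w * dw|). Simplified; see the note above for assumptions.
import torch
import torch.nn as nn

def synflow_scores(net, input_shape):
    # 1) Linearize: replace every parameter/buffer with its absolute value
    #    (state_dict() tensors share storage with the model, so abs_() works
    #    in place without tripping autograd).
    signs = {}
    for name, p in net.state_dict().items():
        signs[name] = torch.sign(p)
        p.abs_()

    # 2) Forward a tensor of ones shaped like the network's input (e.g. a
    #    dummy audio-feature tensor for the ASR models), then backprop.
    net.zero_grad()
    device = next(net.parameters()).device
    out = net(torch.ones(input_shape, device=device))
    torch.sum(out).backward()

    # 3) Per-layer score: sum of |weight * grad| over conv/linear layers.
    scores = []
    for layer in net.modules():
        if isinstance(layer, (nn.Conv2d, nn.Conv1d, nn.Linear)) and layer.weight.grad is not None:
            scores.append((layer, (layer.weight * layer.weight.grad).abs().sum().item()))
    # Restoring the original parameter signs is omitted here for brevity.
    return scores

This also hints at why a degenerate configuration can still receive a nonzero synflow value, as observed above: any conv/linear layer on the path from input to output (such as the final linear layer in the ASR macro-architecture) still contributes to the score.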
mohsaied commented 3 years ago

A few changes may be required; let me reopen this issue and we'll look into it.

zhengjian2322 commented 3 years ago

Thank you very much for your attention. I would appreciate it if you could provide your code for the ASR task or the zero-cost proxy results for ASR.

vaenyr commented 3 years ago

Hello, I know Mohamed and Abhinav are working on adding proper support for nb-asr in the code. In the meantime, I've added pickle files containing precomputed metrics to the Google Drive (https://drive.google.com/drive/folders/1fUBaTd05OHrKIRs-x9Fx8Zsk5QqErks8?usp=sharing). These are the files we used to run NAS experiments, although it's been a while, so please let us know if there are any problems. Hope that helps!
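For readers who grab those files: the exact layout of the pickles is not described in this thread, so the snippet below is only a generic way to load and inspect one; the file name and the assumed dict structure are placeholders.

# Generic inspection of a precomputed-metrics pickle. The file name and the
# assumption that it holds a dict keyed by architecture are placeholders.
import pickle

with open('nb_asr_zero_cost_metrics.pickle', 'rb') as f:  # hypothetical name
    data = pickle.load(f)

print(type(data))
if isinstance(data, dict):
    first_key = next(iter(data))
    print(first_key, data[first_key])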

zhengjian2322 commented 3 years ago

Thank you so much, the ASR results you shared are very helpful to me.

zhengjian2322 commented 3 years ago

I have another question. Why does BP outperform BP + warmup (256) after 200 trained models in the paper (Figure 4d)? What do you think is the reason?

vaenyr commented 3 years ago

Hi, I haven't checked the results carefully so take my words with a pinch of salt, but in my experience a difference like that is usually not meaningful. It can very well be just a result of averaging, as suggested by the fact that eventually both methods are pretty close and well within each other's IQR (on the other hand, the difference between the two methods in the range of 0-100 models seems much more significant to me, as the IQRs are much more disjoint).

Alternatively, if we assume that the difference is meaningful, a valid hypothesis would be that warming up a predictor makes later parts of the predicted ranking worse while improving the earlier ones. To test it, one could train a predictor with and without zero-cost warmup, get the overall ranking correlation in each case, and compare.

What is more, I would also try to link it to the warmup sample size (we can see that warmup with 512 does significantly better). Since the sample is completely random, it is possible that the average performance of the warmed-up predictor, in the latter parts of the ranking, depends more on the quality of the initial samples. Some possible tests here would include warming up predictors with cherry-picked samples (e.g., only bad models, only good models, etc.) and maybe using a similar iterative scheme for warmup as the one we use for accuracy prediction (to maximize the chance of having good models in the warmup pool).
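A toy, self-contained sketch of the first test suggested above (overall ranking correlation with vs. without zero-cost warmup); the search space, proxy, and ridge-regression predictor are entirely synthetic stand-ins, not the predictor used in the paper.

# Toy comparison: ranking correlation of a simple predictor trained with vs.
# without a zero-cost "warmup" set. Everything here is synthetic/illustrative.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Synthetic search space: 1000 architectures, 8 features; accuracy is a noisy
# function of the features, and the zero-cost proxy is a noisier version of it.
X = rng.normal(size=(1000, 8))
acc = X @ rng.normal(size=8) + 0.3 * rng.normal(size=1000)
proxy = acc + 1.0 * rng.normal(size=1000)

def fit_ridge(X, y, lam=1.0):
    # Ridge regression as a stand-in for the accuracy predictor.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

labelled = rng.choice(1000, size=20, replace=False)  # 20 fully trained models
warmup = rng.choice(1000, size=256, replace=False)   # 256 zero-cost warmup samples

w_plain = fit_ridge(X[labelled], acc[labelled])
# Crude "warmup": blend a proxy-fitted solution with the labelled-only one.
w_warmed = 0.5 * (fit_ridge(X[warmup], proxy[warmup]) + w_plain)

for name, w in [('no warmup', w_plain), ('warmup', w_warmed)]:
    rho, _ = spearmanr(X @ w, acc)
    print(name, round(float(rho), 3))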

zhengjian2322 commented 3 years ago

Thank you for your detailed answers. Zero-cost proxies is very nice work; I will continue to follow it.

mohsaied commented 3 years ago

Closing this issue as the immediate issues seem to be solved. It still remains to provide implementations for NAS-Bench-ASR/NLP, but these are covered by other issues.