capstone2019-neuralsearch / AC297r_2019_NAS

Harvard IACS Data Science Capstone: Neural Architecture Search (NAS) with Google

Galaxy Zoo #18

Open dylanrandle opened 4 years ago

dylanrandle commented 4 years ago

Try to get state of the art on Galaxy Zoo.

JiaweiZhuang commented 4 years ago

Link to data: https://data.galaxyzoo.org

JiaweiZhuang commented 4 years ago

Kaggle challenge: https://www.kaggle.com/c/galaxy-zoo-the-galaxy-challenge/overview
Winner's code: https://www.kaggle.com/c/galaxy-zoo-the-galaxy-challenge/overview/winners

JiaweiZhuang commented 4 years ago

I uploaded a cleaned-up & compressed version of the Galaxy data at:

Reading the original *.jpg images took 13 minutes; the compressed NetCDF/HDF5 format (loaded via Xarray) takes only about 4 seconds :)
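
A minimal sketch of how the compressed file can be loaded; the file name and variable names below are assumptions, not necessarily what the upload uses:

```python
# Sketch: load the compressed Galaxy data with xarray instead of decoding
# tens of thousands of individual JPEGs. File/variable names are placeholders.
import xarray as xr

ds = xr.open_dataset("galaxy_train.nc")   # NetCDF/HDF5 file from the upload
images = ds["image"].values               # e.g. (n_samples, height, width, channel) uint8
labels = ds["label"].values               # e.g. (n_samples, 37) float probabilities
print(images.shape, labels.shape)
```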

The compressed data were generated by these notebooks:

Then I tried a tiny CNN on the data, and got a training RMSE of ~0.09, validation RMSE of ~0.1, and a test RMSE of ~0.105 (obtained by uploading a CSV file to the original Kaggle challenge). Notebooks:

For reference, the top score on the leaderboard is an RMSE of ~0.075. I think some data augmentation is needed to reach such a high score...
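
For concreteness, a rough sketch of the kind of tiny CNN and RMSE metric described above; the architecture details are assumptions, not the exact notebook code:

```python
# Sketch of a tiny CNN regressing the 37 Galaxy Zoo probabilities, scored by RMSE.
# Layer sizes are illustrative guesses, not the actual notebook architecture.
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, n_outputs=37):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, n_outputs)

    def forward(self, x):
        x = self.features(x).flatten(1)
        return torch.sigmoid(self.head(x))   # targets are probabilities in [0, 1]

def rmse(pred, target):
    # Kaggle evaluates root-mean-squared error over all 37 output columns.
    return torch.sqrt(nn.functional.mse_loss(pred, target))
```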

@memanuel Could you briefly summarize the results with DARTS? How do we get the test accuracy? I guess the only way is to upload a CSV to Kaggle?

memanuel commented 4 years ago

I believe the only way to find a test accuracy is to run the model on the test images and upload the predictions to Kaggle. I have not yet done this.
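
A sketch of what that submission step could look like; the benchmark CSV used for the column names, the load_test_images() helper, and the trained model are placeholders:

```python
# Sketch: run a trained model over the test images and write one row of 37
# predicted probabilities per GalaxyID in Kaggle's submission format.
# "all_ones_benchmark.csv" and load_test_images() are assumed placeholders;
# "model" is a trained network such as the tiny CNN above.
import pandas as pd
import torch

sample = pd.read_csv("all_ones_benchmark.csv")   # provides GalaxyID + 37 column names
test_ids, test_images = load_test_images()       # hypothetical helper: ids and an image tensor

model.eval()
with torch.no_grad():
    preds = model(test_images).clamp(0, 1).cpu().numpy()

submission = pd.DataFrame(preds, columns=sample.columns[1:])
submission.insert(0, "GalaxyID", test_ids)
submission.to_csv("submission.csv", index=False)
```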

JiaweiZhuang commented 4 years ago

Tried ResNet-18 and "ResNet-10" (defined in https://github.com/capstone2019-neuralsearch/AC297r_2019_NAS/issues/3#issuecomment-541216854); both reach a training RMSE of 0.06 after 15 epochs and 0.04 after 30 epochs, but the validation RMSE plateaus at ~0.1. Severe overfitting.

Notebook: https://www.kaggle.com/zhuangjw/galaxy-resnet-pytorch?scriptVersionId=22948695
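
For reference, adapting torchvision's ResNet-18 to this regression task looks roughly like the sketch below; the linked notebook's exact settings may differ:

```python
# Sketch: repurpose torchvision's ResNet-18 for 37-output probability regression.
# Training hyperparameters in the linked notebook may differ.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18()                          # random init; trained from scratch on galaxy images
model.fc = nn.Linear(model.fc.in_features, 37)     # replace the 1000-class ImageNet head

def forward_probs(x):
    return torch.sigmoid(model(x))                 # squash outputs into [0, 1] before computing RMSE
```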

DARTS might do a better job, since it effectively optimizes for validation loss. Not sure how far it can go without the data augmentation (rotation, zoom-in, etc.) used by the winning solutions.
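
A sketch of the kind of augmentation pipeline that could help, using torchvision transforms; the specific parameter values are guesses, not what the winning solutions actually used:

```python
# Sketch of an augmentation pipeline for galaxy images: random rotation,
# random zoom/crop, and flips. Parameter values here are illustrative guesses.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomRotation(degrees=360),                # galaxies have no preferred orientation
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),   # mild random zoom
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.ToTensor(),
])
```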