Closed by Hrant-Khachatrian 10 months ago
Thanks for your interest in the project!
Yes, we used 50 images in total, not per class. Note that performance on FMoW is much worse when only 50 examples are used compared to training on the whole dataset.
The specific code that filters the data down to 50 examples is in these lines. We sort the tiles by their hash so the selection is deterministic (at least when run on the same machine). I can work on getting the lists for the downstream tasks.
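For illustration, here is a minimal sketch of that kind of hash-based selection (not the project's actual code; the real implementation may hash a string such as `filename_0`, as discussed below):

```python
import hashlib

def select_k_examples(example_ids, k=50):
    # Sort IDs by their MD5 hash so the chosen subset is deterministic,
    # independent of the order in which files are listed on disk.
    ordered = sorted(example_ids, key=lambda x: hashlib.md5(x.encode()).hexdigest())
    return ordered[:k]

# Hypothetical usage: example_ids would come from the downstream dataset's index.
# subset = select_k_examples(all_example_ids, k=50)
```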
Thanks for the quick response!
So you use md5(filename_0)? I am confused about whether example_id is the filename from the downstream task's folder, because sometimes it seems to refer to a folder name instead.
It would be great if you could provide the lists. We ran k-NN on frozen pretrained representations from one of our models (no fine-tuning on the downstream task) using two different subsets of size 50, and got a sizable difference: 0.35 vs. 0.42 accuracy on UCMerced.
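For context, here is a rough sketch of the kind of k-NN probe we ran, assuming the embeddings have already been extracted with the frozen backbone (variable names are placeholders):

```python
from sklearn.neighbors import KNeighborsClassifier

def knn_probe(train_feats, train_labels, test_feats, test_labels, k=5):
    # train_feats: (50, D) frozen-backbone embeddings of the sampled subset.
    # test_feats:  (N, D) embeddings of the held-out evaluation split.
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)  # top-1 accuracy

# Two different 50-example subsets gave noticeably different accuracies:
# acc_a = knn_probe(feats_subset_a, labels_a, feats_test, labels_test)
# acc_b = knn_probe(feats_subset_b, labels_b, feats_test, labels_test)
```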
We can try to get the lists, but for a comparison experiment it would be best to fine-tune the two models in a consistent way. Even if you fine-tune on the same training examples, there may be differences in our evaluation framework (e.g. for FMoW we adopt a simpler classification-style evaluation to avoid the need for a custom head for that dataset), data augmentations, backbone freezing/unfreezing, or warmup learning rate that would make the comparison inconsistent.
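To make that concrete, here is a rough sketch (not our actual training code) of the knobs that should be held fixed for both models; the optimizer, learning rate, and warmup length below are placeholder choices, and it assumes the model exposes a `backbone` attribute:

```python
import torch

def build_optimizer(model, freeze_backbone=False, base_lr=1e-4, warmup_steps=500):
    # Freeze or unfreeze the backbone the same way for both models.
    for p in model.backbone.parameters():  # assumes a .backbone attribute
        p.requires_grad = not freeze_backbone

    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(params, lr=base_lr)

    # Use the same linear warmup schedule for both models.
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda step: min(1.0, (step + 1) / warmup_steps)
    )
    return optimizer, scheduler
```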
Also, in Table 3 of the paper we report accuracy when fine-tuning on the entirety of each downstream dataset, in addition to using 50 random training examples. But it would still be better to fine-tune both models in a new experiment in a consistent way rather than compare against the numbers we report in our paper.
The fine-tuning could be done either in this codebase (in which case your model weights would need to be adapted to a format that this code can load), or by loading the backbone and fine-tuning it with different code. We have Resnet50/Resnet152 backbones available for our Sentinel-2 models, in case that makes it easier to use the model in your existing code, although the Swin Transformer performs better in all of our benchmarks.
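If you go the second route, here is a rough sketch of loading a Resnet50 backbone into torchvision and attaching a new classification head; the checkpoint path and the state-dict key prefix are assumptions rather than the exact format of our released weights:

```python
import torch
import torchvision

def load_resnet50_backbone(checkpoint_path, num_classes):
    model = torchvision.models.resnet50(weights=None)
    state = torch.load(checkpoint_path, map_location="cpu")
    # The checkpoint may nest backbone weights under a prefix such as
    # "backbone."; strip it if necessary (the prefix here is an assumption).
    state = {k.replace("backbone.", "", 1): v for k, v in state.items()}
    model.load_state_dict(state, strict=False)
    # Replace the final layer with a head for the downstream task.
    model.fc = torch.nn.Linear(model.fc.in_features, num_classes)
    return model
```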
At any rate, here are the image IDs for the UC Merced land cover dataset that we use for the 50 training examples. They correspond to our version of the dataset available at https://pub-956f3eb0f5974f37b9228e0a62f449bf.r2.dev/satlaspretrain_finetune.tar, which has been pre-processed for compatibility with this codebase (I don't think the IDs here correspond to anything in the original UC Merced land use dataset):
["1074", "1099", "1106", "1113", "1118", "1227", "1240", "1282", "1291", "142", "1489", "1505", "1513", "1543", "1558", "1640", "1676", "168", "1692", "1719", "1725", "1739", "1760", "18", "1835", "1951", "1969", "1997", "2006", "2023", "2049", "2064", "2080", "2092", "245", "312", "335", "399", "441", "46", "490", "567", "654", "695", "720", "753", "81", "831", "912", "964"]
Here are the image IDs for AID; these do correspond to filenames in the original dataset.
["airport_179", "airport_229", "bareland_216", "bareland_243", "bareland_297", "bridge_175", "bridge_313", "commercial_138", "denseresidential_136", "farmland_148", "farmland_295", "farmland_315", "farmland_71", "forest_12", "industrial_389", "meadow_238", "meadow_44", "meadow_98", "mediumresidential_224", "mountain_160", "mountain_317", "mountain_86", "mountain_92", "park_168", "park_283", "park_319", "park_86", "parking_1", "parking_128", "parking_178", "parking_53", "playground_89", "pond_315", "port_322", "port_371", "railwaystation_158", "river_85", "school_153", "sparseresidential_106", "sparseresidential_286", "sparseresidential_291", "square_42", "square_6", "stadium_137", "stadium_171", "storagetanks_288", "storagetanks_310", "storagetanks_312", "viaduct_262", "viaduct_387"]
And here are the IDs for Mass Buildings and Mass Roads, respectively; these also correspond to filenames in the original datasets.
["22678930_15", "22679020_15", "22679050_15", "22828915_15", "22828945_15", "22828960_15", "22829005_15", "22978870_15", "22978885_15", "22978975_15", "22979035_15", "23128885_15", "23128900_15", "23128945_15", "23129005_15", "23129035_15", "23129065_15", "23129155_15", "23278885_15", "23278930_15", "23278960_15", "23278975_15", "23278990_15", "23279005_15", "23279050_15", "23279080_15", "23279170_15", "23428915_15", "23428975_15", "23429035_15", "23429170_15", "23578915_15", "23578930_15", "23578990_15", "23579080_15", "23729080_15", "23729095_15", "23729110_15", "23878930_15", "23878945_15", "23878990_15", "23879020_15", "23879035_15", "23879050_15", "23879065_15", "24328840_15", "24328870_15", "24329020_15", "24329035_15", "24479005_15"]
["10228660_15", "10228675_15", "10528795_15", "10828645_15", "10828795_15", "10978645_15", "10978870_15", "11428675_15", "11578675_15", "11728675_15", "11728705_15", "12328735_15", "16078915_15", "17278780_15", "17428945_15", "17728735_15", "18928735_15", "20428945_15", "20729005_15", "21329035_15", "21479050_15", "22528915_15", "22529395_15", "22679020_15", "22828990_15", "22829470_15", "22829485_15", "22978840_15", "22978975_15", "23129110_15", "23129410_15", "23279110_15", "23428915_15", "23428975_15", "23578960_15", "23728930_15", "23878930_15", "24029215_15", "24328825_15", "24328870_15", "24329185_15", "24479245_15", "24629215_15", "24778780_15", "25379260_15", "25978735_15", "26128780_15", "26129245_15", "26728675_15", "26878705_15"]
But I would still recommend fine-tuning both models in a new experiment to make everything consistent. Even if you cannot get the Swin Transformer model to work, I think the Resnet50 backbone could be transferred. If you don't use PyTorch it could be more difficult, but we could look into whether there's a way to adapt the Resnet50 model for e.g. TensorFlow.
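One possible route (a sketch, not something we have tested) would be to export the Resnet50 backbone to ONNX and then load it from TensorFlow with a converter such as onnx-tf:

```python
import torch
import torchvision

# Assumes the backbone has already been loaded as sketched above; the input
# shape (3-channel, 224x224) is an assumption and would need to match the
# number of input bands the model was trained with.
backbone = torchvision.models.resnet50(weights=None)
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    backbone, dummy, "resnet50_backbone.onnx",
    input_names=["image"], output_names=["output"],
    dynamic_axes={"image": {0: "batch"}},
)
```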
Regarding the 50-example setup, I believe reporting performance with different-sized subsets of the downstream datasets is common in computer vision pre-training work; see e.g. Figure 3 of the SeCo paper. It can help show the benefit of a pre-training method for downstream tasks with few examples, since if a downstream dataset has many examples the pre-training provides a reduced benefit. Pre-training also usually provides a reduced benefit for downstream detection/segmentation tasks compared to downstream classification tasks.
Let me know if you could use any other pointers on how to load the SatlasPretrain backbone, or on using the example ID lists above for fine-tuning.
Thanks for the great project.
We are trying to compare our model's performance to SatlasPretrain on downstream tasks. The 50-example fine-tuning is an interesting setup that we haven't seen anywhere else. Two quick questions: are the 50 training examples in total or per class, and could you share the exact lists of examples you used?