About reproducing `k15_accept0_th0.66` by simple_api.py

luzai commented 5 years ago

Thank you for your great paper and repo!

I think that to reproduce the results of the k15_accept0_th0.66 model, i.e.,

strategy	#model	setting	prec, recall, fscore
vote	1	k15_accept0_th0.66	89.35, 88.98, 89.16

we may not need to normalize the distance as shown here.

XiaohangZhan commented 5 years ago

Thanks for pointing it out. You are right. Besides, I found some other problems when testing simple_api.py, so I have removed it for now. Just use `python -u main.py --config experiments/emore_u200k_single/config.yaml`.

luzai commented 5 years ago

Thank you for your quick response!

I guess one of the other problems may be that NearestNeighbors from sklearn is not scalable enough. I will read main.py in more detail. Thank you for your nice code!

XiaohangZhan commented 5 years ago

I updated the single model API. Please read the README and have a try.

luzai commented 5 years ago

Thank you so much for updating the single model API! The results are reproducible and the simple_api is quite flexible!

engmubarak48 commented 5 years ago

@XiaohangZhan Is it possible to save the clustered images into different folders?

Assume that I saved the path of each image to the "list.txt" file and, for images that contain more than one face, saved both the image path and the bounding box of each face.

For example, my list.txt can have the following format, where the file has as many lines as there are examples/faces:

/home/mubarak/face_clustering/koc/jpg/D0225_A0411_S07_R013.jpeg [605, 599, 922, 982]
/home/mubarak/face_clustering/koc/jpg/D0225_A0411_S07_R013.jpeg [1573, 533, 1883, 915]

For now, your main.py script only outputs a meta.txt file, which is somewhat difficult to interpret.

Thanks

XiaohangZhan commented 5 years ago

Hi, the meta.txt contains the clustering results in the form of image labels. You can easily save your images into different folders according to the labels, i.e., images with the same label should go to the same folder. The examples labeled -1 are discarded. As for list.txt, the code does not read the content of this file. It only gets the length of the list. So as long as the length is correct, the content does not matter.
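
A minimal sketch of that folder step, not part of the repo. It assumes meta.txt holds one integer label per line, that list.txt holds one image path per line in the same order, and that the helper names (`read_lines`, `group_by_label`, `copy_clusters`) are made up for illustration:

```python
import shutil
from pathlib import Path


def read_lines(path):
    """Return the non-empty lines of a text file, stripped."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def group_by_label(image_paths, labels):
    """Map each cluster label to its image paths, skipping label -1."""
    if len(image_paths) != len(labels):
        raise ValueError("list and meta files must have the same length")
    clusters = {}
    for path, label in zip(image_paths, labels):
        if label == -1:  # discarded examples, as described above
            continue
        clusters.setdefault(label, []).append(path)
    return clusters


def copy_clusters(clusters, out_dir):
    """Copy every image into out_dir/<label>/, keeping the file name."""
    for label, paths in clusters.items():
        target = Path(out_dir) / str(label)
        target.mkdir(parents=True, exist_ok=True)
        for src in paths:
            shutil.copy(src, target / Path(src).name)


# Usage (assumed file names):
#   paths = read_lines("list.txt")
#   labels = [int(x) for x in read_lines("meta.txt")]
#   copy_clusters(group_by_label(paths, labels), "clusters")
```

Images that share a file name would overwrite each other inside a cluster folder, so a real script may need to rename the copies.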

engmubarak48 commented 5 years ago

Thanks for your reply, I already thought of this but I wanted to confirm with you.

XiaohangZhan commented 5 years ago

Cheers!

engmubarak48 commented 5 years ago

Could you also clarify what to do when I have more than one face in an image?

For example:

/home/mubarak/face_clustering/koc/jpg/D0225_A0411_S07_R013.jpeg [605, 599, 922, 982]
/home/mubarak/face_clustering/koc/jpg/D0225_A0411_S07_R013.jpeg [1573, 533, 1883, 915]

XiaohangZhan commented 5 years ago

When there are multiple faces in an image, you need to prepare face features for each FACE, rather than for each image. For example, crop the image into several face images according to the bounding boxes, and use a trained face recognition network to extract features. And again, each line of the output meta.txt corresponds to one face, which is consistent with your list file.
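
A rough sketch of the cropping step, assuming each list line looks like `path [x1, y1, x2, y2]` with the box given as left, top, right, bottom. That ordering is a guess from the example lines, and `parse_face_line` and `crop_face` are illustrative names. Feature extraction is left out because it depends on your recognition model:

```python
import ast


def parse_face_line(line):
    """Split 'path [x1, y1, x2, y2]' into (path, (x1, y1, x2, y2))."""
    path, sep, box_text = line.strip().rpartition(" [")
    if not sep:
        raise ValueError(f"no bounding box found in: {line!r}")
    box = ast.literal_eval("[" + box_text)
    if len(box) != 4:
        raise ValueError(f"expected 4 box values in: {line!r}")
    return path, tuple(int(v) for v in box)


def crop_face(path, box):
    """Crop one face with Pillow; box is assumed to be (left, top, right, bottom)."""
    from PIL import Image  # imported lazily so parsing works without Pillow

    with Image.open(path) as img:
        return img.crop(box)
```

Each cropped face then yields one feature vector, in the same order as the lines of the list file.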

engmubarak48 commented 5 years ago

Thanks, I had thought of this as well, but was hoping for a shorter way. Anyhow, thank you very much, I appreciate your work.

engmubarak48 commented 5 years ago

Dear @XiaohangZhan,

I have successfully sorted my images into the folders they belong to. As you mentioned, the examples labeled -1 are discarded. But are the examples labeled 0 treated the same way? The images in the 0 folder are very dissimilar to each other (just like those labeled -1). The rest of the labels seem to be okay.

If you want, I can share the folders with you so you can check, but I don't know why folder/label 0 is like this.

XiaohangZhan commented 5 years ago

Those labeled 0 should form a proper cluster, so this indicates the result is not good enough. I recommend creating a small test set to find the best hyper-parameters for your data. Also try some baseline methods using run_baselines.sh.

felixfuu commented 5 years ago

@XiaohangZhan Do I need to retrain the mediator model (in labeled/emore_l200k/models/k15_110.pth.tar) when extracting features using my own model?

XiaohangZhan commented 5 years ago

Yes, sure. The training features and testing features should be extracted using the same model.

felixfuu commented 5 years ago

Do I only need to set force_retrain: True to train the mediator model? Do I need to modify other hyper-parameters? @XiaohangZhan

XiaohangZhan commented 5 years ago

I updated the README. See step 7 under Using your own data. If you are using mediator mode, please set "train_data_name" to your data, e.g., "labeled/mydata". You may also want to adjust the "threshold" to obtain a good result on your validation set. The "force_retrain" option can always be False. In this case, the system first tries to find the model, e.g., "labeled/mydata/models/k15_111.pth.tar". If it does not exist, the system trains this model using the training set. Then, when you want to adjust other parameters, this trained model will be loaded rather than retrained.

felixfuu commented 5 years ago

@XiaohangZhan Can I use simple_api.py to process 10 million samples?

XiaohangZhan commented 5 years ago

Give it a try if your server has enough memory. Remember to pull the latest single_api.py, which uses NMSLIB for KNN computation.

qiudi0127 commented 4 years ago

@XiaohangZhan Hi, if I get a memory error when retraining the mediator model, do you have a better solution than PCA reduction or reducing the number of pairs, given that my server's memory cannot be increased? Thanks.

XiaohangZhan commented 4 years ago

You may split your data into several batches so that each batch fits in your memory. However, it will degrade the recall. Give it a try to see whether it matters.
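
A small sketch of the batching idea, under the assumption that clustering is wrapped in some callable `cluster_fn` that returns one integer label per feature, with -1 for discarded examples. Both `cluster_fn` and `toy_cluster` are hypothetical stand-ins, not the repo's API:

```python
def cluster_in_batches(features, batch_size, cluster_fn):
    """Cluster each batch independently and give each batch its own label range."""
    labels = []
    offset = 0
    for start in range(0, len(features), batch_size):
        batch_labels = cluster_fn(features[start:start + batch_size])
        kept = [label for label in batch_labels if label != -1]
        # Shift labels so clusters from different batches never collide.
        labels.extend(label if label == -1 else label + offset for label in batch_labels)
        if kept:
            offset += max(kept) + 1
    return labels


def toy_cluster(batch):
    """Stand-in clusterer: identical items get the same label within a batch."""
    ids = {}
    return [ids.setdefault(item, len(ids)) for item in batch]


labels = cluster_in_batches(["a", "a", "b", "a", "b", "b"], 3, toy_cluster)
print(labels)
```

In the toy run the items "a" end up under two different labels because they fall into different batches, which is the recall drop mentioned above.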

qiudi0127 commented 4 years ago

@XiaohangZhan Thanks for your reply, I'll give it a try.

qiudi0127 commented 4 years ago

@XiaohangZhan Hello, I don't quite understand how recall and precision are computed in eval. Is this a standard evaluation method for clustering algorithms?

XiaohangZhan commented 4 years ago

Please refer to: https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation
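
One of the metrics on that page, the Fowlkes-Mallows score, is built on pair counting. A minimal sketch of pairwise precision and recall under that definition follows; whether the repo's eval computes exactly this is an assumption, and the function names are illustrative:

```python
from collections import Counter


def count_pairs(size):
    """Number of unordered pairs that can be drawn from a group of this size."""
    return size * (size - 1) // 2


def pairwise_precision_recall(true_labels, pred_labels):
    """Precision and recall over pairs of samples placed in the same cluster."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label lists must have the same length")
    # Pairs that share both the true class and the predicted cluster.
    same_both = sum(count_pairs(c) for c in Counter(zip(true_labels, pred_labels)).values())
    same_pred = sum(count_pairs(c) for c in Counter(pred_labels).values())
    same_true = sum(count_pairs(c) for c in Counter(true_labels).values())
    precision = same_both / same_pred if same_pred else 0.0
    recall = same_both / same_true if same_true else 0.0
    return precision, recall
```

The Fowlkes-Mallows score is the geometric mean of these two values.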

qiudi0127 commented 4 years ago

@XiaohangZhan Thanks a lot!

qiudi0127 commented 4 years ago

@XiaohangZhan Hello, how should I understand the statement that CDP has linear complexity?

XiaohangZhan commented 4 years ago

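For example, with N samples, the number of edges in the kNN graph is <= N x k, where k is a fixed value, typically 10-40, far smaller than N. All CDP operations are edge-based, including collecting edge information and classifying edges; up to this point the complexity is O(N). The subsequent propagation step mainly performs BFS, which is also O(N).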
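
To illustrate why the BFS part stays linear, here is a simplified stand-in: plain connected-component labeling over the edges that survive classification. It is not the repo's propagation code, and `propagate_labels` is an illustrative name. Each node and each edge is visited a constant number of times, so with at most N x k edges the cost is O(N) for fixed k:

```python
from collections import deque


def propagate_labels(num_nodes, edges):
    """Label connected components with BFS in O(num_nodes + len(edges))."""
    adjacency = [[] for _ in range(num_nodes)]
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    labels = [-1] * num_nodes  # -1 means "not visited yet"
    next_label = 0
    for start in range(num_nodes):
        if labels[start] != -1:
            continue
        labels[start] = next_label
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if labels[neighbor] == -1:
                    labels[neighbor] = next_label
                    queue.append(neighbor)
        next_label += 1
    return labels
```

In this sketch isolated nodes get their own label rather than being discarded.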

qiudi0127 commented 4 years ago

@XiaohangZhan Understood, thanks. So can I take it that CDP has the same complexity as k-means?

XiaohangZhan commented 4 years ago

Uh... k-means is not O(n)...

XiaohangZhan commented 4 years ago

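> @XiaohangZhan I may not understand complexity well. The complexity of k-means is O(tKmn), which is also linear. Can't it be simplified to O(n)?

The number of iterations and the number of centroids generally depend on the number of samples n. Suppose that with 100 points you use 10 centroids and 5 iterations. With 1 million points, the number of centroids can't still be 10 and the number of iterations still 5, right?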
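
A quick arithmetic check of that point, taking m in O(tKmn) to be the feature dimension and assuming, purely for illustration, that the number of centroids grows in proportion to n while the iteration count stays fixed:

```python
def kmeans_cost(n, num_centroids, iterations, dim):
    """Rough work estimate for k-means: t * K * m * n."""
    return iterations * num_centroids * dim * n


DIM = 256  # assumed feature dimension, not taken from the thread

small = kmeans_cost(100, 10, 5, DIM)
# Same K for 1 million points: cost grows by the same factor as n.
fixed_k = kmeans_cost(1_000_000, 10, 5, DIM)
# K scaled with n (10 centroids per 100 points): cost grows quadratically.
scaled_k = kmeans_cost(1_000_000, 100_000, 5, DIM)

print(fixed_k // small)   # growth factor with K held constant
print(scaled_k // small)  # growth factor with K proportional to n
```

With K held constant the cost grows by the same factor as n, but once K follows n the growth is the square of that factor, so the bound is no longer O(n).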

XiaohangZhan / cdp

About reproducing `k15_accept0_th0.66` by simple_api.py #12