Thanks for pointing it out. You are right. Besides, I found some other problems when testing simple_api.py, so I removed it for now. Just use python -u main.py --config experiments/emore_u200k_single/config.yaml.
Thank you for your quick response!
I guess one of the other problems may be that NearestNeighbors from sklearn is not scalable enough. I will read main.py further. Thank you for your nice code!
I updated the single model API. Please read the README and have a try.
Thank you so much for updating the single-model API! The results are reproducible and simple_api is quite flexible!
@XiaohangZhan is it possible to save the clustered images into different folders?
Assume that I saved the path of each image to the list.txt file and, in case an image contains more than one face, saved both the image path and the bounding box of each face.
For example, my list.txt can have the following format, where list.txt has the same length as the number of examples/faces:
/home/mubarak/face_clustering/koc/jpg/D0225_A0411_S07_R013.jpeg [605, 599, 922, 982]
/home/mubarak/face_clustering/koc/jpg/D0225_A0411_S07_R013.jpeg [1573, 533, 1883, 915]
For now, your main.py script only outputs a meta.txt file, which is somewhat difficult to interpret.
Thanks
Hi, meta.txt contains the clustering results in the form of image labels. You can easily save your images into different folders according to the labels, i.e., images with the same label should go to the same folder. The examples labeled -1 are discarded. As for list.txt, the code does not read the content of this file; it only gets the length of the list. So as long as the length is correct, the content does not matter.
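For reference, a minimal sketch of the folder-saving step described above, assuming list.txt holds one image path per line aligned with the labels in meta.txt (the output directory name and the line format are my assumptions, adapt as needed):

```python
import os
import shutil

# Hypothetical file locations; adjust to your own setup.
with open('list.txt') as f:
    paths = [line.strip().split(' ', 1)[0] for line in f]  # drop a trailing bbox if present
with open('meta.txt') as f:
    labels = [int(line.strip()) for line in f]

assert len(paths) == len(labels), 'list.txt and meta.txt must have the same length'

for path, label in zip(paths, labels):
    if label == -1:  # discarded samples
        continue
    out_dir = os.path.join('clusters', str(label))
    os.makedirs(out_dir, exist_ok=True)
    shutil.copy(path, out_dir)
```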
Thanks for your reply, I already thought of this but I wanted to confirm with you.
Cheers!
Could you also clarify the situation where there is more than one face in an image? For example:
/home/mubarak/face_clustering/koc/jpg/D0225_A0411_S07_R013.jpeg [605, 599, 922, 982]
/home/mubarak/face_clustering/koc/jpg/D0225_A0411_S07_R013.jpeg [1573, 533, 1883, 915]
When there are multiple faces in an image, you need to prepare face features for each FACE, rather than each image. For example, crop the image into several face images according to the bounding boxes, and use a trained face recognition network to extract features. And again, each line of the output meta.txt corresponds to one face, which is also consistent with your list file.
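A rough sketch of the cropping step, assuming each line of list.txt is "<image_path> [x1, y1, x2, y2]" with corner coordinates (that bbox convention is my assumption; adjust it to your detector's output):

```python
import ast
import os
from PIL import Image

os.makedirs('faces', exist_ok=True)
crop_paths = []
with open('list.txt') as f:
    for i, line in enumerate(f):
        path, bbox_str = line.strip().split(' ', 1)
        x1, y1, x2, y2 = ast.literal_eval(bbox_str)          # e.g. [605, 599, 922, 982]
        face = Image.open(path).convert('RGB').crop((x1, y1, x2, y2))
        crop_path = 'faces/face_%06d.jpg' % i
        face.save(crop_path)
        crop_paths.append(crop_path)
# Feed the cropped faces to your face recognition network, one feature per face,
# in the same order as list.txt so that the meta.txt lines stay aligned.
```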
Thanks, I had thought of this as well, but was hoping for a shorter way. Anyhow, thank you very much, I appreciate your work.
Dear @XiaohangZhan,
I have successfully clustered my images into the folders they belong to. As you mentioned, the examples labeled -1 are discarded. But are the examples labeled 0 treated the same way? The images in the 0 folder are very far from each other (just like the -1 label). The rest of the labels seem to be okay.
If you want, I can share the folders with you (in case you want to check with me), but I don't know why folder/label 0 is like this.
Those labeled 0 should have been a cluster. It indicates the result is not good enough. I recommend you create a small test set to find the best hyperparameters for your data, and also try some baseline methods using run_baselines.sh.
@XiaohangZhan Do I need to retrain the mediator model (in labeled/emore_l200k/models/k15_110.pth.tar) when extracting features using my own model?
Yes, sure. The training features and testing features should be extracted using the same model.
Do I only need to set force_retrain: True to train the mediator model? Do I need to modify any other hyperparameters? @XiaohangZhan
I updated the README. See step 7 under Using your own data.
If you are using the mediator mode, please specify "train_data_name" as your data, e.g., "labeled/mydata". You may also want to adjust the "threshold" to obtain a good result on your validation set. "force_retrain" can always be False. In this way, the system first tries to find the model, e.g., "labeled/mydata/models/k15_111.pth.tar". If it does not exist, the system will train this model using the training set. Then, when you want to adjust other parameters, this trained model will be loaded rather than retrained.
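A toy illustration of the load-or-train behaviour described above (the function and argument names below are hypothetical, not the repo's actual API):

```python
import os
import torch

def get_mediator_model(train_data_name, model_name, train_fn, force_retrain=False):
    model_path = os.path.join(train_data_name, 'models', model_name)
    if not force_retrain and os.path.isfile(model_path):
        # Reuse the cached mediator so that later threshold sweeps skip retraining.
        return torch.load(model_path)
    model = train_fn()  # train the mediator on the labeled training set
    os.makedirs(os.path.dirname(model_path), exist_ok=True)
    torch.save(model, model_path)
    return model

# e.g. get_mediator_model('labeled/mydata', 'k15_111.pth.tar', train_fn=my_train_fn)
```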
@XiaohangZhan Can I use simple_api.py to process 10 million samples?
Just give it a try if your server has enough memory. Remember to pull the latest single_api.py, which uses NMSLIB for KNN computation.
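For anyone curious, a minimal NMSLIB (HNSW) KNN sketch; the index and query parameters below are illustrative defaults, not necessarily the ones used inside the API:

```python
import numpy as np
import nmslib

features = np.random.rand(100000, 256).astype(np.float32)  # stand-in for your face features

index = nmslib.init(method='hnsw', space='cosinesimil')
index.addDataPointBatch(features)
index.createIndex({'M': 16, 'efConstruction': 200}, print_progress=True)
index.setQueryTimeParams({'efSearch': 100})

k = 15
# Query every point against the index; the query point itself is typically
# returned as the first neighbour, so ask for k + 1 and drop it.
results = index.knnQueryBatch(features, k=k + 1, num_threads=8)
ids, dists = results[0]
print(ids[1:], dists[1:])  # the k nearest neighbours of sample 0
```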
@XiaohangZhan Hi, if I run into a memory error when retraining the mediator model, is there a better solution besides PCA reduction or reducing the number of pairs? My server's memory cannot be increased. Thanks.
You may split your data into several batches so that each batch can fit in your memory. However, it will degrade the recall. You can try it and see whether that matters.
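A rough illustration of the batching idea, assuming all features sit in one numpy array; run_cdp_on_batch is a placeholder for whatever clustering call you use, and -1 is kept as the "discarded" label when merging:

```python
import numpy as np

features = np.load('features.npy')   # all face features
num_batches = 4                      # choose so that each batch fits in memory

all_labels = []
label_offset = 0
for batch in np.array_split(features, num_batches):
    labels = run_cdp_on_batch(batch)                            # placeholder clustering call
    labels = np.where(labels >= 0, labels + label_offset, -1)   # shift labels so batches don't collide
    label_offset = max(label_offset, labels.max() + 1)
    all_labels.append(labels)

labels = np.concatenate(all_labels)  # note: identities split across batches will not be merged
```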
@XiaohangZhan Thanks for your reply, I'll have a try.
@XiaohangZhan Hi, I don't quite understand how the recall and precision in the eval step are computed. Is this a standard evaluation method for clustering algorithms?
Please refer to: https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation
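As a concrete example, pairwise precision/recall is one common clustering metric (it may not be the exact definition this repo's eval script uses); a small sketch with made-up labels:

```python
from sklearn.metrics.cluster import pair_confusion_matrix

gt_labels   = [0, 0, 0, 1, 1, 2]   # ground-truth identities (toy example)
pred_labels = [0, 0, 1, 1, 1, 2]   # predicted cluster labels

# pair_confusion_matrix counts sample pairs: [[TN, FP], [FN, TP]].
(tn, fp), (fn, tp) = pair_confusion_matrix(gt_labels, pred_labels)
precision = tp / (tp + fp)   # of the pairs predicted to be together, how many truly are
recall = tp / (tp + fn)      # of the pairs that are truly together, how many were recovered
print(precision, recall)
```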
@XiaohangZhan Thanks a lot!
@XiaohangZhan Hello, how should I understand the claim that CDP has linear complexity?
For example, with N samples, the number of edges in the kNN graph is at most N×k, where k is a fixed value, usually 10-40, far smaller than N. All CDP operations are edge-based, including collecting edge information and classifying edges, so up to this point the complexity is O(N). The subsequent propagation step mainly performs BFS, which is also O(N).
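To make the last point concrete, here is a toy BFS over a kNN edge list that labels connected components; every node and edge is visited a constant number of times, so with at most N×k edges the cost is O(N) for fixed k. This only illustrates the complexity argument, it is not the repo's propagation code (which also uses edge scores and thresholds):

```python
from collections import defaultdict, deque

def connected_components(num_nodes, edges):
    # edges: list of (i, j) pairs kept after edge classification (at most N * k of them).
    adj = defaultdict(list)
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)

    labels = [-1] * num_nodes
    current = 0
    for start in range(num_nodes):
        if labels[start] != -1:
            continue
        labels[start] = current
        queue = deque([start])
        while queue:                 # BFS: each node/edge is processed once
            u = queue.popleft()
            for v in adj[u]:
                if labels[v] == -1:
                    labels[v] = current
                    queue.append(v)
        current += 1
    return labels
```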
@XiaohangZhan Got it, thanks. So can I take it that the complexity of CDP is the same as that of k-means?
Er... k-means is not O(n)...
@XiaohangZhan Maybe my understanding of complexity is poor. The complexity of k-means is O(tKmn), which is also linear; can't that be simplified to O(n)?
The number of iterations and the number of centroids usually depend on the number of samples n. Suppose 100 points need 10 centroids and 5 iterations; with 1 million points, you surely cannot still use 10 centroids and 5 iterations, right?
Thank you for your great paper and repo! I think that to reproduce the results of the k15_accept0_th0.66 model, we may not need to normalize the distance as shown here.