facebookresearch / StarSpace

Learning embeddings for classification, retrieval and ranking.
MIT License

Random seed initialization #185

Closed fedorn closed 5 years ago

fedorn commented 6 years ago

Hello! Is it possible to add the ability to initialize random seed for better reproducibility of results obtained with StarSpace?

jwijffels commented 6 years ago

+1 for having such a feature!

ledw commented 6 years ago

@fedorn Thanks. We'll consider adding that.

ledw commented 5 years ago

@fedorn @jwijffels We can add the ability to set the random seed. However, you would need to set -thread 1 to get reproducible results, because we use Hogwild in training and it is not thread-safe. Do you think the feature would still be useful if it only works with one thread?

fedorn commented 5 years ago

@ledw No, that probably wouldn't be useful for me. Thank you for looking into it.

jwijffels commented 5 years ago

No, not for me either.

On the same subject, I am trying to replace all calls to rand/srand/random_shuffle in the R package with their R variants, as that is required to get the R package on CRAN. After making these replacements and fixing the seed, the embeddings are reproducible across runs with the same starting seed if I set thread=1 or thread=2. In contrast, with more than 2 threads, the embeddings differ between runs with the same starting seed. It looks like I'm running into the Hogwild issue you mentioned. Before I push this code to the R package repository, I would like to understand why this is the case. What would be required to make Hogwild thread-safe? Could you point me to where in the C++ code this happens, so that I can perhaps fix it and make it thread-safe at least in the R package? If you want me to open a new ticket for this so that you can close this one, let me know.
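For readers following along, the kind of replacement described above can be sketched in plain C++ as follows. This is only an illustration under assumptions: `shuffled_indices` is a hypothetical helper, not a StarSpace or R package function, and it uses a seeded `std::mt19937` with `std::shuffle` rather than the R random number generator that CRAN actually requires.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper (not part of StarSpace): returns the indices
// 0..n-1 in a shuffled order that depends only on `seed`.
std::vector<int> shuffled_indices(int n, std::uint32_t seed) {
  assert(n >= 0);
  std::vector<int> idx(n);
  std::iota(idx.begin(), idx.end(), 0);  // fill with 0, 1, ..., n-1

  // An explicit, locally owned engine replaces the global srand()/rand()
  // state, so the result no longer depends on who else called rand().
  std::mt19937 rng(seed);

  // std::shuffle with an explicit engine replaces std::random_shuffle.
  std::shuffle(idx.begin(), idx.end(), rng);
  return idx;
}
```

Calling `shuffled_indices(10, 42)` twice in the same build returns the same order both times. The exact permutation may differ between standard library implementations, so this gives run-to-run reproducibility rather than cross-platform reproducibility.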

ledw commented 5 years ago

@jwijffels: I'm afraid this is something we can't fix: Hogwild is a lock-free algorithm that lets each thread modify the embeddings without locks. If we implemented a thread-safe algorithm for updating the embedding tables, it would likely be much slower. I'd like to close the task for now.
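To illustrate why a fixed seed is not enough once several threads are involved, here is a minimal sketch of Hogwild-style lock-free updates. It is not StarSpace's training loop: `hogwild_run`, the 8-slot table, and the update rule are all made up for illustration. The sketch uses `std::atomic<float>` with relaxed loads and stores so that the toy program has no undefined behaviour; a real implementation may simply write plain floats.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <random>
#include <thread>
#include <vector>

// Hypothetical toy model of Hogwild training (not StarSpace code).
// Every thread updates a shared table without taking any lock.
std::vector<float> hogwild_run(int num_threads, std::uint32_t seed,
                               int steps_per_thread = 100000) {
  assert(num_threads > 0);
  const int dim = 8;  // toy stand-in for an embedding table
  std::vector<std::atomic<float>> table(dim);
  for (auto& w : table) w.store(1.0f);

  auto worker = [&](int tid) {
    // Each thread has its own seeded engine, so the sequence of slots
    // it touches is fixed for a given seed.
    std::mt19937 rng(seed + static_cast<std::uint32_t>(tid));
    std::uniform_int_distribution<int> pick(0, dim - 1);
    for (int s = 0; s < steps_per_thread; ++s) {
      const int i = pick(rng);
      // Separate load and store: another thread may write slot i in
      // between, and the new value depends on the value that was read.
      // The OS scheduler decides the interleaving, not the seed.
      const float w = table[i].load(std::memory_order_relaxed);
      table[i].store(w * 0.9999f + 0.001f * static_cast<float>(tid + 1),
                     std::memory_order_relaxed);
    }
  };

  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; ++t) threads.emplace_back(worker, t);
  for (auto& th : threads) th.join();

  std::vector<float> result;
  for (auto& w : table) result.push_back(w.load());
  return result;
}
```

With `num_threads = 1` there is no interleaving, so two calls with the same seed return identical values. With several threads, the per-thread random sequences are still fixed, but the order in which their updates land is not, so repeated calls can return different values. Guarding each update with a mutex would remove lost updates, but the result would still depend on thread scheduling unless the update order were also fixed, which is part of why a deterministic variant is expected to be much slower.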

jwijffels commented 5 years ago

OK, thank you for explaining the reason; that makes things clear.