gesistsa / sweater

👚 Speedy Word Embedding Association Test & Extras using R
GNU General Public License v3.0
27 stars 4 forks source link

speed claims? #14

Closed cmaimone closed 2 years ago

cmaimone commented 2 years ago

I'm not completely clear on what the speed claims are here. What is sweater faster than exactly?

I think benchmark.md is showing that implementing one method with rcpp is faster than implementing the same method in pure R or R running C code?

That doesn't seem sufficient for a speed claim in general, especially with that in the package name.

It's very possible I'm missing something though.

I'll note that, other than what's implied by the package name, the paper doesn't make specific speed/performance claims. Dropping the last sentence of the second paragraph of the paper would, I think, remove all reference to speed. If you don't want to go into full comparisons and benchmarking, it may just be enough to talk about how long sweater takes to do common things, and refrain from saying whether that's fast or slow. Having that information -- expected run times -- is very useful just by itself.

re: https://github.com/openjournals/joss-reviews/issues/4036

cmaimone commented 2 years ago

Documentation (readme.md) says: "or are speedy but accurate approximation of the original implementation proposed by Caliskan et al (2017)"

How are you determining that the approximations are accurate? Can you quantify the accuracy in any way?

chainsawriot commented 2 years ago

@cmaimone Regarding the speed claims, in openjournals/joss-reviews#3938 @kthyng also perceived the speed claims of this package as dubious. And thus I added a benchmark.

I still believe this package provides heavily optimized solutions, especially compared to the existing solutions out there. The comparison with native and compiled R looks artificial. How about the comparison between the equivalent Python's offering? I've updated the benchmark (in the paper branch) and compare sweater with wefe. sweater is >10x faster (analyzing 5GB of data, 1 min vs 12.5 mins), thanks to the heavily optimized read_word2vec.

https://github.com/chainsawriot/sweater/blob/paper/paper/benchmark.md

(I am also trying to add the original Java code by Caliskan. But I need some time.)

Having said that, I know that all benchmarks can lead to controversies. The "speedy" part of the name was chosen to indicate that the package has done some optimization in a low-level language (C++). I don't know how much optimization should be done to qualify this package as "speedy". And as things turn out, the "speedy" name is a mistake. I don't mind dropping all speed claims. I also don't mind renaming the long form of the package name too. I could rename the package to a recursive acronym, eg. "sweater: word embedding association test & extras".

Regarding the accuracy, the original exact test in Caliskan is not practical in real life applications, e.g. when the word set length is large than 10. It took an unrealistic amount of time to calculate even after C++ optimization (I can never finish one, but my estimation is a few day). I will add a vignette to explain how accurate it is.

TODOS

cmaimone commented 2 years ago

It's up to you which direction you go (more benchmarking vs. name change). I don't have strong opinions here (on direction or what's sufficient) -- mostly just going through the JOSS checklist, and addressing the performance claims point. Maybe @kbenoit (other reviewer) has some thoughts on this point?

chainsawriot commented 2 years ago

A Vignette has been added to explain the resampling test.

https://github.com/chainsawriot/sweater/blob/master/vignettes/resampling.Rmd