BIDData / BIDMat

A CPU and GPU-accelerated matrix library for data mining
BSD 3-Clause "New" or "Revised" License
265 stars 73 forks source link

hashing in BIDMat #39

Closed huitseeker closed 8 years ago

huitseeker commented 9 years ago

Grepping for hash instances in the project:

huitseeker@jollyjumper:~/tmp/BIDMat(master)$ ag Murmur -G'.*\.scala'
src/main/scala/BIDMat/DMat.scala
6:import scala.util.hashing.MurmurHash3
51:     out.setGUID(MurmurHash3.mix(MurmurHash3.mix(nr, nc), (GUID*3145341).toInt));

src/main/scala/BIDMat/FMat.scala
7:import scala.util.hashing.MurmurHash3
49:     out.setGUID(MurmurHash3.mix(MurmurHash3.mix(nr, nc), (GUID*3145341).toInt));

src/main/scala/BIDMat/GDMat.scala
16:import scala.util.hashing.MurmurHash3
46:     out.setGUID(MurmurHash3.mix(MurmurHash3.mix(nr, nc), (GUID*3145341).toInt));

src/main/scala/BIDMat/GIMat.scala
13:import scala.util.hashing.MurmurHash3
45:     out.setGUID(MurmurHash3.mix(MurmurHash3.mix(nr, nc), (GUID*3145341).toInt));

src/main/scala/BIDMat/GLMat.scala
10:import scala.util.hashing.MurmurHash3
49:     out.setGUID(MurmurHash3.mix(MurmurHash3.mix(nr, nc), (GUID*3145341).toInt));

src/main/scala/BIDMat/GMat.scala
13:import scala.util.hashing.MurmurHash3
43:     out.setGUID(MurmurHash3.mix(MurmurHash3.mix(nr, nc), (GUID*3145341).toInt));

src/main/scala/BIDMat/IMat.scala
5:import scala.util.hashing.MurmurHash3
42:     out.setGUID(MurmurHash3.mix(MurmurHash3.mix(nr, nc), (GUID*3145341).toInt));

src/main/scala/BIDMat/LMat.scala
5:import scala.util.hashing.MurmurHash3
42:     out.setGUID(MurmurHash3.mix(MurmurHash3.mix(nr, nc), (GUID*3145341).toInt));

src/main/scala/BIDMat/ND.scala
11:import edu.berkeley.bid.MurmurHash3
199:    MurmurHash3.MurmurHash3_x64_64(inds.map(_.GUID), 0x3142341)
203:    MurmurHash3.MurmurHash3_x64_64(inds.map(_.toLong), 0x3142341)
huitseeker@jollyjumper:~/tmp/BIDMat(master)$

I've found a lot of hashes on 32-bit Ints. Have you considered replacing Murmur3 with the faster xxHash ? https://github.com/Cyan4973/xxHash (as for the one usage in its 64-bit version, note the existence of XXH64)

jcanny commented 9 years ago

THose hashes are used to build GUIDs for matrices which are being cached. So there is always a matrix operation behind them which should take a lot more time than the hash itself. But if you have some evidence that its slowing things down let us know. Minimizing the collision probability seems like the most important thing.