Daninet / hash-wasm

Lightning fast hash functions using hand-tuned WebAssembly binaries
https://npmjs.com/package/hash-wasm

Performance issue #1

Closed: aleksey-hoffman closed this issue 4 years ago

aleksey-hoffman commented 4 years ago

Hello, thank you for creating this module. I'm using it in an Electron app.

I'm trying to figure out why hash-wasm is not performing as expected; it's quite slow at hashing files.

I hashed a 2.5GB .zip file and got unexpected results: hash-wasm with the xxhash64 algorithm performed almost exactly the same as the Node.js crypto module with the md5 algorithm.

I then also hashed a 300MB zip archive located on an SSD (500MB/s reads) just to make sure the drive was not the bottleneck.

I also hashed a 5MB image (also located on the SSD) and got similar results.

async getFileHash (path) {
    // Method 1: Nodejs crypto module md5
    // return new Promise(resolve => {
    //  const hash = crypto.createHash('md5')
    //  fs.createReadStream(path)
    //    .on('data', data => hash.update(data))
    //    .on('end', () => resolve(hash.digest('hex')))
    // })

    // Method 2: hash-wasm xxhash64
    const xxhash64 = await createXXHash64()
    return new Promise((resolve, reject) => {
      xxhash64.init()
      fs.createReadStream(path)
        .on('data', data => xxhash64.update(data))
        .on('end', () => resolve(xxhash64.digest('hex')))
    })
}

console.time('TIME | hash')
await getFileHash(path)
console.timeEnd('TIME | hash')

Do you know what might be causing the problem? Drive speed? Node's streams? I don't get why the results are so similar. Isn't hash-wasm's xxhash64 supposed to be something like 5 times faster, especially with big files?

I tried changing the highWaterMark option to read data in 8MB chunks and maximize drive usage, thinking that the file stream might be the bottleneck here, but it didn't help in this situation; if anything, the time went up from 25s to 27s:

fs.createReadStream(path, { highWaterMark: 8 * 1024 * 1024 })

(I tried changing this option since it helped in another, unrelated case where I used readStream().pipe(writeStream).)

Electron is not the problem here since I'm seeing the same results when I run the code from a terminal (node v13.5.0).

Daninet commented 4 years ago

Are you sure that your SSD reads consistently at 500MB/s? It can also happen that a process (like an antivirus) is interfering with disk I/O performance. Measure reading that file with Node.js without running any hash function, as sketched below.
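A minimal sketch of such a measurement (the measureRead helper name is illustrative, not part of hash-wasm; '../file' matches the path used in the benchmark code below):

const fs = require('fs');

// Stream the file without hashing to measure raw read throughput.
function measureRead(path) {
  return new Promise((resolve, reject) => {
    let bytes = 0;
    const start = process.hrtime.bigint();
    fs.createReadStream(path)
      .on('data', chunk => { bytes += chunk.length; })
      .on('error', reject)
      .on('end', () => {
        const seconds = Number(process.hrtime.bigint() - start) / 1e9;
        resolve(bytes / seconds / 1024 / 1024); // MB/s
      });
  });
}

measureRead('../file').then(mbps => console.log(`~${Math.round(mbps)} MB/s`));

If this number is well below the drive's rated speed, the stream (or something interfering with it) is the bottleneck rather than the hash function.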

I just ran a test on my computer with a 4GB file (NVMe SSD + Node.js v12.16.1 + Windows 10.0.18362 + i7-7700K CPU):

- 3540ms using hash-wasm with the xxhash64 algorithm (~1157 MB/s)
- 7842ms using the Node.js crypto module with the md5 algorithm (~522.44 MB/s)

My source code:

const { createXXHash64 } = require('hash-wasm');
const fs = require('fs');
const crypto = require('crypto');

async function getFileHash (path) {
  // Method 1: Nodejs crypto module md5
  // return new Promise(resolve => {
  //  const hash = crypto.createHash('md5');
  //  fs.createReadStream(path)
  //    .on('data', data => hash.update(data))
  //    .on('end', () => resolve(hash.digest('hex')))
  // })

  // Method 2: hash-wasm xxhash64
  const xxhash64 = await createXXHash64()
  return new Promise((resolve, reject) => {
    xxhash64.init()
    fs.createReadStream(path)
      .on('data', data => xxhash64.update(data))
      .on('end', () => resolve(xxhash64.digest('hex')))
  })
}

async function run() {
  console.time('TIME | hash')
  console.log(await getFileHash('../file'));
  console.timeEnd('TIME | hash')
}

run();

aleksey-hoffman commented 4 years ago

@Daninet I've done some more testing on Ubuntu and on Windows 10 with "Windows Defender real-time protection" turned off, and it seems like I/O interference from the system was rarely playing a role. Mostly, the speed difference between hashing functions was noticeable when hashing big files located on an SSD. I didn't see any difference between crypto md5 and hash-wasm xxhash64 for small image files or for files located on an HDD, though. It seems like the read stream speed might be the limiting factor here.

What I still don't understand is why the results are so similar for small files and for files located on an HDD. If totalTime (ms) = streamRead (ms) + hashing (ms), why was I consistently getting similar times for both hashing functions when one algorithm is faster than the other and streamRead should take the same amount of time, given that it's the same file on the same HDD? Perhaps for some files the hashing speed is the same for both algorithms, or it's just limited by the readStream somehow.

I'm gonna close the issue since I'm not sure where to go from here.

Daninet commented 4 years ago

The hashing speed should be constant, regardless of the file contents. On modern computers, the CPU can be assumed to work in parallel with the disk I/O, so the total time can be approximated as:

totalTime = hashInitTime (1-5ms) + max(streamReadTime, hashCalculationTime)
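A rough illustration using the numbers from this thread: streaming a 4GB file from a 500 MB/s SSD takes about 8s, so streamReadTime ≈ 8s. xxhash64 at ~1157 MB/s needs only ~3.5s of hashCalculationTime, giving totalTime ≈ max(8s, 3.5s) ≈ 8s; md5 at ~522 MB/s needs ~7.8s, giving totalTime ≈ max(8s, 7.8s) ≈ 8s as well. Once the disk is the bottleneck, a faster hash cannot reduce the wall-clock time, which is why the two algorithms looked identical.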