aappleby / smhasher

Automatically exported from code.google.com/p/smhasher

Clarify the units in MH3 wiki performance section #40

Open sarahgerweck opened 7 years ago

sarahgerweck commented 7 years ago

The Wiki page indicates that the performance for MurmurHash3_x64_128 is 5058 mb/sec. This is less than one byte per second. In units, case matters. The prefix m is 10^-3 and a lowercase b is bits.

I'm not just being pedantic: I don't know what those numbers actually mean. Is that supposed to be megabytes (MB = 8 × 10^6 bits) or megabits (Mb = 10^6 bits)? Both are pretty common when talking about streams. Or are those actually mebibytes (MiB = 8 × 2^20 bits) or mebibits (Mib = 2^20 bits)?

I would guess that the correct number is actually 5058 MB/s, 5058 Mb/s or 5058 MiB/s.
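To make the ambiguity concrete, here is a small sketch that converts the wiki's figure of 5058 into bits per second under each reading of the unit. The table and function names are made up for illustration; only the unit definitions come from the text above.

```python
# Size of one unit in bits, for each possible reading of "mb".
BITS_PER_BYTE = 8

UNIT_IN_BITS = {
    "mb": 1e-3,                    # millibit: the literal reading
    "Mb": 10**6,                   # megabit
    "MB": BITS_PER_BYTE * 10**6,   # megabyte
    "Mib": 2**20,                  # mebibit
    "MiB": BITS_PER_BYTE * 2**20,  # mebibyte
}


def bits_per_second(value, unit):
    """Convert `value` <unit>/s into bits per second."""
    return value * UNIT_IN_BITS[unit]


if __name__ == "__main__":
    for unit in UNIT_IN_BITS:
        bps = bits_per_second(5058, unit)
        print(f"5058 {unit}/s = {bps:,.3f} bit/s = {bps / BITS_PER_BYTE:,.3f} byte/s")
```

Read literally, the figure comes out at about 5 bits per second, while the MB and MiB readings differ from the Mb reading by roughly a factor of eight.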

There are some other places where the units are slightly off but still clear, but it's important to differentiate upper- and lowercase with M/m and B/b, and it's important to use the IEEE prefixes Ki, Mi, Gi if you're using binary units instead of decimal. I would be happy to make a pull request that corrects the units if you let me know which ones they actually are. 😄

lemire commented 7 years ago

After years of trying to submit research papers with the IEEE prefixes and being told by editors to revert them... and I do mean years... I think that there are very good reasons for not using them even when they are appropriate. In many communities, they are simply not accepted.

sarahgerweck commented 7 years ago

@lemire This is a bit of a tangent, unless you're suggesting that this document should keep using mb/sec.

If you want to chat, I'm not arguing for being inflexibly dogmatic about this, but every standards body out there says that the SI prefixes should never be used in any way other than their standard base-10 definition. It's not just IEEE: ANSI, NIST, ISO, IEC & BIPM all agree on this point and forbid the use of base-2 unit prefixes when not rendered in the IEEE style.

Personally, if someone asked me not to use the IEEE units, I would propose to switch to the SI units rather than using IEEE units with SI prefixes. (E.g., use 10^6 instead of 2^20.) In a document like this one, base-10 units are just as defensible as base-2 units since network speeds are generally rendered in base-10. If some publication insists that its authors use the very troublesome base-2 units, I would add a footnote. (It's probably a best practice to include a footnote anyway unless you're using the IEEE prefixes, just to avoid any possibility of confusion.)
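As a rough illustration of switching from base-2 to base-10 units, the sketch below restates a rate given in MiB/s as MB/s. It assumes, purely for the sake of the example, that the wiki's 5058 is in MiB/s; the function name is hypothetical.

```python
MEBIBYTE = 2**20  # bytes in one MiB (binary prefix)
MEGABYTE = 10**6  # bytes in one MB (SI prefix)


def mib_to_mb(rate_mib_per_s):
    """Restate a rate in MiB/s as the equivalent rate in MB/s."""
    return rate_mib_per_s * MEBIBYTE / MEGABYTE


if __name__ == "__main__":
    # Assumption for illustration only: the figure is in MiB/s.
    print(f"5058 MiB/s = {mib_to_mb(5058):.1f} MB/s")
```

The two prefixes differ by about 4.9% at this scale, which is small next to the factor of eight between bits and bytes but still worth a footnote.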

lemire commented 7 years ago

@sarahgerweck It is a tongue-in-cheek tangent, yes.

> every standards body out there says that the SI prefixes should never be used in any way other than their standard base-10 definition. It's not just IEEE: ANSI, NIST, ISO, IEC & BIPM all agree on this point and forbid the use of base-2 unit prefixes when not rendered in the IEEE style.

And I happen to agree.

What I am saying, in all seriousness, is that some people do disagree and consider such distinctions pedantic nonsense.

I disagree with them and agree with you.

sarahgerweck commented 7 years ago

@lemire I'm glad we're on the same page. 😄

My view is always to do your best to educate people about the right way, and to make the right way your starting point, but it's not worth stressing out about. I probably wouldn't have opened an issue at all if the units were clear (or even if it was only a question of base-2 units vs base-10 units). The bits vs bytes ambiguity is more troubling to me here. (My guess is that these numbers are MiB/s, but I wouldn't be at all surprised if they are actually Mb/s.)

lemire commented 7 years ago

@sarahgerweck

I would argue that a better measure is the number of bytes processed per CPU cycle (or, inversely, the number of CPU cycles per byte)...
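A minimal sketch of that conversion, assuming the throughput is known in bytes per second and the clock frequency is fixed. The 3.0 GHz clock below is a made-up value for illustration, not the machine behind the wiki numbers, and both unit readings of 5058 are shown because the thread has not settled which applies.

```python
def bytes_per_cycle(bytes_per_second, clock_hz):
    """Throughput normalized by clock frequency."""
    return bytes_per_second / clock_hz


def cycles_per_byte(bytes_per_second, clock_hz):
    """Inverse measure: CPU cycles spent per byte processed."""
    return clock_hz / bytes_per_second


if __name__ == "__main__":
    ASSUMED_CLOCK_HZ = 3.0e9  # hypothetical 3.0 GHz CPU, not from the wiki

    for unit, unit_bytes in (("MB/s", 10**6), ("MiB/s", 2**20)):
        rate = 5058 * unit_bytes  # bytes per second under this reading
        print(
            f"5058 {unit}: "
            f"{bytes_per_cycle(rate, ASSUMED_CLOCK_HZ):.3f} bytes/cycle, "
            f"{cycles_per_byte(rate, ASSUMED_CLOCK_HZ):.3f} cycles/byte"
        )
```

Normalizing by clock frequency makes numbers from different machines easier to compare, although it still depends on the microarchitecture.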

E.g., see https://arxiv.org/pdf/1609.09840 https://arxiv.org/pdf/1503.03465 ...

Tongue-in-cheek: http://lemire.me/blog/2012/07/03/bytes-or-octets/