ga4gh / vrs

Extensible specification for representing and uniquely identifying biological sequence variation
https://vrs.ga4gh.org
Apache License 2.0

Consider change to hashing algorithm for digest #460

Closed: tnavatar closed this issue 9 months ago

tnavatar commented 9 months ago

The main driver of performance when creating VRS entities is the speed of the digest algorithm. There are two main contributors to this: one is the speed of rendering the entity as canonical JSON, per the spec; the other is the speed of the hash algorithm itself. The first can be significantly mitigated by creating a string template for each digestible VRS entity (thanks to Reece Hart for this suggestion). The second is determined by the choice of hash algorithm. A cryptographically secure hash (SHA-512) was chosen as the basis for prior versions of VRS. Speed is not particularly important for cryptographic hashes; such algorithms are often intentionally slow (see some of the key derivation functions built on the SHA family of hashes). As a consequence, SHA-512 is slower than many non-cryptographic hashes by more than an order of magnitude.
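For concreteness, here is a rough sketch of the two stages, assuming the GA4GH sha512t24u truncated digest (base64url of the first 24 bytes of SHA-512). The dict and field handling below are simplified illustrations, not the exact spec serialization:

```python
import base64
import hashlib
import json

def sha512t24u(blob: bytes) -> str:
    """GA4GH truncated digest: base64url of the first 24 bytes of SHA-512."""
    return base64.urlsafe_b64encode(hashlib.sha512(blob).digest()[:24]).decode("ascii")

# Stage 1: render the entity as canonical JSON (sorted keys, no whitespace).
# A real implementation also applies spec rules about which fields are digested;
# this dict is only an illustrative stand-in for a real VRS object.
entity = {
    "type": "Allele",
    "location": "ga4gh:SL.placeholder",
    "state": {"type": "LiteralSequenceExpression", "sequence": "T"},
}
canonical = json.dumps(entity, sort_keys=True, separators=(",", ":")).encode("utf-8")

# Stage 2: hash the canonical serialization.
print(sha512t24u(canonical))
```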

The collision-avoidance property of cryptographic hashes is important, but non-cryptographic hashes can also offer acceptable performance in this regard. In creating the digest for the original version of VRS, it was considered acceptable to sacrifice some collision resistance in order to shorten the digest. Faster, non-cryptographic hashes offer similar width and (presumably) similar resistance to collisions. There are many candidate algorithms that are both fast and adequately resistant to collisions.

I'm proposing replacing the hash algorithm for digests in VRS 2.0 with something much faster than SHA-512. The xxhash algorithm linked earlier would likely be a reasonable choice, though it would be worth having a discussion on alternatives. The main selection criteria I see are:

If this group can think of other important criteria, it would be great to hear them. As it stands, I don't hear concerns about a cryptographic attack leveraging VRS objects, but I do hear a lot of concerns about the performance of VRS when digesting large datasets, so I think this is a change worth considering.
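As a rough illustration of the kind of comparison I have in mind (not a definitive benchmark), something like the following could time SHA-512 against one xxhash variant on a VRS-sized payload; it assumes the third-party xxhash package from PyPI and uses xxh3_128 as one possible candidate:

```python
import hashlib
import timeit

import xxhash  # third-party: pip install xxhash

# A roughly VRS-sized payload (a dumped SNP Allele is a few hundred characters).
payload = b'{"location":"...","state":{"sequence":"T","type":"LiteralSequenceExpression"},"type":"Allele"}' * 5

n = 100_000
t_sha = timeit.timeit(lambda: hashlib.sha512(payload).digest(), number=n)
t_xxh = timeit.timeit(lambda: xxhash.xxh3_128(payload).digest(), number=n)

print(f"sha512:   {t_sha:.3f}s for {n} digests")
print(f"xxh3_128: {t_xxh:.3f}s for {n} digests")
```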

ehclark commented 9 months ago

Based on the profiling data associated with this ticket https://github.com/ga4gh/vrs-python/issues/305, speeding up serialization of the JSON object would have a significant impact; changing the hash algorithm probably would not. Here is a brief summary of my review of the profiling data. (But please review it yourself to verify.)

Starting from vrs_python method translate_vcf_row:

So speeding up the serialization could be significant. The model_dump portion of the serialization (which I think converts the model to a JSON string) is about 8% of the total time (25% of the serialization time). The remainder of the serialization process appears to be working with the model itself.

I personally don't have a strong opinion about which hash to use, though I would agree that faster can't hurt. That said, I would choose a slower hash that is widely supported over a faster one that is not. In my limited testing, xxhash is about 60% faster than sha512, which would yield only about a 1% increase in overall VRS computation speed.
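For the record, a back-of-the-envelope version of that estimate (the ~3% hashing share below is an assumed illustrative figure, not taken from the profile):

```python
# Amdahl-style estimate: if hashing is a small fraction of total runtime,
# even a much faster hash moves the overall number very little.
hash_fraction = 0.03   # assumed share of total time spent hashing (illustrative)
hash_speedup = 1.6     # xxhash ~60% faster than sha512 in the test above

new_total = (1 - hash_fraction) + hash_fraction / hash_speedup
print(f"overall time: {new_total:.4f} of original "
      f"(~{(1 - new_total) * 100:.1f}% faster)")  # ~1.1% faster overall
```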

tnavatar commented 9 months ago

> Based on the profiling data associated with this ticket https://github.com/ga4gh/vrs-python/issues/305, speeding up serialization of the JSON object would have a significant impact; changing the hash algorithm probably would not. Here is a brief summary of my review of the profiling data. (But please review it yourself to verify.)

Thanks for the reference to your profiling data; it's definitely good for grounding this discussion.

> That said, I would choose a slower hash that is widely supported over a faster one that is not.

This seems like the crux of the tradeoff to me. SHA-512 is generally present in the standard library for most languages and environments, while xxhash has plenty of libraries available but isn't 'standard' in the way SHA-512 is. That said, it is essential to lz4 compression, which seems to be becoming the standard 'fast' compression algorithm, so I don't think xxhash is going to become any less significant going forward.

I do think there is a distinction between the performance characteristics of an implementation that uses VRS and performance limitations inherent in the spec itself. It doesn't take much imagination to see how allele normalization could be made faster, especially when working with large batches of variants, or how JSON serialization could be made faster (for example, by structuring VRS objects into a string template rather than relying on a general JSON implementation, or simply by using a faster JSON library; see the sketch at the end of this comment). But VRS can only ever be as fast as the fastest implementation of the hash algorithm defined in the spec.

That may be fast enough, and the change may not be worth the tradeoff of adopting a less accessible algorithm. I think vrs-python is important as the reference implementation of VRS; in that role it needs to be simple to read and understand, and optimizing it for speed at the expense of readability would be a mistake. One imagines that more performant implementations would be created for different use cases should this project be successful, and in that case the limitations of the spec itself would become more significant.
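To illustrate the string-template idea concretely (a sketch only; the field set is simplified, the serialization rules are not the exact spec digest serialization, and `location_digest` / `sequence` are just placeholder names):

```python
import base64
import hashlib

# Pre-built digest-serialization template for an Allele-like object: keys are
# already sorted and whitespace-free, so producing the canonical string is just
# interpolation rather than a general-purpose JSON dump of the whole model.
ALLELE_TEMPLATE = (
    '{{"location":"{location_digest}",'
    '"state":{{"sequence":"{sequence}","type":"LiteralSequenceExpression"}},'
    '"type":"Allele"}}'
)

def allele_digest(location_digest: str, sequence: str) -> str:
    blob = ALLELE_TEMPLATE.format(
        location_digest=location_digest, sequence=sequence
    ).encode("utf-8")
    return base64.urlsafe_b64encode(hashlib.sha512(blob).digest()[:24]).decode("ascii")

print(allele_digest("SL.placeholder", "T"))
```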

andreasprlic commented 9 months ago

I agree with @ehclark; in a different profiling experiment I also noticed that most of the time is lost in JSON space. If our goal is to speed up vrs-python, it feels as if we should focus on that.

jsstevenson commented 9 months ago

To echo the above: for kicks, I ran a few experiments, and I think the overall cost is more or less fully dominated by serialization until you start trying to hash very large (i.e., 10^6+ character) objects, which is fairly atypical for VRS (your typical SNP is around 500 characters when dumped).

https://github.com/jsstevenson/vrs-python-xxh/blob/main/xxh_experiments.ipynb
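For anyone who wants a quick version of the size-scaling check without opening the notebook, a minimal sketch along these lines (illustrative only; numbers will vary by machine):

```python
import hashlib
import timeit

# Hash cost vs. input size: a dumped SNP Allele is ~500 characters, so the
# per-object hash time stays tiny until inputs get very large.
for size in (500, 10_000, 1_000_000):
    blob = b"A" * size
    t = timeit.timeit(lambda: hashlib.sha512(blob).digest(), number=1_000)
    print(f"{size:>9} bytes: {t * 1000:.2f} ms per 1000 digests")
```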

larrybabb commented 9 months ago

Another consideration that just occurred to me: refGet adopted our digest algorithm to standardize it across groups. If we changed it, I think we would have to figure out how to work with them on changing as well.

tnavatar commented 9 months ago

Interesting charts, @jsstevenson. For VRS-sized inputs, my own experiments just now (in Java, not Python, but still) came out closer to @ehclark's ~60% speed advantage for xxhash over SHA-512. Not nothing, but not the order-of-magnitude difference seen in some of the other benchmarks.

Given the data presented, I might pull back from suggesting this change. To @andreasprlic's point, this is not about vrs-python alone, and the change might matter more for other implementations with other performance characteristics, but based on these numbers it's hard to see the hash becoming a dominant performance factor in any implementation. It's also easier to be confident in SHA-512 remaining stable and accessible for the foreseeable future, which becomes a more significant consideration when the change would not be an order-of-magnitude improvement in a critical path of the code.

reece commented 9 months ago

This comment just acknowledges that I'm watching with popcorn from the sidelines and enjoying all of the smart conversation.

The tradeoff between flexibility of data structure and speed is felt most acutely at serialization. It's a tough problem, and it seems like you all are doing the right thing to focus on the biggest opportunity (json serialization) first.

Good luck!

Mrinal-Thomas-Epic commented 9 months ago

If we're focusing on improving the performance of serialization rather than hashing, should we consider other approaches to serialization (e.g., protobuf)?

tnavatar commented 9 months ago

Glad for the conversation on this, and thanks to everyone for their contributions. I feel like the performance of serialization is a vrs-python issue rather than a spec issue, so I might close this ticket if there are no objections (or if nobody else wants to champion this).

Unless we're considering abandoning canonical JSON as the basis for the hash, but that would be a whole other kettle of fish.