Closed ikmckenz closed 1 year ago
I've have tested this in the horde worker, it works as expected.
It is backwards compatible with the current md5 implementation and by default changes nothing. Only once the model database supports the attribute "sha256sum" will this code actually act on it, just doing exactly what the md5 check code does. ✅
This will be handy when safetensors support is added, as Civitai displays sha256 hashes of the safetensor by default.
For reference I injected this into the worker dependencies directly adding to the requirements.txt: git+https://github.com/db0/nataili.git@refs/pull/4/merge#egg=nataili
Nataili currently uses MD5 for its hashing algorithm, which is a security risk. The MD5 algorithm is no longer considered resistant to collisions as of 2010.
This security risk is probably acceptable in many cases, but it should not be considered acceptable for distributing .cpkt files. Checkpoint files are expected to be many GBs of random-looking data, which is the ideal file contents to distribute malware along with GBs of random data that happens to generate an MD5 collision.
A move to SHA256 could greatly help reduce this threat. SHA256 is secure for untrusted data, commonly used, and available in standard library. SHA256 is also implemented by HuggingFace, so perhaps there could be some code written to check calculated hashes against HuggingFace hashes when hashes are not available for a model downloaded from HF. There may be some debate over whether Nataili should use a faster hash function like BLAKE3, but BLAKE3 is not in the standard library and I didn't want to introduce any new dependencies.
This PR does not break current behaviour. It does not require a file to have an associated hash (which is a security risk), and also falls back to MD5 when SHA256 is not available. In the future perhaps this behaviour can be changed to be more strict.
Unit tests are provided but are not comprehensive. I did not see much testing around the model manager code.