aboutcode-org / commoncode

A library of common functions shared in many other AboutCode projects
3 stars 12 forks source link

Hash is unexpectedly None for empty files #69

Open stefan6419846 opened 1 week ago

stefan6419846 commented 1 week ago

Generating hashes for empty files will always return None, which is not documented and different from the usual hashing algorithms as well as contradicting the SPDX standard.

Example:

from commoncode.hash import sha1
from hashlib import sha1 as sha1_hashlib
from tempfile import NamedTemporaryFile

with NamedTemporaryFile() as temporary_file:
    temporary_file.write(b'')
    temporary_file.seek(0)
    print(sha1(location=temporary_file.name))
    print(sha1_hashlib(string=temporary_file.read(), used_for_security=False).hexdigest())

The reason seems to be that https://github.com/aboutcode-org/commoncode/blob/878be6140deac30e2b95fb0fad9eb8feca015fc8/src/commoncode/hash.py#L38 does not use msg is not None, but basically bool(msg), which is False for empty inputs as well.

Replacing the line with

            self.h = msg is not None and hmodule(msg).digest()[:self.digest_size] or None

(as well as replacing the same pattern in sha1_git_hasher) seems to fix this issue.

pombredanne commented 1 week ago

Stefan: Thanks for the report! This is intentional because there is a lot of emptiness out there! Returning None for empty files has been a design choice to avoid creating dummy checksums.

FWIW, this predates SPDX and this library is not SPDX specific .... With this said, this has not been documented correctly and should be documented. I will keep this open until we document this alright.

And for the SPDX compatibility, this is handled when creating SPDX documents in ScanCode Toolkit at https://github.com/aboutcode-org/scancode-toolkit/blob/498467c1e5f677edc20e92b12910f16d771ccad3/src/formattedcode/output_spdx.py#L255 where we effectively return a SHA1 for empty files there. (And in the future, other checksums, hardcoded in a similar fashion).

stefan6419846 commented 1 week ago

Thanks for the explanation. Then this apparently is just a matter of missing/improper documentation.