To help verify data transfers, it would be desirable to have a checksum for every file.
Background discussion in this thread.
Bearing in mind that checksumming can be slow, we might explore what is possible with HDFS's built-in block checksums (based on CRC32C). Perhaps it's possible to combine these into a file-level checksum? From here:
By default when using Hadoop, all API-exposed checksums take the form of an MD5 of a concatenation of chunk CRC32Cs, either at the block level through the low-level DataTransferProtocol, or at the file level through the top-level FileSystem interface. A file-level checksum is defined as the MD5 of the concatenation of all the block checksums, each of which is an MD5 of a concatenation of chunk CRCs, and is therefore referred to as an MD5MD5CRC32FileChecksum. This is effectively an on-demand, three-layer Merkle tree.
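For reference, a minimal sketch of retrieving that file-level checksum through the Hadoop FileSystem API. The path is a placeholder and the class scaffolding is illustrative; only the `FileSystem.getFileChecksum(Path)` call is the actual API being discussed:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileChecksumExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder path; point this at a real HDFS file.
        Path file = new Path("hdfs:///user/example/data.bin");
        FileSystem fs = file.getFileSystem(conf);

        // On HDFS this returns an MD5MD5CRC32FileChecksum: the MD5 of the
        // concatenated per-block checksums, each of which is itself an MD5
        // of that block's chunk CRC32Cs.
        FileChecksum checksum = fs.getFileChecksum(file);
        if (checksum != null) {
            System.out.println(checksum.getAlgorithmName() + ": " + checksum);
        } else {
            // Per the FileSystem javadoc, implementations that do not
            // support checksums return null here.
            System.out.println("No checksum available for " + file);
        }
    }
}
```

One caveat: because the value hashes per-block checksums, it depends on the cluster's block and chunk size configuration, so two copies of the same bytes only produce matching checksums when those settings agree.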