Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent
GNU General Public License v3.0

return duplicate detection result by providing a file #173

Closed. SmartLayer closed this issue 3 years ago

SmartLayer commented 3 years ago

Is bees able to solve this problem:

For a newly created file, find out almost instantly if it is a duplicate of an existing file in a btrfs filesystem.

I know bees would silently dedupe a newly created duplicate file during its normal operation, but what I need here is to tell whether the file is a duplicate, because if it is, it will not be added to the file system at all.

The use case is:

The web server uses btrfs to store the website's data under /var/www (the main file system / is not using btrfs).


Furthermore, given that bees maintains a hash table, does that mean the csums stored in btrfs are insufficient to determine, in real time, whether a new block is a duplicate of any existing block in the system?

I thought that if, for two files with the same content (no sharing of extents), btrfs inspect-internal dump-csum would consistently dump the same array of csums, then a tool should be able to dump the csums of the newly uploaded file and, for every csum, index-search the existing filesystem for extents with the same csum, and thereby determine whether the file is a duplicate, without the need for a hash database that can potentially go out of sync. Correct?

kakra commented 3 years ago

I don't think that's a use-case for bees: You cannot use bees to deny duplicate user uploads.

I suggest researching the topic of content-addressable storage instead: after receiving a web upload, create an SHA-512 hash of the upload and store the file under that name, and record the original name in a database that stores all the metadata (hash, upload date, original file name, whatever). If you want to allow multiple uploads of the same file but store them only once, add a reference-counter column to that database so you know how often each file is referenced. If you don't care about the original filename, just store the file under the hashed name and check for its existence on upload.

It doesn't make sense to try to exploit btrfs and bees for that. There is never a guarantee that bees will actually dedupe a file, because it may contain toxic hashes, so that's not a good indicator for the policy you're trying to apply. Also, a new upload may be partially contained in an existing upload, resulting in a false alarm because bees would deduplicate it and trigger your "policy".
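
For illustration, a minimal Python sketch of that approach; the store path, the SQLite schema, and the function names are assumptions for the example, not anything provided by bees or a particular web framework:

```python
# Sketch of content-addressable upload storage: hash the upload, store it
# once under its hash, and count repeated uploads. Paths and schema are
# illustrative assumptions.
import hashlib
import os
import shutil
import sqlite3

STORE_DIR = "/var/www/store"          # assumed blob directory
DB_PATH = "/var/www/store/index.db"   # assumed metadata database

def init_db():
    db = sqlite3.connect(DB_PATH)
    db.execute("""CREATE TABLE IF NOT EXISTS uploads (
                    hash TEXT PRIMARY KEY,
                    original_name TEXT,
                    refcount INTEGER NOT NULL DEFAULT 0)""")
    return db

def store_upload(tmp_path, original_name, db):
    """Hash the uploaded file, keep one copy under its hash name,
    and bump a reference counter for repeated uploads."""
    h = hashlib.sha512()
    with open(tmp_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    digest = h.hexdigest()

    blob_path = os.path.join(STORE_DIR, digest)
    is_duplicate = os.path.exists(blob_path)
    if not is_duplicate:
        shutil.move(tmp_path, blob_path)   # first copy: keep it
    else:
        os.remove(tmp_path)                # duplicate: drop the temp file

    db.execute("""INSERT INTO uploads (hash, original_name, refcount)
                  VALUES (?, ?, 1)
                  ON CONFLICT(hash) DO UPDATE SET refcount = refcount + 1""",
               (digest, original_name))
    db.commit()
    return digest, is_duplicate
```

With this layout the duplicate check is a single existence test on the hashed path, independent of what btrfs or bees do underneath.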

kakra commented 3 years ago

Furthermore, given that bees maintains a hash table, does that mean the csums stored in btrfs are insufficient to determine, in real time, whether a new block is a duplicate of any existing block in the system?

Yes and no: The hash table may contain hashes of files that no longer exist, and it may forget hashes for files that haven't been deduplicated in a while. Also, the hash table is about blocks, not files. Actually, bees doesn't care about files at all: it looks for duplicate chains of extents/blocks, then looks up a file by extent address to actually open that file, and this can be any file that shares that extent.

The web-upload policy you're asking for is about files; bees cannot answer that, and neither can btrfs: both deduplicate extents, not files. The hash table thus cannot be used as a reverse lookup table to find duplicate files. It can also only answer whether a new block is a likely duplicate of a KNOWN block; it doesn't know about ANY block in the file system. Also, the block may have gone away already, so it may actually no longer exist. And due to the toxic-extent-hash counter-measures, the new block may actually not be the only block in the FS with that csum: other copies may exist, too.

I thought that if, for two files with the same content (no sharing of extents), btrfs inspect-internal dump-csum

Bees maintains this hash database file because btrfs itself doesn't have a reverse lookup from csum to extents. Dumping a whole csum tree is a slow process; you could do that, but you probably don't want to do it for every uploaded file. Being able to dump the csum tree doesn't mean you can do a real-time lookup of a specific csum; that is just not how the information is addressed. Actually, that is what bees does: it makes csums addressable back to extents, but it needs to verify that the information is still valid when it uses it, because the hash table and the FS will be out of sync all the time.

BTW: There is no inspect-internal dump-csum...

Zygo commented 3 years ago

There is no inspect-internal dump-csum...

There is a patch in the dduper package which implements this feature. It is the core of the dduper implementation: dduper dumps out the csum tree entries for each file and stores them in memory, then performs a traditional hash table matching dedupe algorithm using the csums. It is prone to problems with race conditions, but they don't hurt dduper very much (only missed dedupes and scary error messages).
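
As a rough illustration of that matching step (this is not dduper's actual code; how the csums are obtained is out of scope here, and the per-file csum lists are assumed inputs):

```python
# Sketch of "traditional hash table matching" over per-file block csums.
from collections import defaultdict

def find_duplicate_blocks(file_csums):
    """file_csums: dict mapping filename -> list of per-block csums.
    Returns csum -> list of (filename, block_index) for csums seen in
    more than one place; these are only dedupe candidates and still need
    byte-for-byte verification before any dedupe is submitted."""
    table = defaultdict(list)
    for name, csums in file_csums.items():
        for idx, csum in enumerate(csums):
            table[csum].append((name, idx))
    return {c: locs for c, locs in table.items() if len(locs) > 1}

# Example with made-up 32-bit csums:
candidates = find_duplicate_blocks({
    "a.bin": [0xdeadbeef, 0x12345678, 0xcafef00d],
    "b.bin": [0x12345678, 0xcafef00d, 0x0badf00d],
})
```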

In addition to all the other stuff mentioned:

Even high-entropy files like JPEGs and video files occasionally have identical 4K blocks in them. Pictures with large dark areas usually encode as long runs of zeros or constant non-zero byte values. If you exclude files by the content of a single 4K block, you may have false positives and exclude files you did not intend. If the users uploading files know about your exclusion policy they may attempt to exploit it.

btrfs stores csums, but they are not indexed to be searchable. This is why bees needs a hash table--it is the logical opposite of the btrfs csum tree. The csum tree maps from block location to csum, and the bees hash table maps from csum to block location. There is no sane way to maintain a complete on-disk hash table at the block level (ZFS and duperemove are insane, VDO and bees don't maintain complete tables, dduper and the defunct kernel dedupe rebuild the entire hash table in RAM every time).
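
To illustrate the two lookup directions with toy data structures (these stand in for, and are far simpler than, the real on-disk formats of btrfs and bees):

```python
# btrfs csum tree direction: block address -> csum
# (fast answer to "what is the csum of block X?")
csum_tree = {0x1000: 0xaaaa, 0x2000: 0xbbbb, 0x3000: 0xaaaa}

# bees-style hash table direction: csum -> block address
# (fast answer to "where have I seen csum Y before?")
hash_table = {}
for addr, csum in csum_tree.items():
    hash_table.setdefault(csum, addr)   # keep one candidate address per csum

def likely_duplicate(new_block_csum):
    """Answers only whether the csum matches a KNOWN block; the candidate
    must still be re-read and verified, because the table can be stale
    relative to the filesystem."""
    return hash_table.get(new_block_csum)  # candidate address or None
```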

SmartLayer commented 3 years ago

I don't think that's a use-case for bees: You cannot use bees to deny duplicate user uploads.

I suggest researching the topic of content-addressable storage instead

Your understanding of the use case is spot on, and thank you for sharing your thoughts on it. What I didn't mention is that there are other workflows besides the website that involve these files, and not all files are uploaded by users, so building a layer on top of the file system would break other uses of the same file repository.

Bees maintains this hash database file because btrfs itself doesn't have a reverse lookup from csum to extents.

Very important knowledge.

I found a quick solution for this problem. Having run through 10,000 media files averaging 1 GB each (which is typical of my use case), there are only 2 files with the same size but different content (versus roughly 1,000 duplicated files with the same size and the same content), and all 10,000 file sizes are returned in 1 second. So the web system will check for a file with the same size, and run a checksum only if a same-size file is found. For a new unique video there is a 99.99…% chance that it has a unique size and will be determined as unique within 1 second. That is, if the check is not exploited, but there is always a way to exploit it (e.g. by uploading a video with different metadata in the MP4 container), so this is mostly for user experience anyway.
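
A minimal sketch of that size-first check (the repository root and the hash choice are assumptions for the example):

```python
# Size-first duplicate check: the size scan is cheap (stat only);
# full-content hashing happens only on size ties.
import hashlib
import os

MEDIA_ROOT = "/var/www/media"   # assumed repository root

def sha256_of(path, bufsize=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(bufsize), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicate(upload_path):
    """Return the path of an existing file with identical content, or None."""
    size = os.path.getsize(upload_path)
    same_size = []
    for dirpath, _dirs, names in os.walk(MEDIA_ROOT):
        for name in names:
            p = os.path.join(dirpath, name)
            if os.path.getsize(p) == size:
                same_size.append(p)
    if not same_size:
        return None                      # unique size, accept immediately
    upload_hash = sha256_of(upload_path)
    for p in same_size:
        if sha256_of(p) == upload_hash:
            return p
    return None
```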

kakra commented 3 years ago

Yes, file size is already a very good indicator if your file sizes tend to be random. Comparing the remaining files can be fast, and you can further optimize that by maintaining an out-of-process indexer which watches for new files and maintains an index of (path, filesize, hash). If you make that a background service, you could query it from both your web backend and the other users of the file repository. Some locking would be needed to avoid a race condition between detecting a new file and writing it to the index, so a web user cannot exploit the temporary lack of knowledge about newly added files. Your current approach of searching the file tree and comparing the files has a similar problem. Keeping pre-calculated hashes in an index will speed up the process. You could also implement that index as a cache: if a file is indexed, you're fine; if not, only then fall back to searching the file system, then store the result in the index.
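
A minimal sketch of such an index as an SQLite table of (path, size, hash); the schema, paths, and refresh strategy are assumptions, and the locking mentioned above is left out here but would still be needed:

```python
# Background index of (path, size, hash) that the web backend can query
# instead of re-hashing the whole repository on every upload.
import hashlib
import os
import sqlite3

MEDIA_ROOT = "/var/www/media"           # assumed repository root
INDEX_DB = "/var/lib/media-index.db"    # assumed index location

def open_index():
    db = sqlite3.connect(INDEX_DB)
    db.execute("""CREATE TABLE IF NOT EXISTS files (
                    path TEXT PRIMARY KEY,
                    size INTEGER NOT NULL,
                    hash TEXT NOT NULL)""")
    db.execute("CREATE INDEX IF NOT EXISTS files_by_size ON files(size)")
    return db

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def refresh_index(db):
    """Background pass: hash and record files not yet in the index."""
    for dirpath, _dirs, names in os.walk(MEDIA_ROOT):
        for name in names:
            p = os.path.join(dirpath, name)
            if db.execute("SELECT 1 FROM files WHERE path = ?", (p,)).fetchone():
                continue
            db.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
                       (p, os.path.getsize(p), sha256_of(p)))
    db.commit()

def lookup_duplicate(db, upload_path):
    """Query path for the web backend: cheap size filter, then hash compare."""
    size = os.path.getsize(upload_path)
    rows = db.execute("SELECT path, hash FROM files WHERE size = ?",
                      (size,)).fetchall()
    if not rows:
        return None
    upload_hash = sha256_of(upload_path)
    return next((path for path, h in rows if h == upload_hash), None)
```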