AsuharietYgvar / AppleNeuralHash2ONNX

Convert Apple NeuralHash model for CSAM Detection to ONNX.
Apache License 2.0

Large scale testing #2

Open hackerfactor opened 2 years ago

hackerfactor commented 2 years ago

I think collaborating would be fun.

My issues: I'm not much of a Python person, my Mac doesn't have their NeuralHash files, and my iPhone isn't jailbroken.

My offerings: I run FotoForensics and I have nearly 5 million pictures.

I read in the issues that the NCMEC hash can also be extracted from a jailbroken iPhone. That's the thing we should be testing against.

If someone can package everything up into a minimal Docker image and point me to a simple command line (find /mnt/allpictures -type f | run_this_command), then I can do a large-scale test using a real-world collection of pictures. This won't be an immediate result (depending on the code's speed, it might take weeks or months to run), but I can look for real-world false-positive matches.
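For illustration, the "read paths from stdin, print one hash per file" command could be sketched in Python as below. This is only a minimal sketch: `hash_paths` and the `hash_file` callable are hypothetical names, and `hash_file` stands in for whatever code actually loads the model and returns a hash string for one image. Failures are logged and skipped so that a single unreadable file does not stop a run that may take weeks.

```python
import sys
from typing import Callable, Iterable, TextIO


def hash_paths(paths: Iterable[str], hash_file: Callable[[str], str],
               out: TextIO = sys.stdout, err: TextIO = sys.stderr) -> int:
    """Hash each path; write 'hash<TAB>path' lines and return the failure count.

    hash_file is a placeholder (an assumption, not part of any real tool):
    it takes a file path and returns the hash as a string.
    """
    failures = 0
    for line in paths:
        path = line.rstrip("\n")
        if not path:
            continue
        try:
            digest = hash_file(path)
        except Exception as exc:
            # A file that cannot be decoded is recorded and skipped.
            failures += 1
            print(f"ERROR\t{path}\t{exc!r}", file=err)
            continue
        print(f"{digest}\t{path}", file=out)
    return failures
```

It would be driven as `find /mnt/allpictures -type f | python3 driver.py`, with the script calling `hash_paths(sys.stdin, hash_file)`. Note that a Python `try`/`except` only catches Python-level errors; a crash inside a native image library would still kill the process, so a long run would also want some form of restart or per-file subprocess isolation.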

hackerfactor commented 2 years ago

Also, FotoForensics includes some "hostile" pictures. (JPEGs that break libjpeg, PNGs that break libpng, GIFs that break libgif, etc. People aren't nice to my site.) It will be interesting to see if Apple's code survives some of these hostile images.

AsuharietYgvar commented 2 years ago

Please go through the README if you want to extract the model. It's very simple and does not require any Apple devices.

TBH I don't think it makes much sense to run the algorithm on millions of real-world images. The probability of encountering even a single exact collision is very slim, considering that there are 2^96 possible hashes. What we need to do to break it is to "create" a collision rather than "find" one.
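To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes the hashes behave like uniformly random 96-bit values, which is an idealisation that a perceptual hash is not designed to satisfy, so the results are a lower bound on intuition rather than a prediction. The database size `m` is a made-up parameter, since the real size is not stated here.

```python
import math

HASH_BITS = 96
SPACE = 2 ** HASH_BITS  # number of possible hashes under the uniform assumption


def p_any_pair_collides(n: int, space: int = SPACE) -> float:
    """Birthday approximation: P(at least one exact collision among n hashes)."""
    # 1 - exp(-n(n-1) / (2 * space)); expm1 keeps precision for tiny values.
    return -math.expm1(-n * (n - 1) / (2 * space))


def p_hits_database(n: int, m: int, space: int = SPACE) -> float:
    """P(at least one of n images exactly matches one of m fixed target hashes)."""
    # 1 - (1 - m/space)^n, computed in log space for the same reason.
    return -math.expm1(n * math.log1p(-m / space))


if __name__ == "__main__":
    n = 5_000_000  # size of the picture collection mentioned in this thread
    m = 1_000_000  # hypothetical database size, for illustration only
    print(f"any pair collides:     {p_any_pair_collides(n):.3e}")
    print(f"any image hits the DB: {p_hits_database(n, m):.3e}")
```

Under the uniform assumption, the pairwise figure for 5 million images comes out on the order of 1e-16, which is the sense in which finding an exact collision by chance is hopeless. The figure says nothing about near-collisions or about images that are visually similar.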

lukas-schwab commented 2 years ago

I believe the chances of one of those 5 million images being a collision would be about 15 trillion to 1. So next to nothing.

(Take this with a grain of salt; I'm not good at maths.)

ProfFan commented 2 years ago

I think it is actually very likely that there will be colliding hashes up to a few bits' difference: this is the fundamental limitation of "trained" hashes. Judging from the network architecture and quantization technique, my guess is that there is a 99% chance the large-scale testing will find at least one collision within 2 bits.
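A "collision within a few bits" can be checked by counting differing bits (the Hamming distance) between two hashes. The following is a minimal sketch, assuming each hash is available as a hex string of equal length; the function names and the sample values are invented for illustration.

```python
from itertools import combinations


def hamming(hex_a: str, hex_b: str) -> int:
    """Number of differing bits between two equal-length hex hash strings."""
    if len(hex_a) != len(hex_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


def near_collisions(hashes: dict[str, str], max_bits: int = 2):
    """Yield (name_a, name_b, distance) for every pair within max_bits."""
    for (name_a, h_a), (name_b, h_b) in combinations(hashes.items(), 2):
        distance = hamming(h_a, h_b)
        if distance <= max_bits:
            yield name_a, name_b, distance


if __name__ == "__main__":
    # Made-up 96-bit values, not real hashes of any image.
    sample = {
        "a.jpg": "0123456789abcdef01234567",
        "b.jpg": "0123456789abcdef01234564",
        "c.jpg": "fedcba9876543210fedcba98",
    }
    for pair in near_collisions(sample):
        print(pair)
```

The pairwise loop is quadratic, so it is only usable on small samples. With 5 million hashes there are about 1.25e13 pairs, and a real run would need an index built for Hamming-distance search instead.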

adamburgess commented 2 years ago

@hackerfactor

curl https://placekitten.com/200/140 > cat.jpg
docker run --rm --mount type=bind,source="$(pwd)/cat.jpg",target=/cat.jpg aburgess/apple-neural-hash cat.jpg

The Dockerfile is nothing special, though following the instructions to get the model was a bit of a hassle... Anyone can easily reuse the 2 files from this Docker image.

FROM python:3-slim

ADD model.onnx neuralhash_128x96_seed1.dat /
ADD https://github.com/AsuharietYgvar/AppleNeuralHash2ONNX/raw/master/nnhash.py nnhash.py
RUN pip install --no-cache-dir onnxruntime pillow
ENTRYPOINT ["/usr/local/bin/python3", "nnhash.py", "model.onnx", "neuralhash_128x96_seed1.dat"]

apple-neural-hash.zip