Open hackerfactor opened 2 years ago
Also, FotoForensics includes some "hostile" pictures. (JPEGs that break libjpeg, PNGs that break libpng, GIFs that break libgif, etc. People aren't nice to my site.) It will be interesting to see if Apple's code survives some of these hostile images.
Please go through the README if you want to extract the model. It's very simple and does not require any Apple devices.
TBH I don't think it makes much sense running the algorithm on millions of real-world images. The possibility of encountering a single collision is very slim considering that there are 2^96 possible hashes. What we need to break it is to "create" a collision instead of "find" one.
I believe the chances of one of those 5 million images being a collision would be about 15 trillion to 1. So next to nothing.
(Take this with a grain of salt; I'm not good at maths.)
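For anyone who wants to check the arithmetic, here is a rough birthday-bound sketch. It assumes the 96-bit hashes are uniformly distributed and independent, which a learned perceptual hash almost certainly is not, so treat it as an idealized estimate only. The function name and constants are mine, not from the repo. Under that assumption the chance of any exact collision among 5 million images comes out to roughly 1.6e-16, which is even smaller than the "15 trillion to 1" figure above.

```python
import math

HASH_BITS = 96          # hash length mentioned in the thread (2^96 possible hashes)
NUM_IMAGES = 5_000_000  # approximate size of the FotoForensics collection


def collision_probability(num_items: int, hash_bits: int) -> float:
    """Birthday-bound estimate: P(at least one exact collision).

    Assumes hashes are uniform and independent (an idealization).
    Uses p ~= 1 - exp(-pairs / space), computed with expm1 for accuracy
    when the probability is tiny.
    """
    space = 2 ** hash_bits
    pairs = num_items * (num_items - 1) // 2
    return -math.expm1(-pairs / space)


p = collision_probability(NUM_IMAGES, HASH_BITS)
print(f"P(exact collision among {NUM_IMAGES:,} images) ~= {p:.3e}")
```

This only covers exact matches between uniformly random hashes. It says nothing about visually similar images, which are designed to hash alike.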
I think it is actually very likely to have colliding hashes up to a few bits' difference: this is the fundamental limitation of "trained" hashes. Judging from the network architecture and quantization technique, my guess is that there is a 99% chance that large-scale testing will find at least one 2-bit collision.
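A "2-bit collision" here means two hashes whose Hamming distance is at most 2. A minimal sketch of that check, assuming the hashes are compared as equal-length hex strings (24 hex digits for 96 bits); the function names and the placeholder hashes below are made up for illustration and are not real NeuralHash outputs:

```python
def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Number of differing bits between two equal-length hex hash strings."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 bit wherever the two hashes differ; count those bits.
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def is_near_collision(hash_a: str, hash_b: str, max_bits: int = 2) -> bool:
    """True if the hashes differ in at most max_bits bit positions."""
    return hamming_distance(hash_a, hash_b) <= max_bits


# Placeholder 96-bit hashes (illustrative only).
h1 = "0" * 24
h2 = "0" * 23 + "3"   # differs from h1 in the two lowest bits
print(hamming_distance(h1, h2), is_near_collision(h1, h2))
```

Comparing every pair this way is quadratic in the number of images, so a large-scale test would need some kind of bucketing or indexing on top of it.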
@hackerfactor
curl https://placekitten.com/200/140 > cat.jpg
docker run --rm --mount type=bind,source=`pwd`/cat.jpg,target=/cat.jpg aburgess/apple-neural-hash cat.jpg
The Dockerfile is nothing special, though following the instructions to get the model was a bit of a hassle... Anyone can easily reuse the two files from this Docker image.
FROM python:3-slim
ADD model.onnx neuralhash_128x96_seed1.dat /
ADD https://github.com/AsuharietYgvar/AppleNeuralHash2ONNX/raw/master/nnhash.py nnhash.py
RUN pip install --no-cache-dir onnxruntime pillow
ENTRYPOINT ["/usr/local/bin/python3", "nnhash.py", "model.onnx", "neuralhash_128x96_seed1.dat"]
I think collaborating would be fun.
My issues: I'm not much of a Python person, my Mac doesn't have their NeuralHash files, and my iPhone isn't jailbroken.
My offerings: I run FotoForensics and I have nearly 5 million pictures.
I read in the issues that the NCMEC hash can also be extracted from a jailbroken iPhone. That's the thing we should be testing against.
If someone can package everything up into a minimal docker image, point me to a simple command line (find /mnt/allpictures -type f | run_this_command), then I can do a large-scale test using a real-world collection of pictures. This won't be an immediate result (depending on the code's speed, it might take weeks or months to run), but I can look for real-world false-positive matches.
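One possible shape for the "run_this_command" side is sketched below: take one file path per line, hash each file, and group paths by hash. This is only a sketch under assumptions. `find_collisions` and its `hash_fn` parameter are hypothetical names, and `hash_fn` stands in for whatever actually computes the hash (for example a wrapper around nnhash.py). Failures are recorded instead of aborting the run, since some of the pictures are deliberately hostile.

```python
from collections import defaultdict


def find_collisions(paths, hash_fn):
    """Group file paths by hash and report groups with more than one file.

    paths   -- iterable of path strings, one per item (e.g. lines from stdin)
    hash_fn -- hypothetical callable: path -> hash string; may raise on bad input
    Returns (collisions, failed): a dict of hash -> list of paths, and a list
    of (path, error) pairs for files that could not be hashed.
    """
    by_hash = defaultdict(list)
    failed = []
    for line in paths:
        path = line.rstrip("\n")
        if not path:
            continue
        try:
            by_hash[hash_fn(path)].append(path)
        except Exception as exc:  # hostile or corrupt image: log and keep going
            failed.append((path, repr(exc)))
    collisions = {h: p for h, p in by_hash.items() if len(p) > 1}
    return collisions, failed


# Usage idea (not executed here): pass sys.stdin as `paths`, so that
#   find /mnt/allpictures -type f | python3 this_script.py
# feeds one path per line.
```

Byte-identical duplicate files will also show up as "collisions", so it probably makes sense to deduplicate by a regular file digest first and only look at the groups that remain.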