facebook / ThreatExchange

Trust & Safety tools for working together to fight digital harms.
https://developers.facebook.com/docs/threat-exchange

[pytx] No match results if creating a local_file with only 1 hash in it #1318

Open Dcallies opened 1 year ago

Dcallies commented 1 year ago

Repro:

$ tx hash photo pdq/data/bridge-mods/aaa-orig.jpg >> local_file.txt
$ tx config collab edit local_file file_backed_bank.txt --filename ~/file_backed_bank.txt  --create
$ tx fetch
$ tx match photo pdq/data/bridge-mods/aaa-orig.jpg 

Expected: at least one match. Actual: no matches. However, oddly enough, adding a second hash allows all hashes to match:

$ tx hash photo pdq/data/bridge-mods/blur-a-little.jpg >> local_file.txt
$ tx fetch
$ tx match -A photo pdq/data/bridge-mods/aaa-orig.jpg 
pdq 4 (file_backed_bank.txt) INVESTIGATION_SEED
pdq 0 (file_backed_bank.txt) INVESTIGATION_SEED
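
For reference, here is a minimal brute-force sketch of what a one-hash bank should return. It is not pytx's index code, only a linear scan over Hamming distances to compare the index against. The function names are hypothetical, and the default threshold of 31 is my assumption about the PDQ match threshold. The numbers in the output above (`4` and `0`) look like these distances, but I have not confirmed that.

```python
from typing import List, Tuple

PDQ_HEX_LENGTH = 64  # a PDQ hash is 256 bits, written as 64 hex characters


def pdq_hamming_distance(hash_a: str, hash_b: str) -> int:
    """Count the differing bits between two hex-encoded PDQ hashes."""
    if len(hash_a) != PDQ_HEX_LENGTH or len(hash_b) != PDQ_HEX_LENGTH:
        raise ValueError("expected 64 hex characters per PDQ hash")
    # XOR leaves a 1 bit wherever the two hashes disagree.
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def brute_force_match(
    query: str,
    banked: List[str],
    threshold: int = 31,  # assumed default; adjust to whatever the index uses
) -> List[Tuple[str, int]]:
    """Linear scan: return (hash, distance) for each banked hash within threshold."""
    matches = []
    for candidate in banked:
        distance = pdq_hamming_distance(query, candidate)
        if distance <= threshold:
            matches.append((candidate, distance))
    return matches
```

With a bank holding a single hash, `brute_force_match(h, [h])` returns `[(h, 0)]`, so an empty result from the real index for the same input points at the index rather than at the hash.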

jagraff commented 6 months ago

This bug shows up in HMA as well; adding a repro in case it's helpful. For this repro to work, HMA (previously OpenMediaMatch) should be running as a Docker container and serving on localhost:8080.

Reset the tables:

$ docker-compose exec app flask --app OpenMediaMatch.app reset_all_tables
[2024-03-13 18:10:12,303] WARNING in app: No storage class provided, using the default

Create a bank:

$ curl --location 'localhost:8080/c/banks' \
--header 'Content-Type: application/json' \
--data '{
    "name": "EVIL_CONTENT_BANK"
}'
{"matching_enabled_ratio":1.0,"name":"EVIL_CONTENT_BANK"}

Add a file to the bank:

$ curl --location 'localhost:8080/c/bank/EVIL_CONTENT_BANK/content' \
--form 'photo=@"<photo path>"'
{"id":1,"signals":{"pdq":"3517f92351b0e69170c9656ba70c1249d258926d6d65bd2cbcb49cb34bd1c4fb"}}

Rebuild indexes:

$ docker-compose exec app flask --app OpenMediaMatch.app build_indices
[2024-03-13 18:12:21,582] WARNING in app: No storage class provided, using the default
[2024-03-13 18:12:21,596] INFO in build_index: Running the build_all_indices background task
[2024-03-13 18:12:21,628] INFO in build_index: Building index for pdq (1 signals)
[2024-03-13 18:12:21,630] INFO in build_index: Indexed 1 signals for pdq - 0 seconds
[2024-03-13 18:12:21,631] DEBUG in database: Index[pdq] serializing index to tmpfile /tmp/tmp9_c80zmc
[2024-03-13 18:12:21,631] DEBUG in database: Index[pdq] finished writing to tmpfile, 1 signals 889 bytes - 0 seconds
[2024-03-13 18:12:21,635] DEBUG in database: Index[pdq] imported tmpfile as lobject oid 16750 - 0 seconds
[2024-03-13 18:12:21,635] DEBUG in database: Index[pdq] deallocating old lobject 16747
[2024-03-13 18:12:21,636] DEBUG in database: Index[pdq] cleaned up tmpfile
[2024-03-13 18:12:21,639] INFO in build_index: video_md5 index up to date, no build needed
[2024-03-13 18:12:21,639] INFO in build_index: Completed build_all_indices background task - 0 seconds

Query the bank:

$ curl --location 'localhost:8080/m/lookup?signal_type=pdq&signal=3517f92351b0e69170c9656ba70c1249d258926d6d65bd2cbcb49cb34bd1c4fb'
[]

As you can see, the lookup incorrectly returns no matches, even though the queried hash is exactly the one that was just added to the bank. Adding a second photo, reindexing, and then querying again returns a match:

$ curl --location 'localhost:8080/c/bank/EVIL_CONTENT_BANK/content' \
--form 'photo=@"<second photo path>"'
{"id":2,"signals":{"pdq":"cddcc471737d333771469b9e4c119ce6526e52753f86d1239290469b499941be"}}

$ docker-compose exec app flask --app OpenMediaMatch.app build_indices
[2024-03-13 18:14:20,080] WARNING in app: No storage class provided, using the default
[2024-03-13 18:14:20,093] INFO in build_index: Running the build_all_indices background task
[2024-03-13 18:14:20,124] INFO in build_index: Building index for pdq (3 signals)
[2024-03-13 18:14:20,126] INFO in build_index: Indexed 3 signals for pdq - 0 seconds
[2024-03-13 18:14:20,127] DEBUG in database: Index[pdq] serializing index to tmpfile /tmp/tmp_42xfk4q
[2024-03-13 18:14:20,127] DEBUG in database: Index[pdq] finished writing to tmpfile, 3 signals 1176 bytes - 0 seconds
[2024-03-13 18:14:20,132] DEBUG in database: Index[pdq] imported tmpfile as lobject oid 16751 - 0 seconds
[2024-03-13 18:14:20,132] DEBUG in database: Index[pdq] deallocating old lobject 16750
[2024-03-13 18:14:20,134] DEBUG in database: Index[pdq] cleaned up tmpfile
[2024-03-13 18:14:20,137] INFO in build_index: video_md5 index up to date, no build needed
[2024-03-13 18:14:20,137] INFO in build_index: Completed build_all_indices background task - 0 seconds

$ curl --location 'localhost:8080/m/lookup?signal_type=pdq&signal=3517f92351b0e69170c9656ba70c1249d258926d6d65bd2cbcb49cb34bd1c4fb'
["EVIL_CONTENT_BANK"]
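
In case it helps turn this into a regression check, here is a rough Python sketch of the lookup step. It only assumes the `/m/lookup` endpoint and the JSON list response shown in the curl calls above. The helper names `build_lookup_url` and `lookup_banks` are made up for illustration and are not part of HMA.

```python
import json
import urllib.parse
import urllib.request
from typing import List


def build_lookup_url(base_url: str, signal_type: str, signal: str) -> str:
    """Build the same lookup URL that the curl commands above use."""
    query = urllib.parse.urlencode({"signal_type": signal_type, "signal": signal})
    return f"{base_url.rstrip('/')}/m/lookup?{query}"


def lookup_banks(
    base_url: str, signal_type: str, signal: str, timeout: float = 10.0
) -> List[str]:
    """Query a running HMA instance and return the parsed JSON list of bank names."""
    url = build_lookup_url(base_url, signal_type, signal)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Against the repro above, `lookup_banks("http://localhost:8080", "pdq", "3517f92351b0e69170c9656ba70c1249d258926d6d65bd2cbcb49cb34bd1c4fb")` should return `["EVIL_CONTENT_BANK"]` right after the first upload and index build; the bug is that it returns `[]` until a second photo is added.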