facebook / ThreatExchange

Trust & Safety tools for working together to fight digital harms.
https://developers.facebook.com/docs/threat-exchange
Other
1.15k stars 300 forks source link

[py-tx] inconsistency in `hash` and `match --hashes` format #1188

Open master-hax opened 1 year ago

master-hax commented 1 year ago

threatexchange hash photo <path-to-image-file> outputs hashes in the form

pdq f5b44e4f1673f4e44e57fd6d021201effd8b40d6b54906d3dc302df672160708

however

threatexchange match --hashes photo <path-to-hash-file> expects a single file without a "pdq " prefix

this causes an inconsistency where hashes generated by threatexchange hash can't be directly used in threatexchange match --hashes

Dcallies commented 1 year ago

Hey @master-hax, thanks for reaching out!

This is currently intended behavior, however, providing consistency between the file collab type and --hashes might be useful.

--hashes should be requiring you to also do -S to pick the type.

The way I've done it is using cut -d ' ' -f 2 to get only the hashes.

Another trick might be to do something like dataset -p does, which is not print columns that you are explicitly selecting (since otherwise | grep would have worked).

@master-hax some questions:

  1. Would your expectation be that hash and match be directly compatible?
  2. What would the behavior be if someone did threatexchange match --hashes -S pdq photo -- text is not pdq
master-hax commented 1 year ago

Hey @Dcallies, thanks for the fast response

responses:

  1. I expect hash and match --hashes commands to be compatible even if they are not compatible by default. e.g. maybe we could add a flag to hash like hash --raw or hash --no-prefix that outputs the hashes in a match compatible format? (without the "pdq " prefix)
  2. Due to the -S pdq args, i think threatexchange should attempt to interpret the string "text is not pdq" as a pdq hash. and then possibly fail due to the input not being in hexadecimal or not being the correct length
Dcallies commented 1 year ago

After thinking about it for a bit, I think it's reasonable to expect that match --hashes and hash will be compatible.

However, definitely want to limit the number of arguments on these commands, especially if there are equivalents in existing bash tools.

--hashes requiring -S has bothered me for a while, I propose we do the following:

  1. Remove the requirement that match --hashes requires -S
  2. --hashes will default to looking for the signal type prefix and will interpret it in order to select the right index
  3. -S with only one option will strip the appropriate prefix (i.e. 'pdq ') if it finds it, and accept a naked hash

This solution provides:

  1. backwards compatibility
  2. requested functionality

But has the downside of potential ambiguity if there is a future signal type which the hash can contain the prefix of its signal type. This is possible with 'raw_text' which a string that started with 'raw_text' would be discovered if using -S with only one option. This seems okay since raw_text is an innocuous string, and has a simple workaround (no -S).

To handle this, we can update the document strings and the error message for missing prefix on hashes.

Dcallies commented 1 year ago

Working with Sam on this as a ramp-up task for the upcoming hackathon!