facebook / ThreatExchange

Trust & Safety tools for working together to fight digital harms.
https://developers.facebook.com/docs/threat-exchange

[py-tx] Threatexchange CLI PDQ index doesn't validate hash length #1609

Closed thedanielsun closed 2 months ago

thedanielsun commented 4 months ago

There is a corrupt entry in the StopNCII data: ('pdq', '00000000000000000000000000000000')

While it's probably good for StopNCII to remove this data from their upstream as well, it would be nice for threatexchange CLI to ignore corrupt data when rebuilding indexes.

I hit an error on this line: https://github.com/facebook/ThreatExchange/blob/main/python-threatexchange/threatexchange/signal_type/pdq/pdq_faiss_matcher.py#L240

but there is similar usage on this line as well: https://github.com/facebook/ThreatExchange/blob/main/python-threatexchange/threatexchange/signal_type/pdq/pdq_faiss_matcher.py#L185
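For illustration, here is a minimal sketch of the kind of check being asked for: dropping malformed PDQ entries before they reach the index build. It relies on the fact that a PDQ hash is 256 bits, so its hex form is 64 characters. The names `is_valid_pdq_hex` and `split_valid_pdq` are hypothetical and not part of the threatexchange API, and this is not necessarily how the library resolved the issue.

```python
import string
from typing import Iterable, List, Tuple

# A PDQ hash is 256 bits, which is 64 hexadecimal characters.
PDQ_HEX_LENGTH = 64
_HEX_DIGITS = frozenset(string.hexdigits)

Entry = Tuple[str, str]  # (signal_type_name, signal_value), as in the report above


def is_valid_pdq_hex(value: str) -> bool:
    """Hypothetical helper: True if value looks like a full-length PDQ hex hash."""
    return len(value) == PDQ_HEX_LENGTH and all(c in _HEX_DIGITS for c in value)


def split_valid_pdq(entries: Iterable[Entry]) -> Tuple[List[Entry], List[Entry]]:
    """Hypothetical helper: separate entries into (kept, rejected).

    Only entries tagged "pdq" are validated; other signal types pass through.
    """
    kept: List[Entry] = []
    rejected: List[Entry] = []
    for signal_type, value in entries:
        if signal_type == "pdq" and not is_valid_pdq_hex(value):
            rejected.append((signal_type, value))
        else:
            kept.append((signal_type, value))
    return kept, rejected


if __name__ == "__main__":
    corrupt = ("pdq", "00000000000000000000000000000000")  # the entry reported above
    well_formed = ("pdq", "f" * PDQ_HEX_LENGTH)
    kept, rejected = split_valid_pdq([corrupt, well_formed])
    print(f"kept {len(kept)}, rejected {len(rejected)}")
```

Returning the rejected entries instead of silently discarding them would let the CLI log what it skipped, which makes it easier to report bad data upstream.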

Dcallies commented 4 months ago

Ugh, we need to fix this ASAP. It's an oversight in the signal normalization in PDQ itself. I think I can generate an easy repro using the file storage and just copying the file.

Thanks a ton @thedanielsun, this issue report is very clear and actionable.