Charcoal-SE / SmokeDetector

Headless chatbot that detects spam and posts links to it in chatrooms for quick deletion.
https://metasmoke.erwaysoftware.com
Apache License 2.0

Separate facility for tracking phone numbers #2169

Closed tripleee closed 6 years ago

tripleee commented 6 years ago

We have a fair number of phone numbers in bad_keywords.txt now, and another large batch in watched_keywords.txt. It has been brought up numerous times that this resource should probably be handled separately, much as domain names are now separate from keywords. Phone numbers have a number of pesky properties that make them thorny to handle by regex, so we would probably like to store them in a canonical format and handle the various formatting variations in the code that consults this canonical representation.

tripleee commented 6 years ago

See also discussion around here

https://chat.stackexchange.com/transcript/message/44585771#44585771

Makyen suggests doing a straightforward "this is a phone number" rule and a separate lower-confidence "this might be a phone number" heuristic.

GaurangTandon commented 6 years ago

Hello! I did some regex-fu, specifically, a short JS code:

// Based on the observation that every phone number has a `\d\d\d\d` block in it.
// `content` holds the text of the keywords file being scanned.
var s = "";
content.split(/\s/g).forEach(x => { if (/\d{4}/.test(x) && !/^\d+$/.test(x) && !/com/.test(x)) s += x + "\n"; });

that gave me all the phone numbers present in each of the files watched_keywords.txt and bad_keywords.txt, plus some false positives which I cleared out manually. The result is these two separate pastes of all the phone numbers that have been watched/badded till now:

Spam phone numbers from watched_keywords.txt
Spam phone numbers from bad_keywords.txt

Hopefully that's helpful in easily completing this facility.

tripleee commented 6 years ago

Thanks, but we probably just want to keep the numbers and nothing else in the canonical format if we proceed with this. It's easy enough to extract and normalize them.

.+1310489717 apparently isn't a phone number, but other than that, this seems accurate.

GaurangTandon commented 6 years ago

> .+1310489717 apparently isn't a phone number

Yes, probably, I left it in only because it had 10 digits.

> keep the numbers ... in the canonical format ... It's easy enough to extract and normalize them.

Yeah, that's easy. By normalizing, you mean removing all those hyphens, underscores and regex keywords? Basically, any [^\d] character?

tripleee commented 6 years ago

Yeah, the approach I would take is to reduce them to just digits, then reduce any extracted number sequence to the same format and compare. But whoever ends up working on this might take a different approach, so this is slightly premature. Still, getting a feel for how many phone numbers we actually have is helpful.
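A minimal sketch of that reduce-and-compare idea, assuming that stripping every non-digit character is an acceptable canonical form. The helper name `normalize` is hypothetical and not taken from SmokeDetector:

```python
import re


def normalize(entry: str) -> str:
    """Strip every non-digit character, leaving only the digits."""
    return re.sub(r"\D", "", entry)


# A keyword-list regex and a number as it might appear in a post
listed = normalize(r"3\W*463\W*119\W*7525")
seen = normalize("3-463-119-7525")
print(listed, seen, listed == seen)
```

Note that a regex entry containing a quantifier such as `\d{4}` would leak the `4` into the result, so entries like that would still need manual review.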

teward commented 6 years ago

I hate to ask, but how are we going to handle international numbers? Or do we only care about US-formatted numbers and ignore the rest? The majority of numbers I've seen appear to be US-formatted, but I wasn't 100% sure.

makyen commented 6 years ago

@teward Yes, we need to handle US (actually NANP) and non-US numbers.

I'm working on gathering data for a longer comment here, but the gist of it is that I suggest we have at least two (probably three) detection bins based on how closely the digit sequence follows the format of a valid phone number. This would give us one detection that is ~97%+ certain to be a phone number (i.e. it's formatted like a phone number and is potentially valid, so the author wants people to see it that way), and a bin where it's much less likely, but still possible, that it's a phone number. This would allow weight to be assigned more accurately, rather than both decreasing the weight for a good detection and increasing the weight for a not-so-good detection.

I'll have some numbers a bit later today.

teward commented 6 years ago

@makyen Then we have a bit of a problem.

There's no one all-inclusive Regex that could ever be written to determine if there's a phone number in a post. Such a regex would be extremely complex even if we assume the number being provided is a phone number that is dialed from the USA to an International location.

Even if we leveraged something like the phonenumbers module in PyPI and its search functions, we'd have to iterate over all country codes for a post just to see if it has any phone number in it. That alone would be a massive performance hit.

I don't see there being a solution to this that includes all possible phone number formats in any type of non-performance-nuking way...

makyen commented 6 years ago

I think that we might have different concepts of what could/should be achieved. I'm not intending to indicate that we should detect every single possible phone number format. While a not-entirely-regex version of determination can do a significantly better job than something that's solely regex based, even a regex based version can get 80%-90% there without too much difficulty.

I'm not saying that we need to be perfect, just that we can be better. It's relatively easy to be able to say "this really looks like a phone number" and "this might be a phone number". Then add to those a list of numbers which are watched/blacklisted.
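A rough sketch of how two such bins could be separated. Both patterns are illustrative assumptions (a NANP-shaped strict pattern and a loose "10 to 14 digits with separators" pattern), not the rules Makyen had in mind:

```python
import re

# Assumed strict pattern: NANP-style layout, e.g. 1 (234) 555-1212
STRICT = re.compile(
    r"(?<!\d)(?:\+?1[\s.-]?)?\(?[2-9]\d{2}\)?[\s.-]?[2-9]\d{2}[\s.-]?\d{4}(?!\d)"
)
# Assumed loose pattern: 10 to 14 digits with any non-word separators between them
LOOSE = re.compile(r"(?<!\d)\d(?:[\W_]*\d){9,13}(?!\d)")


def classify(text: str):
    """Return 'likely', 'possible', or None for the given text."""
    if STRICT.search(text):
        return "likely"    # "this really looks like a phone number"
    if LOOSE.search(text):
        return "possible"  # "this might be a phone number"
    return None
```

Each bin could then carry its own weight, and the watched/blacklisted number lists would be checked separately.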

Sure, if we want to be perfect, it's a lot of work, but the phone numbers we're wanting to detect as "this really looks like a phone number" are a relatively small subset of those possible worldwide, due to how spam is targeted.

teward commented 6 years ago

@makyen but that's my point. To be even 80% sure you have to adapt to all potential localized phone numbers. US destination numbers, that's +1AAABBBCCCC where AAA is area code, BBB and CCCC are the other bits. This changes when dialing certain countries, where the numerical formats for phone numbers differ from the US formats.

My point is, how are you going to determine which subsets we target? Sure, we can write a regex that could catch phone numbers, and then run whatever we capture through something like phonenumbers which would be able to actually find the numbers with more accuracy, but that's still missing my point, which is "Which 'subsets' of numbers do we target?"

That's my question really. Most of the numbers I've seen are with US-callers in mind, at least the ones I've seen in Smokey and the spambots I keep an eye on, so obviously US numbers are one subset, but that doesn't address the rest of my inquiry on this.

teward commented 6 years ago

To clarify my last comment, let's take an actual example I pulled out of the list of phone numbers that were pasted here, with a from-the-spam-post example of '044-6565 6523'.

Using the following code:

import pycountry  # from PyPI / pip / pip3
import phonenumbers  # from PyPI / pip / pip3

# Store the list of 2-char country codes
country_codes = [country.alpha_2 for country in pycountry.countries]

# Placeholder: substitute the actual raw content from https://metasmoke.erwaysoftware.com/post/99000
text = "..."

matches = []
for cc in country_codes:
    # Search the text as if it had been written by someone in region `cc`
    for result in phonenumbers.PhoneNumberMatcher(text, cc):
        matches.append((phonenumbers.format_number(result.number, phonenumbers.PhoneNumberFormat.E164), cc))

for item in matches:
    print(item)

... will yield the following matched numbers based on Country Codes (prettified):

('+3584465656523', 'AX')
('+8804465656523', 'BD')
('+554465656523', 'BR')
('+494465656523', 'DE')
('+3584465656523', 'FI')
('+914465656523', 'IN')
('+984465656523', 'IR')
('+824465656523', 'KR')
('+252446565', 'SO')
('+508446565', 'PM')
('+904465656523', 'TR')

As you can see, that number is only recognized as a legitimate phone number under those specific countries' search rules, and it resolves to a different E164 number depending on which country is assumed.

And for yet another example (which matches blacklisted regex of 3\W*463\W*119\W*7525), the detected number is valid for 249 different countries' relative phone number pattern matches translated to E164 format.

Almost regardless of what way we implement this, the point I am trying to make is that until we decide a specific subset of countries' valid phone numbers to try and match against, we can't really do this in any sane way, and there's bound to be numerous country regions or country-valid matches for any given format of number.

One way to do this would be to augment the code I have above to report "Possible phone number (CC)" as the reason, where CC is the country code, and then use that as our "likely a number" match algorithm (that is, if len(matches) >= 1, then flag that reason). I still wouldn't trust this, though, as it would probably have a high false positive rate.

teward commented 6 years ago

Perhaps if I rephrase my question this way:

Which specific subsets of all possible number formats are we actively looking to match? US, I gather, will be one of them. But the point I'm trying to make is: if we want to pick valid number formats to match on, then we also need to know which specific countries' formats we're matching. With that in mind we can narrow the results down to the 'target countries' of the spam, rather than having to deal with the set of all possible numbers.

And yes, I did have far too much time on my hands today at work, which is how I put together this brief rundown of the headaches we face until we have a narrowly defined scope.

tripleee commented 6 years ago

The fake tech support phone numbers are predominantly US, but we also have large subsets of Nigerian (+234), Indian (+91), Chinese (+86), Russian (+7), UK (+44), and other phone numbers.
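To illustrate how such a target subset might be written down, here is a small sketch that maps the calling codes listed above to a region label. The table and the `guess_region` helper are hypothetical, and the lookup only makes sense for numbers that carry an explicit country code:

```python
# Calling codes named in this thread; any real target list would need review.
CALLING_CODES = {
    "1": "NANP",
    "234": "Nigeria",
    "91": "India",
    "86": "China",
    "7": "Russia",
    "44": "UK",
}


def guess_region(e164: str):
    """Return the region label for an E164-style number, or None if untargeted."""
    digits = e164.lstrip("+")
    # Calling codes are 1 to 3 digits long; try the longest prefix first.
    for length in (3, 2, 1):
        region = CALLING_CODES.get(digits[:length])
        if region:
            return region
    return None
```

Numbers written in a national format, such as '044-6565 6523', would still be ambiguous and fall outside this sketch.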

The question of how to find phone numbers is basically solved already; Smoke Detector contains logic to extract apparent phone numbers using a library and some heuristics around it.

Halflife does "the simplest thing that could possibly work" - it extracts any digit sequence of a particular length (IIRC 10-14 digits). There will be some false positives on dates, Unix timestamps, IP addresses etc, but they won't match anything in the blacklist, and preventing the lookup in the blacklist is probably more expensive in terms of CPU cycles and code complexity than living with this small overhead. (Not so in Halflife, where it performs a rather heavy Metasmoke search on each extracted candidate; I'm planning to put in a hedge at least for obvious IP addresses and probably some range of calendar expressions.)
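A sketch of that "simplest thing" extraction, with a hedge for obvious IPv4 addresses. This is not Halflife's actual code; the patterns, the `extract_candidates` name, and the 10-14 digit bounds (which the comment above recalls only from memory) are assumptions:

```python
import re

# A run of digits, optionally separated by spaces, dots, dashes or parentheses
DIGIT_RUN = re.compile(r"\+?\d[\d\s().-]*\d")
# Dotted quad, used to skip obvious IP addresses
IPV4 = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")


def extract_candidates(text, min_len=10, max_len=14):
    """Return digit-only candidates whose length is within the given bounds."""
    candidates = []
    for match in DIGIT_RUN.finditer(text):
        raw = match.group()
        if IPV4.fullmatch(raw):
            continue  # hedge: looks like an IP address
        digits = re.sub(r"\D", "", raw)
        if min_len <= len(digits) <= max_len:
            candidates.append(digits)
    return candidates


sample = "Call +1 (234) 555-1212 from 192.168.100.200 at 1528000000"
print(extract_candidates(sample))
```

The Unix timestamp in the sample still comes through as a candidate, which is the kind of false positive that is harmless as long as it matches nothing in the blacklist.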

tripleee commented 6 years ago

As an aside, the blacklist also contains some chat IDs etc. It might not be a bad thing if we had a facility for saying "this resembles a phone number but isn't really; this one is (say) a QQ number." (I believe QQ is a Chinese chat platform, vaguely similar to ICQ or AIM at least in the numeric ID they too used.)


tripleee commented 6 years ago

A Metasmoke search on many of the regexes where \W* or [\W_]* occurs between digits will return good samples. Some of the older phone regexes are not generalized to accept variations in spacing etc. In addition to fake tech support, look for hacking, witch doctor, Russian oil, Illuminati, counterfeit currency or ATM card etc. spam for samples. Some of these categories are properly tagged in Metasmoke; others will require manual search (please add tags if you do this so the work doesn't have to be repeated).

tripleee commented 6 years ago

We talked in chat about maybe adding a syntactic option to tag things for limiting their scope to a particular set of sites. This is tangential, but might be useful for the proposed phone facility, too. If the tags can contain arbitrary information, not just sites, you could do things like

!!/watch {qq} 1234567890

to add 1234567890 to the watchlist with the tag "qq" to indicate that it's a QQ chat ID ...?

https://chat.stackexchange.com/transcript/message/45107883#45107883

The chat discussion proposes something like

!!/watch {blender,askdifferent} foo

to add foo to the watchlist, but only trigger on Blender.SE and Ask Different. But the syntax could be extended to have e.g. a {sites:blender,askdifferent} variant and a {phone:qq} variant, or whatever.
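A sketch of how such a tagged argument might be parsed, purely as an illustration. The function name, the returned structure, and filing a bare `{a,b}` list under a generic "tags" key are all assumptions; no such parser is described in this thread:

```python
import re

# Optional leading {...} block, followed by the pattern to watch
TAGGED = re.compile(r"^\{(?P<tags>[^{}]*)\}\s*(?P<pattern>.+)$")


def parse_watch_argument(arg: str):
    """Split '{key:v1,v2} pattern' into ({key: [v1, v2]}, pattern)."""
    arg = arg.strip()
    match = TAGGED.match(arg)
    if not match:
        return {}, arg  # no tag block: plain watch
    key, sep, values = match.group("tags").partition(":")
    if not sep:
        # Bare list such as {qq}: no explicit key was given
        key, values = "tags", match.group("tags")
    items = [v.strip() for v in values.split(",") if v.strip()]
    return {key.strip(): items}, match.group("pattern")
```

For example, `parse_watch_argument("{sites:blender,askdifferent} foo")` yields the site list and the pattern `foo` separately, so the caller can decide how to scope the watch.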

Just thinking out loud here, sorry (-:

tripleee commented 6 years ago

I have a phone branch in my own fork of SmokeDetector here which implements a rough cut of what I have been discussing with Makyen and others in this thread. If there is some consensus that we could proceed with this design, I will push it to the main repo so we can coordinate and collaborate on this work.

The code uses a simple regex to extract phone number candidates from a post, then compares the extracted candidates against a list of phone numbers. The list is supposed to be in "original" format, so that we can match e.g. 1 (234) 555-1212 exactly; but if an exact match is not found, it then normalizes both the extracted phone number and the listed number to just digits (12345551212) and tries that as a fallback. (Thanks to Makyen for this idea.) This means we can't use regex, but we can match numbers precisely with high confidence, and still have a slightly weaker detection which might have some FPs.
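The two-step lookup described here could look roughly like the sketch below. It is not the code from the phone branch; the class and method names are hypothetical:

```python
import re


def digits_only(number: str) -> str:
    """Reduce a phone number to just its digits."""
    return re.sub(r"\D", "", number)


class PhoneList:
    """Listed numbers kept in original format, plus a digits-only index."""

    def __init__(self, numbers):
        self.exact = set(numbers)
        self.normalized = {digits_only(n) for n in numbers}

    def check(self, candidate: str):
        if candidate in self.exact:
            return "exact"        # high-confidence match
        if digits_only(candidate) in self.normalized:
            return "normalized"   # weaker match, might have some FPs
        return None


blacklist = PhoneList(["1 (234) 555-1212"])
print(blacklist.check("1 (234) 555-1212"), blacklist.check("1-234-555-1212"))
```

Returning which step matched lets the caller assign a different reason or weight to the exact and the normalized detection.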

The first commit contains just a single blacklisted number and a single watched number. If we want to proceed with this, blacklisted_phones.txt and watched_phones.txt should be populated with phone numbers from posts (not regexes!) which is a significant undertaking.

There is also no provision for updating the lists from chat yet; if we like this idea, adding those code paths is a minor undertaking.

Finally, I think this should get a substantial test suite of its own, probably as the first step before attempting to start a migration of phone numbers.

AWegnerGitHub commented 6 years ago

Knowing nothing about the speed of the following regex, I found one by Zapier, which describes it as follows:

Works with all standard phone numbers, including country and area codes for most international numbers. Anything from +65 800 123 4567 ext.405 to 02-201-1222 to 865.101.1000 and more should work.

(?:(?:\+?([1-9]|[0-9][0-9]|[0-9][0-9][0-9])\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([0-9][1-9]|[0-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?


It might be worth at least testing with. Perhaps it can help knock out some false positives?

tripleee commented 6 years ago

https://github.com/Charcoal-SE/SmokeDetector/commits/phone contains work by @iBug to add chat commands to the basic facility I drafted in the fork discussed above.

It looks good on the whole, but in order to get it into production, we should take a sprint to migrate many phone numbers to the dedicated numbers blacklist and watchlist.