ThioJoe / YT-Spammer-Purge

Allows you to easily scan for and delete scam comments using several methods.
GNU General Public License v3.0
4.56k stars 390 forks

Filtering: New type of hiding scam numbers #393

Closed Firecul closed 2 years ago

Firecul commented 2 years ago

Filter Mode

Auto-Smart Mode

Select the Problem

A type of spammer is not detected at all

(Optional) If 'Other', Enter Very Short Description

Scam phone numbers split across several comments can't currently be detected

Spammer Example / Sample

You can find the comments as replies here: https://www.youtube.com/watch?v=RQKQNxLeq1c&lc=UgwwK9EwblTpbPSLrWl4AaABAg

Video / Post Link

https://www.youtube.com/watch?v=RQKQNxLeq1c&lc=UgwwK9EwblTpbPSLrWl4AaABAg

(Optional) Additional Info / Context

No response

Rairye commented 2 years ago

@Firecul Is there a mode that looks at comment frequency/intervals? I would assume that someone trying to obfuscate any text (not just phone numbers) by splitting it into separate lines would leave all the comments within a certain timeframe.

Firecul commented 2 years ago

> Is there a mode that looks at comment frequency/intervals?

No, there isn't. That might work, but I'm not sure what data is available through the API. It would probably need more processing than normal, though, to keep track of comment times when possibly sorting through tens of thousands of comments.

ThioJoe commented 2 years ago

The YouTube API actually does provide the time a comment was posted down to the second, but I'm not sure if it would be worth doing

Rairye commented 2 years ago

@ThioJoe @Firecul Yeah, it would require some extra work, but there is a relatively easy solution.

You would have to iterate through the comments and create an inverted index by author (basically, a dictionary where the keys are the name of the author and the values are a list of the indices of the comments made by that author, such as {"author1" : [2, 3], "author2" : [5] }). Then, iterate through the inverted index to find authors who have at least n comments and then calculate the frequency/intervals for those comments only.
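A minimal sketch of that inverted-index idea, under some assumptions: each comment is a hypothetical `(author, timestamp, text)` tuple, timestamps are ISO 8601 strings, and the thresholds `min_comments` and `max_gap` are illustrative values, not anything taken from the project.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical sample data: (author, ISO 8601 timestamp, text).
comments = [
    ("author1", "2022-06-01T12:00:00Z", "Nice video"),
    ("author2", "2022-06-01T12:00:05Z", "+1 2"),
    ("author2", "2022-06-01T12:00:20Z", "34 5"),
    ("author2", "2022-06-01T12:00:40Z", "67 890"),
    ("author3", "2022-06-01T13:00:00Z", "First"),
    ("author3", "2022-06-02T09:00:00Z", "Still great"),
]


def parse_time(stamp):
    # Older Python versions do not accept a trailing "Z" in fromisoformat().
    return datetime.fromisoformat(stamp.replace("Z", "+00:00"))


def build_author_index(comments):
    """Map each author to the list of indices of their comments."""
    index = defaultdict(list)
    for i, (author, _, _) in enumerate(comments):
        index[author].append(i)
    return index


def find_burst_authors(comments, min_comments=3, max_gap=timedelta(minutes=2)):
    """Return authors with at least min_comments comments, all posted close together."""
    flagged = []
    for author, positions in build_author_index(comments).items():
        if len(positions) < min_comments:
            continue  # Skip the interval check for low-volume authors.
        times = sorted(parse_time(comments[i][1]) for i in positions)
        gaps = [later - earlier for earlier, later in zip(times, times[1:])]
        if all(gap <= max_gap for gap in gaps):
            flagged.append(author)
    return flagged


print(find_burst_authors(comments))
```

Only authors that pass the `min_comments` threshold get their timestamps parsed, which is what keeps the extra work small on large comment sets.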

The only other solutions that come to mind would be to: (1) sort comments by author and chronologically and then join them into a single string before running it through the filter, or (2) check if a comment is a fragment of a telephone number by using something like is_number_fraction = not any(char.isalpha() for char in comment).
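Both options above can be sketched roughly as follows. This assumes comments arrive as hypothetical `(author, timestamp, text)` tuples whose timestamps sort correctly as plain values; `join_comments_by_author` is an illustrative name, not a function from the project.

```python
from collections import defaultdict


def join_comments_by_author(comments):
    """Option (1): sort by author, then by time, and join each author's text."""
    joined = defaultdict(list)
    for author, _, text in sorted(comments, key=lambda c: (c[0], c[1])):
        joined[author].append(text)
    return {author: " ".join(parts) for author, parts in joined.items()}


def is_number_fraction(comment):
    """Option (2): True when the comment contains no alphabetic characters.

    Note that this is also True for empty or emoji-only comments.
    """
    return not any(char.isalpha() for char in comment)


sample = [
    ("b", "2022-06-01T12:00:20Z", "34"),
    ("a", "2022-06-01T12:00:00Z", "hi"),
    ("b", "2022-06-01T12:00:05Z", "+12"),
]
print(join_comments_by_author(sample))
print(is_number_fraction("+1 (2"), is_number_fraction("call me"))
```

The joined string per author could then be passed through whatever filter is already in use, at the cost of a full sort over all comments.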

ThioJoe commented 2 years ago

Adding all those checks would probably increase the processing time and slow down scanning considerably, I think.

Rairye commented 2 years ago

Yeah, I mean, I don't think (1) and (2) are viable solutions as (1) would take too long and (2) would return False in almost all cases. The reason I propose the solution using the inverted index is that it would detect more spammy behavior than just the example shown here. Obviously, if that overcomplicates things, users can still find this type of spam using regex.

RacerDelux commented 2 years ago

I would just look for a + or two numbers. If true, check to see if the next comment by the user has just numbers. This could help find the spam comment, but have minimal impact on scanning.
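One way to read that suggestion, as a rough sketch: treat "a + or two numbers" as "contains a plus sign or at least two digits", and "just numbers" as "only digits and whitespace". Both readings are assumptions, as are the function names and the `(author, text)` tuple layout.

```python
def looks_like_number_start(text):
    """True if the text contains a '+' or at least two digit characters."""
    return "+" in text or sum(ch.isdigit() for ch in text) >= 2


def is_only_numbers(text):
    """True if the text is non-empty and made up of digits once whitespace is removed."""
    stripped = "".join(text.split())
    return stripped != "" and stripped.isdigit()


def flag_split_numbers(comments):
    """comments: list of (author, text) in posting order.

    Returns (previous_index, current_index) pairs where an author's previous
    comment looks like the start of a number and the current one is digits only.
    """
    last_seen = {}  # author -> index of that author's most recent comment
    flagged = []
    for i, (author, text) in enumerate(comments):
        prev = last_seen.get(author)
        if (
            prev is not None
            and looks_like_number_start(comments[prev][1])
            and is_only_numbers(text)
        ):
            flagged.append((prev, i))
        last_seen[author] = i
    return flagged


thread = [
    ("a", "+1 2"),
    ("b", "great video"),
    ("a", "345"),
    ("a", "678"),
    ("b", "100"),
]
print(flag_split_numbers(thread))
```

This needs only a single pass and one dictionary entry per author, so the impact on scan time should stay small.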

Rairye commented 2 years ago

@RacerDelux Yes, something like (2) I posted above would be able to look for both those patterns. It would also work if there are spaces, punctuation marks, emojis, etc. between numbers. (Would also detect 2 or more numbers and work even if the numbers are written like ⑧, ⁸, 8, etc.)

Rairye commented 2 years ago

@Firecul Are you using regex to search for telephone numbers? I wonder if using a pattern of "+" + country code would work. (Assuming that all telephone numbers contain a country code.)
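A hedged sketch of what such a pattern might look like. It assumes a country code is written as "+" followed by one to three digits, with the first digit non-zero and at most one space after the plus sign; the pattern and function name are illustrative and not taken from the project's filters.

```python
import re

# Assumption: "+", an optional space, then a 1-3 digit code that does not start with 0.
COUNTRY_CODE_PATTERN = re.compile(r"\+\s?[1-9]\d{0,2}")


def find_country_code_prefix(text):
    """Return the first '+<code>' style match in the text, or None."""
    match = COUNTRY_CODE_PATTERN.search(text)
    return match.group() if match else None


print(find_country_code_prefix("WhatsApp +44 7700 900000"))
print(find_country_code_prefix("call 1234"))
```

A pattern this broad would also match ordinary text such as "1+1", so it would likely need to be combined with other signals rather than used alone.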

Firecul commented 2 years ago

@Rairye I didn't use regex; I just came across it while finding other spam. If they are smart they will use country codes, since otherwise it greatly restricts their potential victims. They need to make it as simple as possible to fall for.

Rairye commented 2 years ago

@Firecul Ah, alright. I just read through the code of that mode and it looks like it uses regex dictionaries. If scammers almost always include the country code, then maybe that pattern could be added to the mode. (Edit: Or manually add country codes as spam words.)

ThioJoe commented 2 years ago

This is basically covered by the detection of spam threads as a whole in 2.15, so gonna close this unless it becomes necessary again