Handling of '.' in usernames

bcoles commented 5 years ago

Some services support usernames containing '.'; others do not.

This is problematic for services which make use of a subdomain for profile URLs:

# grep -rn check_uri web_accounts_list.json | grep '//{'
133:         "check_uri" : "http://{account}.blogspot.com",
299:         "check_uri" : "http://{account}.deviantart.com/",
733:         "check_uri" : "http://{account}.insanejournal.com/profile",
866:         "check_uri" : "http://{account}.livejournal.com",
1335:         "check_uri" : "https://{account}.skyrock.com/profil/",
1390:         "check_uri" : "http://{account}.smugmug.com",
1434:         "check_uri" : "http://{account}.soup.io/rss",
1566:         "check_uri" : "http://{account}.tumblr.com",
1766:         "check_uri" : "http://{account}.xanga.com/",

One approach would be to strip every '.' from usernames, but only for these services. Other characters, such as '-' and '_', may also be problematic.

Alternatively, simply skip these services if the username contains problematic characters.
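The skip approach could be sketched roughly as follows in Python. This is only an illustration: the helper names username_in_hostname and should_skip are hypothetical and not part of the scanner, and it assumes the {account} placeholder format shown in the grep output above.

```python
from urllib.parse import urlsplit

# Placeholder used in check_uri values, as seen in the grep output.
PLACEHOLDER = "{account}"


def username_in_hostname(check_uri: str) -> bool:
    """Return True if the {account} placeholder sits in the host part of the URI."""
    return PLACEHOLDER in urlsplit(check_uri).netloc


def should_skip(check_uri: str, username: str) -> bool:
    """Skip subdomain-style checks when the username contains a '.' (hypothetical policy)."""
    return "." in username and username_in_hostname(check_uri)
```

With this, a username such as test.lol would be skipped for http://{account}.tumblr.com but still checked against services that place the username in the URL path.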

bcoles commented 5 years ago

A few examples.

DeviantArt

Using a known good profile name of test as a starting point:

test.lol.deviantart.com results in HTTP 502 error
lol.test.deviantart.com results in HTTP 502 error

Tumblr

Using a known good profile name of test as a starting point:

test.lol.tumblr.com redirects to the profile for lol
lol.test.tumblr.com redirects to the profile for test

SmugMug

Using a known good profile name of wow as a starting point:

wow.lol.smugmug.com redirects to the profile for wow
lol.wow.smugmug.com returns page not found - presumably it attempted to redirect to the profile for lol, which does not exist

Blogspot

Blogspot appears to handle '.' appropriately.

WebBreacher commented 5 years ago

Thanks for noticing this. My thought would be to examine the username for a '.' and, if one is present, skip any entry that places the username in the domain/subdomain. Perhaps we add a "subdomain" = true flag to every entry that places the username in the subdomain. When iterating through the JSON, it'd be easy to read that flag and then take action.
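A rough Python sketch of that iteration, under stated assumptions: the "subdomain" key is the proposed flag and does not exist in the JSON yet, the top-level "sites" list is assumed, and the sample entries (including the example.com one) are purely illustrative.

```python
import json

# Illustrative data only; the "subdomain" flag is the proposed (hypothetical) field.
SAMPLE = """
{"sites": [
  {"name": "Tumblr", "check_uri": "http://{account}.tumblr.com", "subdomain": true},
  {"name": "Example", "check_uri": "https://example.com/{account}", "subdomain": false}
]}
"""


def sites_to_check(data: dict, username: str) -> list:
    """Return the entries to check, skipping subdomain entries for dotted usernames."""
    has_dot = "." in username
    return [
        site for site in data["sites"]
        # Entries without the flag are treated as non-subdomain entries.
        if not (has_dot and site.get("subdomain", False))
    ]


data = json.loads(SAMPLE)
dotted = [site["name"] for site in sites_to_check(data, "test.lol")]
plain = [site["name"] for site in sites_to_check(data, "test")]
```

Here dotted contains only the path-based entry, while plain contains both entries.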

Thoughts?

bcoles commented 5 years ago

That seems like a good approach.

I think there's value in stripping '.' from the username, at the risk of false positives. Unfortunately, this would not be the expected user experience, and it would also introduce inconsistencies between the Python scanner tool and the JSON when consumed by other tools.

There is the alternative of rewriting some of the matches such that the target URL does not make use of subdomains. For example, DeviantArt supports both https://test.deviantart.com/ and https://www.deviantart.com/test (the former redirects to the latter). However, this is a moot point, as DA does not support . in the username.

This does raise another issue: why bother checking a username that cannot exist because it contains known bad characters? Perhaps it would make more sense to have a blacklist of known bad characters (such as '.') for each JSON entry, and skip usernames containing these characters?

This would allow both the Python scanner and tools leveraging the JSON to handle usernames appropriately. For example, if desirable, the Python scanner could offer a --strip-bad-chars option (disabled by default), and external tools could implement the same parsing. This would ensure a consistent user experience.
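As a minimal sketch of how a per-entry blacklist might be applied, assuming the blacklist is stored as a plain string of characters: the function name resolve_username is hypothetical, and its strip parameter merely mirrors the proposed --strip-bad-chars option.

```python
from typing import Optional


def resolve_username(username: str, badchars: str, strip: bool = False) -> Optional[str]:
    """Return the username to check for one entry, or None if the entry should be skipped.

    badchars is the entry's blacklist of characters (hypothetical field).
    strip mirrors the proposed --strip-bad-chars option and is off by default.
    """
    if not any(ch in badchars for ch in username):
        return username  # nothing problematic, check as-is
    if not strip:
        return None  # default behaviour: skip this entry
    stripped = "".join(ch for ch in username if ch not in badchars)
    return stripped or None  # skip if nothing is left after stripping
```

By default resolve_username("test.lol", ".") tells the caller to skip the entry; with strip=True it yields "testlol", which carries the false-positive risk noted above.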

WebBreacher commented 5 years ago

My gut says to keep it simple and just add a flag for "indicator is in the subdomain" and then skip the entry if the username contains a '.'. It'd be simple to implement and maintain... much more so than trying to constantly update character blacklists for each site.

bcoles commented 5 years ago

Agreed that stripping bad characters is probably over-complicating it.

> much more so than trying to constantly update character blacklists for each site.

The blacklists wouldn't have to be constantly updated. It's rare that a website ever changes which characters are permitted in usernames.

Extensibility is an optional added bonus. So, too, is efficiency, as usernames containing bad characters could be skipped.

In the short term, from an implementation perspective, badchars: "." is effectively the same as subdomain: True, and takes no extra effort to implement. Granted, existence of the username in the subdomain also implies other bad characters which are sometimes allowed in usernames (such as [ and ], among others).

I prefer the bad characters approach, for the above reasons, but also because it encodes all the required information in the JSON, allowing consumption by downstream tools and ensuring consistency between these tools and the Python scanner. Using a Boolean for subdomain is not explicit and will result in inconsistencies, as downstream tools will have to figure out what exactly it means, whereas iterating over each character in a string is a fairly straightforward process.

On the other hand, Unicode makes a mess of things. It's likely that usernames containing Unicode characters would need to be treated differently when the username is present in the URL as a subdomain, in which case subdomain: True would be useful.

WebBreacher commented 5 years ago

You make good points and I appreciate the thoughts and discussion.

Thoughts about using both approaches: the subdomain Boolean AND the blacklist? I mean, if we have to add one or the other to every entry already, it is trivial to add them both. Additionally, downstream tools could make their own decisions about which to use.
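One way a downstream tool might combine the two fields is sketched below. The key names "badchars" and "subdomain" follow the names floated in this thread but are not in the JSON yet, the sample entries are illustrative, and skipping non-ASCII usernames for subdomain entries is only an assumed policy (the thread says such usernames would need different treatment, not what that treatment is).

```python
def should_check(entry: dict, username: str) -> bool:
    """Decide whether to check a username against one entry using both proposed fields."""
    badchars = entry.get("badchars", "")
    # Blacklist rule: skip if the username contains any listed character.
    if any(ch in badchars for ch in username):
        return False
    # Assumed policy: skip non-ASCII usernames when they would land in the host name.
    if entry.get("subdomain", False) and not username.isascii():
        return False
    return True


# Illustrative entries; real values would come from the JSON file.
subdomain_entry = {"check_uri": "http://{account}.tumblr.com", "subdomain": True, "badchars": "."}
path_entry = {"check_uri": "https://example.com/{account}", "subdomain": False, "badchars": ""}
```

A tool that only cares about one of the two fields can simply ignore the other, which is the flexibility described above.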

bcoles commented 5 years ago

I foresee no obvious problems with implementing both.

WebBreacher commented 5 years ago

then...let's do that

-------- Original Message -------- On Jun 1, 2019, 09:37, bcoles wrote:

I foresee no obvious problems with implementing both.

WebBreacher commented 4 years ago

Rechecking on this... going to try to make some time to implement this soon.

Zedahkweb commented 2 years ago

A slightly different spin on this - Some sites simply ignore the text after the . for patterns like: url.com/user.name (name is ignored). Example: https://jsfiddle.net/user/jacob.jacob appears to be the same content and user as https://jsfiddle.net/user/jacob (just chose a random name that had an account). Not sure if appropriate to address with this issue, or if I should open a new one for the sites that I've seen doing this? I have some ideas on solutions but don't want to propose if you have already considered this behaviour in an upcoming fix.

WebBreacher commented 2 years ago

@Zedahkweb let's open a new issue for the above please.

WebBreacher commented 2 years ago

Going to close this, as it is now a requirement in #430.

WebBreacher / WhatsMyName

Handling of '.' in usernames #55