Closed bcoles closed 2 years ago
A few examples.
Deviantart
Using a known good profile name of test as a starting point:
test.lol.deviantart.com results in an HTTP 502 error
lol.test.deviantart.com results in an HTTP 502 error
Tumblr
Using a known good profile name of test as a starting point:
test.lol.tumblr.com redirects to the profile for lol
lol.test.tumblr.com redirects to the profile for test
Smugmug
Using a known good profile name of wow as a starting point:
wow.lol.smugmug.com redirects to the profile for wow
lol.wow.smugmug.com returns page not found - presumably it attempted to redirect to the profile for lol, which does not exist
Blogspot
Blogspot appears to handle . appropriately.
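To make the failure mode concrete, here is a minimal sketch of what happens when a dotted username is substituted into a subdomain-style URL. The template strings and the {account} placeholder are assumptions for illustration, not taken from the project's JSON.

```python
from urllib.parse import urlsplit

# Hypothetical URL templates; "{account}" is an assumed placeholder name.
TEMPLATES = {
    "Tumblr": "https://{account}.tumblr.com",
    "Smugmug": "https://{account}.smugmug.com",
}


def extra_labels(template: str, username: str) -> int:
    """Return how many extra DNS labels the username adds to the hostname."""
    plain = urlsplit(template.format(account="x")).hostname
    actual = urlsplit(template.format(account=username)).hostname
    return actual.count(".") - plain.count(".")


for site, template in TEMPLATES.items():
    url = template.format(account="test.lol")
    # A non-zero count means the username no longer maps to a single subdomain.
    print(site, url, extra_labels(template, "test.lol"))
```

Each site is then free to pick whichever label it likes as the profile name, which would explain why the services above disagree.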
Thanks for noticing this. My thought would be to examine the username for a . and, if it is there, skip any entry that has the username in the domain/subdomain. Perhaps we add a "subdomain" = true flag to all entries that place the username in the subdomain. When iterating through the JSON, it'd be easy to pull that flag and then take action.
Thoughts?
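A rough sketch of the flag idea, assuming each entry is a dict with a boolean "subdomain" key. The entries, URLs, and function name below are hypothetical and the real JSON schema may differ.

```python
from typing import Iterator

# Hypothetical entries; only the "subdomain" flag comes from the proposal above.
SITES = [
    {"name": "ExampleSub", "url": "https://{account}.example.com", "subdomain": True},
    {"name": "ExamplePath", "url": "https://example.com/{account}", "subdomain": False},
]


def sites_to_check(username: str, sites: list) -> Iterator[dict]:
    """Yield entries worth checking, skipping subdomain sites for dotted names."""
    for site in sites:
        if site.get("subdomain", False) and "." in username:
            continue  # the username would turn into extra DNS labels
        yield site


for site in sites_to_check("test.lol", SITES):
    print(site["name"], site["url"].format(account="test.lol"))
```

Entries without the flag are treated as not using a subdomain, so existing JSON would keep working while the flag is rolled out.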
That seems like a good approach.
I think there's value in stripping . from the username, at the risk of false positives. Unfortunately, this would not be the expected user experience, as it would also introduce inconsistencies between the python scanner tool and the JSON when consumed by other tools.
There is the alternative of rewriting some of the matches such that the target URL does not make use of subdomains. For example, DeviantArt supports both https://test.deviantart.com/ and https://www.deviantart.com/test (the former redirects to the latter). However, this is a moot point, as DA does not support . in the username.
This does raise another issue: why bother checking for usernames if the username will not exist due to the presence of known bad characters? Perhaps it would make more sense to have a blacklist of known bad characters (such as .) for each JSON entry, and skip usernames containing these characters?
This would allow both the python scanner and tools leveraging the JSON to handle usernames appropriately. For example, if desirable, the python scanner could offer a --strip-bad-chars option (disabled by default), and external tools could implement the same parsing. This would ensure a consistent user experience.
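As a sketch of how the per-entry blacklist and the optional stripping could fit together: the "badchars" key and the resolve_username helper are assumptions, and --strip-bad-chars is only the option floated above, not an existing flag.

```python
import argparse

# Hypothetical entry; "badchars" lists characters the site is assumed to reject.
SITE = {
    "name": "Deviantart",
    "url": "https://www.deviantart.com/{account}",
    "badchars": ".",
}


def resolve_username(username, site, strip_bad_chars=False):
    """Return the username to check for this site, or None to skip the site."""
    bad = site.get("badchars", "")
    if not any(ch in bad for ch in username):
        return username
    if strip_bad_chars:
        stripped = "".join(ch for ch in username if ch not in bad)
        return stripped or None  # nothing left to check
    return None  # default behaviour: skip the site


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("username")
    parser.add_argument("--strip-bad-chars", action="store_true",
                        help="strip per-site bad characters instead of skipping")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["test.lol", "--strip-bad-chars"])
print(resolve_username(args.username, SITE, args.strip_bad_chars))
```

Because the decision only needs the username and the entry, a downstream tool reading the same JSON could reproduce it without importing the scanner.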
My gut says to keep it simple and just add a flag for "indicator is in the subdomain" and then skip the entry if the username contains a . character. It'd be simple to implement and maintain... much more so than trying to constantly update character blacklists for each site.
Agreed that stripping bad characters is probably over-complicating it.
> much more so than trying to constantly update character blacklists for each site.
The blacklists wouldn't have to be constantly updated. It's rare that a website ever changes which characters are permitted in usernames.
Extensibility is an optional added bonus. So, too, is efficiency, as usernames containing bad characters could be skipped.
In the short term, from an implementation perspective, badchars: "." is effectively the same as subdomain: True, and takes no extra effort to implement. Granted, existence of the username in the subdomain also implies other bad characters which are sometimes allowed in usernames (such as [ and ], among others).
I prefer the bad characters approach, for the above reasons, but also as it encodes all the required information in the JSON, allowing consumption by downstream tools, and ensuring consistency between these tools and the python scanner. Using a Boolean for subdomain is not explicit, and will result in inconsistencies, as downstream tools will have to figure out what exactly that means, whereas iterating over each character in a string is a fairly straightforward process.
On the other hand, Unicode makes a mess of things. It's likely that usernames containing Unicode would need to be treated differently in instances where the username is present in the URL as a subdomain, in which case subdomain: True would be useful.
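To illustrate why Unicode behaves differently in a subdomain than in a path, here is a small sketch using Python's built-in idna codec and urllib.parse.quote. The helper names and the example.com URLs are assumptions, and the stdlib codec follows the older IDNA 2003 rules, so real sites may encode some names differently.

```python
from urllib.parse import quote


def username_as_subdomain(username: str) -> str:
    """Encode a username for use in a hostname using the stdlib idna codec."""
    return username.encode("idna").decode("ascii")


def username_as_path(username: str) -> str:
    """Percent-encode a username for use in a URL path segment."""
    return quote(username, safe="")


name = "bücher"
# In a hostname the label is Punycode-encoded; in a path it is percent-encoded.
print(f"https://{username_as_subdomain(name)}.example.com/")
print(f"https://example.com/{username_as_path(name)}")
```

A consumer that knows an entry places the username in the subdomain can pick the right encoding, which a character blacklist alone would not tell it.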
You make good points and I appreciate the thoughts and discussion.
Thoughts about using both approaches: the subdomain Boolean AND the blacklist? I mean, if we have to add one or the other to every entry already, it is trivial to add them both. Additionally, downstream tools could make their own decisions about which to use.
I foresee no obvious problems with implementing both.
then...let's do that
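A minimal sketch of what carrying both fields might look like, with a consumer choosing which one to honour. The JSON below is hypothetical; the field names "subdomain" and "badchars" follow the discussion, but the final schema is an assumption.

```python
import json

# Hypothetical JSON entries carrying both proposed fields.
RAW = """
[
  {"name": "ExampleSub", "url": "https://{account}.example.com",
   "subdomain": true, "badchars": "."},
  {"name": "ExamplePath", "url": "https://example.com/{account}",
   "subdomain": false, "badchars": ""}
]
"""


def should_skip(username, site, use_badchars=True):
    """Decide whether to skip a site, using whichever field the consumer prefers."""
    if use_badchars:
        return any(ch in site.get("badchars", "") for ch in username)
    return bool(site.get("subdomain")) and "." in username


sites = json.loads(RAW)
for site in sites:
    if should_skip("test.lol", site):
        continue
    print(site["url"].format(account="test.lol"))
```

For a username containing only a . both strategies agree; they would diverge for characters such as [ or ], which only the blacklist can express.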
-------- Original Message -------- On Jun 1, 2019, 09:37, bcoles wrote:
I foresee no obvious problems with implementing both.
Rechecking on this... going to try to make some time to implement this soon.
A slightly different spin on this - Some sites simply ignore the text after the . for patterns like: url.com/user.name (name is ignored). Example: https://jsfiddle.net/user/jacob.jacob appears to be the same content and user as https://jsfiddle.net/user/jacob (just chose a random name that had an account). Not sure if appropriate to address with this issue, or if I should open a new one for the sites that I've seen doing this? I have some ideas on solutions but don't want to propose if you have already considered this behaviour in an upcoming fix.
@Zedahkweb let's open a new issue for the above please.
Going to close this, as it is now a requirement in #430.
Some services support usernames containing . and others do not.
This is problematic for services which make use of a subdomain for profile URLs:
One approach would be to strip all . from usernames only for these services. Other characters such as - and _ may also be problematic.
Alternatively, another approach would be to simply skip these services if the username contains problematic characters.