discord-jda / JDA

Java wrapper for the popular chat & VOIP service: Discord https://discord.com
Apache License 2.0
4.28k stars 735 forks source link

MarkdownSanitizer sanitizes URLs #2378

Open EclipsedSolari opened 1 year ago

EclipsedSolari commented 1 year ago

General Troubleshooting

Expected Behaviour

Discord treats URLs (almost) as literals, even URLs with Markdown characters in them.

Consider the following Google search: https://www.google.com/search?q=__test__

Although the string __test__ is Markdown for underlined text, because it's part of a URL, Discord leaves it as-is.

JDA's MarkdownSanitizer does not make the distinction between URLs and normal text, so JDA escapes this URL: https://www.google.com/search?q=\_\_test\_\_

When trying to send this URL as a Discord message, Discord replaces the backslashes with forward slashes but doesn't apply Markdown rules, including escaping, so the end result is the broken URL https://www.google.com/search?q=/_/_test/_/_.

I expected JDA's Markdown handler to leave URLs as-is.

Code Example for Reproduction Steps

//add this to the EscapeMarkdownTest class in src/test/java/MarkdownTest.java
//all of the assertions except the first fail
@Test
public void testUrl()
{
    Assertions.assertEquals("https://www.google.com/", markdown.compute("https://www.google.com/"));
    Assertions.assertEquals("https://www.google.com/search?q=_test_", markdown.compute("https://www.google.com/search?q=_test_"));
    Assertions.assertEquals("https://www.google.com/search?q=__test__", markdown.compute("https://www.google.com/search?q=__test__"));
    Assertions.assertEquals("https://www.google.com/search?q=*test*", markdown.compute("https://www.google.com/search?q=*test*"));
    Assertions.assertEquals("https://www.google.com/search?q=**test**", markdown.compute("https://www.google.com/search?q=**test**"));
    Assertions.assertEquals("https://www.google.com/search?q=***test***", markdown.compute("https://www.google.com/search?q=***test***"));
    Assertions.assertEquals("https://www.google.com/search?q=~~test~~", markdown.compute("https://www.google.com/search?q=~test~"));
}

Code for JDABuilder or DefaultShardManagerBuilder used

N/A

Exception or Error

N/A
EclipsedSolari commented 1 year ago

I'm willing to take a shot at this issue, but I'd like to discuss the correct way to approach it first.

The current entities in MarkdownSanitizer are represented by simple tokens or by regex. Parsing URLs with regex is an unpleasant problem, and I would probably have to roll my own regex since I'm unsure of licensing (and correctness) implications if I grab one from somewhere else on the internet.

Java does have a URL parser, but its implementation seems like a bad fit for this use case, since I'd have to construct a new URL and potentially catch an exception (MalformedURLException) on every possible text sequence.

I could explore a hybrid approach, with a regex to detect possible leading URLs in a sequence (i.e. in the text sequence https://www.google.com?q=~~test~~ some other text here, each space indicates the end of a potential URL) and feed those potential URLs to java.net.URL for validation.

I'm open to any other ideas.

LaFriska commented 1 year ago

I am currently giving this issue a shot. This is somewhat complicated as previously suggested, but as long as the programme detects and URL, replaces it with a placeholder and changes it back after the algorithm, it should work correctly.

I would be able to make a pull request tomorrow.

LaFriska commented 1 year ago

For the final assertion in that test method, I think you may have made a mistake: the expect result should be the URL with only one layer of "~" wrapped around the word "test" in the url, not two.