dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

UrlEncoding with TextEncoding #23304

Closed Eilon closed 4 years ago

Eilon commented 7 years ago

From @cilerler on August 20, 2017 22:13

I have text that was retrieved from the web and saved to a file with encoding 1252.

The text has URLs in it with some emojis. However, unless I get the result from method 1 below, I am not able to retrieve those pages.

Since we don't have System.Web in dotnet-core, and System.Net.WebUtility has no support for a text encoding,

I wonder if you could provide a simple example of how to achieve this with the new UrlEncoder.

var text = "hello-world ðŸ’Ÿ";

Encoding sourceEncoding = Encoding.GetEncoding(1252);

// method 1 // hello-world+%f0%9f%92%9f // <----- result, I would like to get
System.Web.HttpUtility.UrlEncode(text, sourceEncoding).Dump();

// method 2 // hello-world+%c3%b0%c5%b8%e2%80%99%c5%b8
System.Web.HttpUtility.UrlEncode(text).Dump();                                                              

// method 3 // hello-world+%C3%B0%C5%B8%E2%80%99%C5%B8
System.Net.WebUtility.UrlEncode(text).Dump();                                                               

// method 4 // hello-world%20%C3%B0%C5%B8%E2%80%99%C5%B8
System.Uri.EscapeUriString(text).Dump();  

Thanks in advance 🤗

Copied from original issue: aspnet/HtmlAbstractions#46

svick commented 7 years ago
var text = "hello-world ðŸ’Ÿ";

This is not how you would correctly represent any emoji as a C# string. Also, I don't think you can represent emojis in Windows 1252.

Considering that %f0%9f%92%9f is URL encoded UTF-8 for U+1F49F HEART DECORATION, I'm going to assume that's the emoji you want.

Are you sure the file you're reading is encoded using 1252? If I convert "ðŸ’Ÿ" back to bytes using 1252 and then interpret the bytes as UTF-8, I get the string for U+1F49F HEART DECORATION:

var bytes = Encoding.GetEncoding(1252).GetBytes("ðŸ’Ÿ");
var s = Encoding.UTF8.GetString(bytes).Dump(); // prints "💟"

Using Uri.EscapeUriString() on that will then give the result you're expecting:

System.Uri.EscapeUriString(s).Dump(); // prints "%F0%9F%92%9F"
cilerler commented 7 years ago

This is not how you would correctly represent any emoji as a C# string

Agreed. However, HTML pages arrive encoded, and I have to be able to read and use them... Furthermore, I realized that some of those emojis are missing characters (each should have 4, but some have only 3, etc.).

Anyways, It is really interesting that I picked an emoji that actually resolves under your approach. However it doesn't work for others.

Since I can not update the question itself, here is a new sample that doesn't work.

var text = $"hello-world ðŸ–¤ðŸ’--œðŸ’œðŸ–";

Encoding sourceEncoding = Encoding.GetEncoding(1252);

// method 1 // hello-world+%f0%9f%96%a4%f0%9f%92--%9c%f0%9f%92%9c%f0%9f%96 // <----- result, I would like to get
System.Web.HttpUtility.UrlEncode(text, sourceEncoding).Dump();

// method 2 // hello-world+%c3%b0%c5%b8%e2%80%93%c2%a4%c3%b0%c5%b8%e2%80%99--%c5%93%c3%b0%c5%b8%e2%80%99%c5%93%c3%b0%c5%b8%e2%80%93
System.Web.HttpUtility.UrlEncode(text).Dump();

// method 3 // hello-world+%C3%B0%C5%B8%E2%80%93%C2%A4%C3%B0%C5%B8%E2%80%99--%C5%93%C3%B0%C5%B8%E2%80%99%C5%93%C3%B0%C5%B8%E2%80%93
System.Net.WebUtility.UrlEncode(text).Dump();

// method 4 // hello-world%20%C3%B0%C5%B8%E2%80%93%C2%A4%C3%B0%C5%B8%E2%80%99--%C5%93%C3%B0%C5%B8%E2%80%99%C5%93%C3%B0%C5%B8%E2%80%93
System.Uri.EscapeUriString(text).Dump();

// method5 // hello-world%20%F0%9F%96%A4%EF%BF%BD--%EF%BF%BD%F0%9F%92%9C%EF%BF%BD
System.Uri.EscapeUriString(Encoding.UTF8.GetString(sourceEncoding.GetBytes(text))).Dump();

// method6 // hello-world+%F0%9F%96%A4%EF%BF%BD--%EF%BF%BD%F0%9F%92%9C%EF%BF%BD
System.Net.WebUtility.UrlEncode(Encoding.UTF8.GetString(sourceEncoding.GetBytes(text))).Dump();
svick commented 7 years ago

Anyways, It is really interesting that I picked an emoji that actually resolves under your approach. However it doesn't work for others.

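The examples where it doesn't work aren't valid UTF-8 emoji: going by the method 1 output, one sequence (f0 9f 92 -- 9c) has two hyphens where a continuation byte should be, and the last one (f0 9f 96) stops after three bytes.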

Because of this, Encoding.UTF8.GetString() fails to decode the string and produces U+FFFD REPLACEMENT CHARACTER (UTF-8 encoding 0xEF 0xBF 0xBD).
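To see this replacement behaviour in isolation, here is a minimal sketch. The class and method names (Utf8Probe, DecodeLenient, TryDecodeStrict) are hypothetical helpers made up for this illustration; only Encoding.UTF8 and UTF8Encoding are real .NET APIs. It is meant to be fed byte sequences such as the truncated F0 9F 96 from the sample.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only; not part of .NET.
public static class Utf8Probe
{
    // Encoding.UTF8 substitutes U+FFFD for bytes it cannot decode,
    // so this never throws on malformed input.
    public static string DecodeLenient(byte[] bytes) => Encoding.UTF8.GetString(bytes);

    // A UTF8Encoding created with throwOnInvalidBytes: true reports
    // malformed input instead of silently replacing it.
    public static bool TryDecodeStrict(byte[] bytes, out string text)
    {
        var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                      throwOnInvalidBytes: true);
        try
        {
            text = strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            text = string.Empty;
            return false;
        }
    }
}
```

Once the lenient decoder has substituted U+FFFD, the original bytes are gone, which is why re-encoding the resulting string cannot give back %f0%9f%96.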

So it looks like you have something that cannot be correctly represented as a string. If you need to work with this kind of broken input, you might have to stay away from built-in methods like Uri.EscapeUriString() and write the code to do that yourself.
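As a rough idea of what "writing the code yourself" could look like, the sketch below percent-encodes the raw bytes produced by a caller-supplied Encoding, so the text never has to be valid UTF-8. ByteUrlEncoder is a hypothetical name, and the set of characters left unescaped (plus the space-to-plus rule and lowercase hex) is an assumption chosen to resemble the method 1 output in this thread, not a copy of any framework implementation.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only; not part of .NET.
public static class ByteUrlEncoder
{
    public static string UrlEncode(string text, Encoding encoding)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (encoding == null) throw new ArgumentNullException(nameof(encoding));

        // Work on bytes, not chars: each byte is either kept or escaped.
        byte[] bytes = encoding.GetBytes(text);
        var sb = new StringBuilder(bytes.Length * 3);
        foreach (byte b in bytes)
        {
            char c = (char)b;
            if (IsSafe(c))
                sb.Append(c);
            else if (c == ' ')
                sb.Append('+');                          // form-style space
            else
                sb.Append('%').Append(b.ToString("x2")); // lowercase hex
        }
        return sb.ToString();
    }

    // Assumed safe set: ASCII letters, digits and - _ . ! * ( )
    private static bool IsSafe(char c) =>
        (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9') ||
        c == '-' || c == '_' || c == '.' || c == '!' || c == '*' || c == '(' || c == ')';
}
```

Calling ByteUrlEncoder.UrlEncode(text, Encoding.GetEncoding(1252)) assumes code page 1252 is available; on .NET Core that typically means registering CodePagesEncodingProvider.Instance first.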

In case the input is not supposed to be broken like this, what emoji was it supposed to represent?

cilerler commented 7 years ago

I hear you and I appreciate your effort here. All your findings are absolutely correct.

I shouldn't have mentioned the emoji part; regardless of what the input is, the encoder should be able to encode those characters the way System.Web.HttpUtility.UrlEncode does.

So unless there is a new way to handle this, the question is: why doesn't the new one behave the same way?

karelz commented 7 years ago

Can you achieve the same by changing the encoding first (System.Text.Encoding.Convert), then using the System.Uri.Escape method?

karelz commented 7 years ago

Also, it seems that the method is back in .NET Core 2.0: System.Web.HttpUtility.UrlEncode. That should solve your problems.
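For reference, a minimal sketch of calling that overload on .NET Core might look like the following. LegacyUrlEncodeDemo is a hypothetical wrapper name. The sketch assumes code page 1252 has to be made available through CodePagesEncodingProvider, which, depending on the target framework, may require referencing the System.Text.Encoding.CodePages package.

```csharp
using System;
using System.Text;
using System.Web;

// Hypothetical wrapper for illustration only.
public static class LegacyUrlEncodeDemo
{
    public static string EncodeWith1252(string text)
    {
        // Code pages such as 1252 are not registered by default on .NET Core.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding windows1252 = Encoding.GetEncoding(1252);

        // The Encoding overload turns the string into bytes with the given
        // encoding and percent-encodes those bytes.
        return HttpUtility.UrlEncode(text, windows1252);
    }
}
```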

cilerler commented 7 years ago

@karelz here it is, no luck 🍀

// hello-world%20%C3%B0%C5%B8%E2%80%93%C2%A4%C3%B0%C5%B8%E2%80%99--%C5%93%C3%B0%C5%B8%E2%80%99%C5%93%C3%B0%C5%B8%E2%80%93
var text1252 = sourceEncoding.GetBytes(text);
var utf8Bytes = System.Text.Encoding.Convert(sourceEncoding, Encoding.UTF8, text1252);
var utf8String = Encoding.UTF8.GetString(utf8Bytes);
var escapeUri = System.Uri.EscapeUriString(utf8String);
escapeUri.Dump();
cilerler commented 7 years ago

@karelz as you stated, System.Web.HttpUtility.UrlEncode exists in .NetCore2.0 (also in .NetStandard2.0) and solves my problem. Thank you! 🤗

But I would like to use this opportunity to point out that the outputs of those methods are different.

I mean System.Text.Encodings.Web.UrlEncoder.Default.Encode vs. (System.Web.HttpUtility.UrlEncode and dynamically invoked System.Net.WebUtility.UrlEncode).

And I believe there is a missing implementation somewhere (or maybe just missing documentation that explains how to do it with the new method).
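The difference can be narrowed down to a single character. The sketch below (EncoderComparison is a hypothetical name) runs the same string through the three public APIs mentioned in this thread. The string-only APIs take no Encoding argument and escape the UTF-8 bytes of each character, while the HttpUtility overload escapes the bytes produced by the encoding it is given.

```csharp
using System;
using System.Net;
using System.Text;
using System.Text.Encodings.Web;
using System.Web;

// Hypothetical comparison helper for illustration only.
public static class EncoderComparison
{
    // No Encoding parameter: characters are escaped as UTF-8 bytes.
    public static string WithUrlEncoder(string text) => UrlEncoder.Default.Encode(text);

    // Public string overload: also escapes as UTF-8 bytes.
    public static string WithWebUtility(string text) => WebUtility.UrlEncode(text);

    // Encoding overload: escapes whatever bytes the encoding yields.
    public static string WithHttpUtility(string text, Encoding encoding) =>
        HttpUtility.UrlEncode(text, encoding);
}
```

For the single character ð (U+00F0), the first two calls escape the two UTF-8 bytes C3 B0, whereas the third, given code page 1252, escapes the single byte F0. That matches the method 1 versus method 2/3 outputs earlier in the thread.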

Side notes:

  1. Dynamically invoked System.Net.WebUtility.UrlEncode provides exactly the same output as System.Web.HttpUtility.UrlEncode, but that method is not invocable in .NetCore2.0.

  2. System.Web.HttpUtility.UrlEncode and System.Net.WebUtility.UrlEncode are completely null under .NetCore2.0 (see the screenshots below).

.Net Framework 4.6.1

[screenshot: dotnetframework]

.Net Core 2.0

[screenshot: dotnetstandard]

  3. The code below is what System.Web.HttpUtility.UrlEncode does (for reference).
/// <summary>
/// System.Text.Encodings.Web.UrlEncoder.Create(UnicodeRanges.MiscellaneousSymbols);
/// </summary>
public sealed class HttpUtility
{

    /// <summary>Encodes a URL string using the specified encoding object.</summary>
    /// <param name="str">The text to encode. </param>
    /// <param name="e">The <see cref="T:System.Text.Encoding" /> object that specifies the encoding scheme. </param>
    /// <returns>An encoded string.</returns>
    public static string UrlEncode(string str, Encoding e)
    {
        if (str == null)
            return (string)null;
        return Encoding.ASCII.GetString(HttpUtility.UrlEncodeToBytes(str, e));
    }

    /// <summary>Converts a string into a URL-encoded array of bytes using the specified encoding object.</summary>
    /// <param name="str">The string to encode </param>
    /// <param name="e">The <see cref="T:System.Text.Encoding" /> that specifies the encoding scheme. </param>
    /// <returns>An encoded array of bytes.</returns>
    public static byte[] UrlEncodeToBytes(string str, Encoding e)
    {
        if (str == null)
            return (byte[])null;
        byte[] bytes = e.GetBytes(str);
        return HttpEncoder.Current.UrlEncode(bytes, 0, bytes.Length, false);
    }
}
karelz commented 7 years ago

I am not sure what you mean by "not invokeable". The "null" implementations are just reference assembly code. Not the real implementation code.

Closing as resolved per reply above.

cilerler commented 7 years ago

Thanks for the explanation, makes more sense now.

About "not invocable": I meant that the following code returns null in .NetCore2.0.

MethodInfo method = typeof(System.Net.WebUtility).GetMethod(nameof(System.Net.WebUtility.UrlEncode),
                                                            BindingFlags.Static | BindingFlags.NonPublic,
                                                            null,
                                                            new[]
                                                            {
                                                                typeof(byte[]),
                                                                typeof(int),
                                                                typeof(int)
                                                            },
                                                            null);
davidsh commented 7 years ago

@cilerler I see that you are referring to using Reflection when you say "invocable".

Please keep in mind that using Reflection is not a supported technique in terms of .NET. While it might be a stopgap measure to use Reflection at times, internal changes to .NET Framework or .NET Core may result in a particular Reflection approach no longer working. In general, bug fixes, security fixes, feature changes will affect Reflection. So, it is not a supported thing that remains consistent between versions. Only the public API surface is supported.

cilerler commented 7 years ago

Thank you @davidsh.

Since that method is only available through reflection (and, as you said, that is not the ideal way to handle things) and System.Text.Encodings.Web.UrlEncoder.Default.Encode does not generate the same result as System.Web.HttpUtility.UrlEncode, 🚨 I believe this issue should stay open.

Also, is there a place to see the UrlEncode method(s) of System.Net.WebUtility in .NetCore2.0? I am looking for this line to be exact, but that one is for .NetFramework4.7, and I couldn't find the System.Net.WebUtility source in corefx either.

karelz commented 7 years ago

Calling internal/private methods was never supported anywhere.

I am lost in what is different between what and why it matters. Can you restate the issue? (from scratch, don't link previous comments)

cilerler commented 7 years ago

input

http://not-exist-website.doNotTry/hello-world ðŸ–¤ðŸ’--œðŸ’œðŸ–

outputs

Encoder | Encoded Value
System.Web.HttpUtility.UrlEncode w/ (CodePagesEncodingProvider.Instance.GetEncoding(1252)) | http://not-exist-website.doNotTry/hello-world+%f0%9f%96%a4%f0%9f%92--%9c%f0%9f%92%9c%f0%9f%96/
[DynamicallyInvoked] System.Net.WebUtility.UrlEncode w/ (CodePagesEncodingProvider.Instance.GetEncoding(1252)) | http://not-exist-website.doNotTry/hello-world+%F0%9F%96%A4%F0%9F%92--%9C%F0%9F%92%9C%F0%9F%96/
System.Uri.EscapeUriString | http://not-exist-website.doNotTry/hello-world%20%C3%B0%C5%B8%E2%80%93%C2%A4%C3%B0%C5%B8%E2%80%99--%C5%93%C3%B0%C5%B8%E2%80%99%C5%93%C3%B0%C5%B8%E2%80%93/
System.Text.Encodings.Web.UrlEncoder.Default.Encode | http://not-exist-website.doNotTry/hello-world%20%C3%B0%C5%B8%E2%80%93%C2%A4%C3%B0%C5%B8%E2%80%99--%C5%93%C3%B0%C5%B8%E2%80%99%C5%93%C3%B0%C5%B8%E2%80%93/

expected result

The same result as either of the first two rows, via System.Text.Encodings.Web.UrlEncoder or System.Uri.EscapeUriString.

davidsh commented 7 years ago

cc: @tarekgh