Eilon closed this issue 4 years ago
var text = "hello-world ðŸ’Ÿ";
This is not how you would correctly represent any emoji as a C# string. Also, I don't think you can represent emojis in Windows-1252.
Considering that %f0%9f%92%9f is URL-encoded UTF-8 for U+1F49F HEART DECORATION, I'm going to assume that's the emoji you want.
Are you sure the file you're reading is encoded using 1252? If I convert "ðŸ’Ÿ" back to bytes using 1252 and then interpret the bytes as UTF-8, I get the string for U+1F49F HEART DECORATION:
var bytes = Encoding.GetEncoding(1252).GetBytes("ðŸ’Ÿ");
var s = Encoding.UTF8.GetString(bytes).Dump(); // prints "💟"
Using Uri.EscapeUriString() on that will then give the result you're expecting:
System.Uri.EscapeUriString(s).Dump(); // prints "%F0%9F%92%9F"
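For anyone trying this outside LINQPad, here is a minimal self-contained sketch of the same round trip. The class and method names (`MojibakeRepair`, `Repair`) are made up for illustration. It assumes a runtime where `CodePagesEncodingProvider` is available (on .NET Core 2.0 that means referencing the System.Text.Encoding.CodePages package), and it assumes the input really is UTF-8 that was mis-decoded as Windows-1252.

```csharp
using System;
using System.Text;

public static class MojibakeRepair
{
    // Reverses a "UTF-8 bytes decoded as Windows-1252" mix-up.
    // Hypothetical helper, not part of any .NET API.
    public static string Repair(string garbled)
    {
        if (garbled == null) throw new ArgumentNullException(nameof(garbled));

        // Code page 1252 is not built into .NET Core; ask the code pages provider for it.
        Encoding windows1252 = CodePagesEncodingProvider.Instance.GetEncoding(1252)
            ?? throw new NotSupportedException("Code page 1252 is not available.");

        // Map each garbled character back to the single byte it came from...
        byte[] originalBytes = windows1252.GetBytes(garbled);

        // ...and decode those bytes as the UTF-8 they originally were.
        return Encoding.UTF8.GetString(originalBytes);
    }
}
```

With the sample above, `Uri.EscapeDataString(MojibakeRepair.Repair("\u00F0\u0178\u2019\u0178"))` should give `%F0%9F%92%9F`. I used `Uri.EscapeDataString` here because `Uri.EscapeUriString` is marked obsolete in newer .NET versions.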
> This is not how you would correctly represent any emoji as a C# string
Agreed. However, the HTML files are encoded this way and I have to be able to read and use them... Furthermore, I realized that some of those emojis are missing characters (each should have 4, but some have only 3, etc.).
Anyway, it is really interesting that I picked an emoji that actually resolves under your approach. However, it doesn't work for others.
Since I cannot update the question itself, here is a new sample that doesn't work.
var text = $"hello-world ðŸ–¤ðŸ’--œðŸ’œðŸ–";
Encoding sourceEncoding = Encoding.GetEncoding(1252);
// method 1 // hello-world+%f0%9f%96%a4%f0%9f%92--%9c%f0%9f%92%9c%f0%9f%96 // <----- the result I would like to get
System.Web.HttpUtility.UrlEncode(text, sourceEncoding).Dump();
// method 2 // hello-world+%c3%b0%c5%b8%e2%80%93%c2%a4%c3%b0%c5%b8%e2%80%99--%c5%93%c3%b0%c5%b8%e2%80%99%c5%93%c3%b0%c5%b8%e2%80%93
System.Web.HttpUtility.UrlEncode(text).Dump();
// method 3 // hello-world+%C3%B0%C5%B8%E2%80%93%C2%A4%C3%B0%C5%B8%E2%80%99--%C5%93%C3%B0%C5%B8%E2%80%99%C5%93%C3%B0%C5%B8%E2%80%93
System.Net.WebUtility.UrlEncode(text).Dump();
// method 4 // hello-world%20%C3%B0%C5%B8%E2%80%93%C2%A4%C3%B0%C5%B8%E2%80%99--%C5%93%C3%B0%C5%B8%E2%80%99%C5%93%C3%B0%C5%B8%E2%80%93
System.Uri.EscapeUriString(text).Dump();
// method 5 // hello-world%20%F0%9F%96%A4%EF%BF%BD--%EF%BF%BD%F0%9F%92%9C%EF%BF%BD
System.Uri.EscapeUriString(Encoding.UTF8.GetString(sourceEncoding.GetBytes(text))).Dump();
// method 6 // hello-world+%F0%9F%96%A4%EF%BF%BD--%EF%BF%BD%F0%9F%92%9C%EF%BF%BD
System.Net.WebUtility.UrlEncode(Encoding.UTF8.GetString(sourceEncoding.GetBytes(text))).Dump();
> Anyways, It is really interesting that I picked an emoji that actually resolves under your approach. However it doesn't work for others.
The examples where it doesn't work aren't valid UTF-8 emoji:
- %f0%9f%96%a4 is correctly encoded U+1F5A4 BLACK HEART.
- %f0%9f%92- is not valid UTF-8: F0 indicates that the whole code point will be represented using 4 bytes, but - is not a valid continuation byte.
Because of this, Encoding.UTF8.GetString() fails to decode the string and produces U+FFFD REPLACEMENT CHARACTER (UTF-8 encoding 0xEF 0xBF 0xBD).
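To check this kind of thing programmatically, a strict decoder can be used instead of the default lenient one. The sketch below is only an illustration (the `Utf8Check` class and `IsValidUtf8` method are hypothetical names). It relies on the `UTF8Encoding` constructor flag that makes decoding throw on invalid byte sequences rather than substituting U+FFFD.

```csharp
using System.Text;

public static class Utf8Check
{
    // A UTF-8 decoder that throws on malformed input instead of emitting U+FFFD.
    private static readonly UTF8Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Returns true only if the whole buffer is well-formed UTF-8.
    public static bool IsValidUtf8(byte[] bytes)
    {
        try
        {
            StrictUtf8.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            // Raised for cases such as a 4-byte lead (0xF0) followed by a non-continuation byte.
            return false;
        }
    }
}
```

For example, `IsValidUtf8(new byte[] { 0xF0, 0x9F, 0x96, 0xA4 })` is true, while `IsValidUtf8(new byte[] { 0xF0, 0x9F, 0x92, 0x2D })` is false because 0x2D (`-`) is not a continuation byte.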
So it looks like you have something that cannot be correctly represented as a string. If you need to work with this kind of broken input, you might have to stay away from built-in methods like Uri.EscapeUriString() and write the code to do that yourself.
In case the input is not supposed to be broken like this, what emoji was it supposed to represent?
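If it does come to writing it by hand, one possible shape is to percent-encode the raw bytes of the source encoding, so the string never has to be valid UTF-8. This is a rough sketch under my own assumptions, not an existing API: the `BytePercentEncoder` name is invented, the set of characters left unescaped is the RFC 3986 "unreserved" set (which is not the same set `HttpUtility.UrlEncode` uses), and it only makes sense for ASCII-compatible single-byte encodings such as Windows-1252.

```csharp
using System;
using System.Text;

public static class BytePercentEncoder
{
    // Percent-encodes the bytes that `encoding` produces for `text`.
    // Hypothetical helper; characters the encoding cannot represent
    // become whatever its fallback emits (usually '?', i.e. %3F).
    public static string Encode(string text, Encoding encoding)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (encoding == null) throw new ArgumentNullException(nameof(encoding));

        var result = new StringBuilder();
        foreach (byte b in encoding.GetBytes(text))
        {
            if (IsUnreserved(b))
            {
                result.Append((char)b);
            }
            else
            {
                // Two uppercase hex digits per escaped byte, e.g. 0xF0 -> "%F0".
                result.Append('%').Append(b.ToString("X2"));
            }
        }
        return result.ToString();
    }

    // RFC 3986 unreserved characters: ALPHA / DIGIT / "-" / "." / "_" / "~".
    private static bool IsUnreserved(byte b) =>
        (b >= (byte)'A' && b <= (byte)'Z') ||
        (b >= (byte)'a' && b <= (byte)'z') ||
        (b >= (byte)'0' && b <= (byte)'9') ||
        b == (byte)'-' || b == (byte)'.' || b == (byte)'_' || b == (byte)'~';
}
```

Note that this sketch writes a space as `%20` rather than `+`, and uses uppercase hex. Percent-encoded hex digits are case-insensitive, so a server should treat `%f0` and `%F0` alike.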
I hear you and I appreciate your effort here. All your findings are absolutely correct.
I shouldn't have mentioned the emoji part. Regardless of what the input is, the encoder should be able to encode those characters the way System.Web.HttpUtility.UrlEncode does.
So unless there is a new way to handle this, the question is: why doesn't the new one behave the same way?
Can you achieve the same by changing the encoding first (System.Text.Encoding.Convert), then using the System.Uri.Escape method?
Also, it seems that the method is back in .NET Core 2.0: System.Web.HttpUtility.UrlEncode
That should solve your problems.
@karelz here it is, no luck 🍀
// hello-world%20%C3%B0%C5%B8%E2%80%93%C2%A4%C3%B0%C5%B8%E2%80%99--%C5%93%C3%B0%C5%B8%E2%80%99%C5%93%C3%B0%C5%B8%E2%80%93
var text1252 = sourceEncoding.GetBytes(text);
var utf8Bytes = System.Text.Encoding.Convert(sourceEncoding, Encoding.UTF8, text1252);
var utf8String = Encoding.UTF8.GetString(utf8Bytes);
var escapeUri = System.Uri.EscapeUriString(utf8String);
escapeUri.Dump();
@karelz as you stated, System.Web.HttpUtility.UrlEncode exists in .NetCore2.0 (also in .NetStandard2.0) and solves my problem. Thank you! 🤗
But I would like to use this opportunity to point out that the outputs of those methods are different.
I mean System.Text.Encodings.Web.UrlEncoder.Default.Encode vs. (System.Web.HttpUtility.UrlEncode and the dynamically invoked System.Net.WebUtility.UrlEncode).
And I believe there is a missing implementation somewhere (or maybe just missing documentation that explains how to do it with the new method).
Side notes:
- The dynamically invoked System.Net.WebUtility.UrlEncode provides exactly the same output as System.Web.HttpUtility.UrlEncode, but that method is not invocable in .NetCore2.0.
- System.Web.HttpUtility.UrlEncode and System.Net.WebUtility.UrlEncode are completely null under .NetCore2.0 (you may find the screenshots below).
For reference, this is what System.Web.HttpUtility.UrlEncode does:
/// <summary>
/// System.Text.Encodings.Web.UrlEncoder.Create(UnicodeRanges.MiscellaneousSymbols);
/// </summary>
public sealed class HttpUtility
{
    /// <summary>Encodes a URL string using the specified encoding object.</summary>
    /// <param name="str">The text to encode. </param>
    /// <param name="e">The <see cref="T:System.Text.Encoding" /> object that specifies the encoding scheme. </param>
    /// <returns>An encoded string.</returns>
    public static string UrlEncode(string str, Encoding e)
    {
        if (str == null)
            return (string)null;
        return Encoding.ASCII.GetString(HttpUtility.UrlEncodeToBytes(str, e));
    }

    /// <summary>Converts a string into a URL-encoded array of bytes using the specified encoding object.</summary>
    /// <param name="str">The string to encode </param>
    /// <param name="e">The <see cref="T:System.Text.Encoding" /> that specifies the encoding scheme. </param>
    /// <returns>An encoded array of bytes.</returns>
    public static byte[] UrlEncodeToBytes(string str, Encoding e)
    {
        if (str == null)
            return (byte[])null;
        byte[] bytes = e.GetBytes(str);
        return HttpEncoder.Current.UrlEncode(bytes, 0, bytes.Length, false);
    }
}
I am not sure what you mean by "not invokeable". The "null" implementations are just reference assembly code. Not the real implementation code.
Closing as resolved per reply above.
Thanks for the explanation, makes more sense now.
About "not invocable": I meant that the following code returns null in .NetCore2:
MethodInfo method = typeof(System.Net.WebUtility).GetMethod(
    nameof(System.Net.WebUtility.UrlEncode),
    BindingFlags.Static | BindingFlags.NonPublic,
    null,
    new[]
    {
        typeof(byte[]),
        typeof(int),
        typeof(int)
    },
    null);
@cilerler I see that you are referring to using Reflection when you say "invocable".
Please keep in mind that using Reflection to reach non-public members is not a supported technique in .NET. While Reflection might serve as a stopgap measure at times, internal changes to .NET Framework or .NET Core may result in a particular Reflection approach no longer working. In general, bug fixes, security fixes, and feature changes can affect Reflection-based code, so it is not something that remains consistent between versions. Only the public API surface is supported.
Thank you @davidsh.
Since that method is only available through reflection (and, as you said, that is not the ideal way to handle things) and System.Text.Encodings.Web.UrlEncoder.Default.Encode does not generate the same result that System.Web.HttpUtility.UrlEncode generates, 🚨 I believe this issue should stay open.
Also, is there a place to see the UrlEncode method(s) of System.Net.WebUtility in .NetCore2.0?
I am looking for this line to be exact, but that one is for .NetFramework4.7, and I couldn't find the System.Net.WebUtility source in corefx either.
Calling internal/private methods was never supported anywhere.
I am lost in what is different between what and why it matters. Can you restate the issue? (from scratch, don't link previous comments)
http://not-exist-website.doNotTry/hello-world ðŸ–¤ðŸ’--œðŸ’œðŸ–
Encoder | Encoded Value |
---|---|
System.Web.HttpUtility.UrlEncode w/ (CodePagesEncodingProvider.Instance.GetEncoding(1252)) | http://not-exist-website.doNotTry/hello-world+%f0%9f%96%a4%f0%9f%92--%9c%f0%9f%92%9c%f0%9f%96/ |
[DynamicallyInvoked]System.Net.WebUtility.UrlEncode w/ (CodePagesEncodingProvider.Instance.GetEncoding(1252)) | http://not-exist-website.doNotTry/hello-world+%F0%9F%96%A4%F0%9F%92--%9C%F0%9F%92%9C%F0%9F%96/ |
System.Uri.EscapeUriString | http://not-exist-website.doNotTry/hello-world%20%C3%B0%C5%B8%E2%80%93%C2%A4%C3%B0%C5%B8%E2%80%99--%C5%93%C3%B0%C5%B8%E2%80%99%C5%93%C3%B0%C5%B8%E2%80%93/ |
System.Text.Encodings.Web.UrlEncoder.Default.Encode | http://not-exist-website.doNotTry/hello-world%20%C3%B0%C5%B8%E2%80%93%C2%A4%C3%B0%C5%B8%E2%80%99--%C5%93%C3%B0%C5%B8%E2%80%99%C5%93%C3%B0%C5%B8%E2%80%93/ |
Desired: the same result as either of the first 2 rows, via System.Text.Encodings.Web.UrlEncoder or System.Uri.EscapeUriString
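For what it's worth, as far as I know `System.Net.WebUtility` also exposes a public `UrlEncodeToBytes(byte[], int, int)` overload, which may be enough to get the second row's output without Reflection. The wrapper below is only a sketch under that assumption: `LegacyUrlEncode` and its `UrlEncode` method are names I made up, and it assumes code page 1252 is obtained through `CodePagesEncodingProvider` as in the table above.

```csharp
using System;
using System.Net;
using System.Text;

public static class LegacyUrlEncode
{
    // Mimics the shape of HttpUtility.UrlEncode(string, Encoding) using only public API.
    // Hypothetical wrapper: encode to bytes in the caller's encoding first,
    // then let WebUtility percent-encode those bytes.
    public static string UrlEncode(string text, Encoding encoding)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (encoding == null) throw new ArgumentNullException(nameof(encoding));

        byte[] rawBytes = encoding.GetBytes(text);
        byte[] encodedBytes = WebUtility.UrlEncodeToBytes(rawBytes, 0, rawBytes.Length);

        // The encoded output is pure ASCII, so ASCII decoding is lossless here.
        return Encoding.ASCII.GetString(encodedBytes);
    }
}
```

Called as `LegacyUrlEncode.UrlEncode(text, CodePagesEncodingProvider.Instance.GetEncoding(1252))`, this should produce the uppercase-hex form with `+` for spaces. It differs from the `HttpUtility` row only in hex case, which is not significant in percent-encoding.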
cc: @tarekgh
From @cilerler on August 20, 2017 22:13
I have text that was retrieved from the web and saved into a file with encoding 1252.
The text has URLs in it with some emojis. However, unless I get the result from method 1 below, I am not able to retrieve those pages.
Since we don't have System.Web in dotnet-core and there is no support in System.Net.WebUtility for a text encoding,
I wonder if you could provide a simple example of how to achieve this via the new UrlEncoder.
Thanks in advance 🤗
Copied from original issue: aspnet/HtmlAbstractions#46