dotnet / aspnetcore

ASP.NET Core is a cross-platform .NET framework for building modern cloud-based web applications on Windows, Mac, or Linux.
https://asp.net
MIT License
35.23k stars 9.95k forks source link

asp.net core api Request.Form URL-encoded Chinese is not restored correctly. #33866

Open myunQ opened 3 years ago

myunQ commented 3 years ago

Environment description

运行时环境: OS Name: Windows OS Version: 10.0.19042 OS Platform: Windows RID: win10-x64 Base Path: C:\Program Files\dotnet\sdk\5.0.300\

Host (useful for support): Version: 5.0.6 Commit: 478b2f8c0e

.NET SDKs installed: 5.0.300 [C:\Program Files\dotnet\sdk]

.NET runtimes installed: Microsoft.AspNetCore.All 2.1.28 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All] Microsoft.AspNetCore.App 2.1.28 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App] Microsoft.AspNetCore.App 3.1.15 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App] Microsoft.AspNetCore.App 5.0.6 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App] Microsoft.NETCore.App 2.1.28 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App] Microsoft.NETCore.App 3.1.15 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App] Microsoft.NETCore.App 5.0.6 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App] Microsoft.WindowsDesktop.App 3.1.15 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App] Microsoft.WindowsDesktop.App 5.0.6 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]

To install additional .NET runtimes or SDKs: https://aka.ms/dotnet-download


# Problem Description
There is a client which uses GB2312 character set encoding and uses https POST `Content-Type:application/x-www-form-urlencoded; charset=GBK` to call the API provided by me.   
The data submitted by the client POST is encoded by `System.Web.HttpUtility.UrlEncode("查重费用", Encoding.GetEncoding("gbk"));`   The encoded value is: `%b2%e9%d6%d8%b7%d1%d3%c3`.

The entire request message looks like this:

POST https://localhost:44334/api/test HTTP/1.1 User-Agent:Fiddler Everywhere Content-Type:application/x-www-form-urlencoded; charset=GBK Host:localhost:44334 Content-Length:27

su=%b2%e9%d6%d8%b7%d1%d3%c3


The value received by my API is indeed like this: `%B2%E9%D6ط%D1%D3%C3`.

The received code is like this: 

// Program.cs public class Program { public static void Main(string[] args) { // The point of displaying this code is to tell you that the character set is registered. Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        CreateHostBuilder(args).Build().Run();
    }
            ... omit other codes ...
}

// TestController.cs [Route("api/[controller]")] [ApiController] public class TestController : ControllerBase { [HttpPost] public IActionResult Post() { // Here the correct value of Request.Form["su"] should be "查重费用", but it is actually "%B2%E9%D6ط%D1%D3%C3" return Content(Request.Form["su"]); } }


I see asp.net core source code [FormPipeReader.cs](https://github.com/dotnet/aspnetcore/blob/release/5.0/src/Http/WebUtilities/src/FormPipeReader.cs) 
    private string GetDecodedString(ReadOnlySpan<byte> readOnlySpan)
    {
        if (readOnlySpan.Length == 0)
        {
            return string.Empty;
        }
        else if (_encoding == Encoding.UTF8 || _encoding == Encoding.ASCII)
        {
             ... omit other codes ...
        }
        else
        {
            // Slow path for Unicode and other encodings.
            // Just do raw string replacement.
            var decodedString = _encoding.GetString(readOnlySpan);
            decodedString = decodedString.Replace('+', ' ');
            return Uri.UnescapeDataString(decodedString);
        }
    }


If replace `Uri.UnescapeDataString(decodedString)` with `System.Web.HttpUtility.UrlDecode(decodedString, _encoding)`, the problem I encountered can be solved.

Is it really correct to use `Uri.UnescapeDataString(string)` here?
BrennanConroy commented 3 years ago

Triage: We need to check if any encoding other than Unicode is allowed in form url encoding.

ghost commented 3 years ago

Thanks for contacting us.

We're moving this issue to the Next sprint planning milestone for future evaluation / consideration. We would like to keep this around to collect more feedback, which can help us with prioritizing this work. We will re-evaluate this issue, during our next planning meeting(s). If we later determine, that the issue has no community involvement, or it's very rare and low-impact issue, we will close it - so that the team can focus on more important and high impact issues. To learn more about what to expect next and how this issue will be handled you can read more about our triage process here.

Tratcher commented 3 years ago

Hmm, the spec for application/x-www-form-urlencoded is outdated and doesn't directly address this issue. It says to use uri escaping as outlined in RFC 1738 which said to use %xx for non-ascii data but didn't specify the encoding. That was updated by 2396 and subsequently replaced by 3986 that says UTF-8 should be used.

It sounds like you control the client in this scenario? Is there a reason for that client to not use UTF-8?

myunQ commented 3 years ago

Hi @Tratcher !

I have no control over the client, which is a server application of another company. It is a call from the server to the server.

Tratcher commented 3 years ago

FYI the prior form reader also used UTF-8: https://github.com/dotnet/aspnetcore/blob/cee22edd8c27c1ac767a06005d19fd973ab7ee22/src/Http/WebUtilities/src/FormReader.cs#L270-L276

System.Web seems to scan the input bytes for % sequences, un-escape them to bytes, and then use the provided encoder to decode the whole byte sequence. https://referencesource.microsoft.com/#System.Web/HttpValueCollection.cs,216 https://referencesource.microsoft.com/#System.Web/Util/HttpEncoder.cs,535

We should make a change to be compatible with that System.Web implementation, but it's unclear if we'd break anyone in the other direction. Note I don't know that we'd also support the %uXXXX format from the System.Web implementation.

myunQ commented 3 years ago

Yes, because of the system connection this time, I have noticed that both .net core and .net 5 use the UTF-8 character set to decode the data read from the form.

But this will be a disaster for applications that do not use the UTF-8 character set and decode characters other than ASCII.

I now have to buffer the request body and read it again for a second decoding. This will affect the performance of the application a bit, especially when the form content is large.

Tratcher commented 3 years ago

Triage: We would accept a contribution for this but it's not currently a priority given how uncommon the situation seems to be. This is the first report since this code was written back in Katana ~8 years ago.

ghost commented 3 years ago

We've moved this issue to the Backlog milestone. This means that it is not going to be worked on for the coming release. We will reassess the backlog following the current release and consider this item at that time. To learn more about our issue management process and to have better expectation regarding different types of issues you can read our Triage Process.

myunQ commented 3 years ago

Thanks.

Qos-xin commented 2 years ago

I have the same problem.