dotnet / SqlClient

Microsoft.Data.SqlClient provides database connectivity to SQL Server for .NET applications.

UTF8 Support via GetBytes or GetFieldValue #1208

Open aersam opened 3 years ago

aersam commented 3 years ago

See the discussion in the .NET repo, specifically this comment

Is your feature request related to a problem? Please describe.

There is Utf8JsonWriter, which allows us to write UTF-8 strings directly to the browser. If I get data from MS SQL Server that might already be stored as UTF-8, it is always converted to a UTF-16 C# string and then back to UTF-8.

Describe the solution you'd like

I'd like to be able to get the ReadOnlySpan&lt;byte&gt; or byte[] of the UTF-8 value in order to use it directly in Utf8JsonWriter (or whatever API). It would be OK for me to just get the raw byte[] value in the encoding of the server, meaning different results for varchar, varchar collate *utf8, and nvarchar.
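A sketch of the requested usage, assuming a hypothetical GetFieldValue&lt;byte[]&gt; capability on string columns (command and response.Body are assumed context; SqlClient does not support this call on varchar today):

```csharp
using System.Text.Json;

using var reader = await command.ExecuteReaderAsync();
await using var writer = new Utf8JsonWriter(response.Body);

writer.WriteStartArray();
while (await reader.ReadAsync())
{
    // Hypothetical: return the raw UTF-8 bytes of a varchar column as stored
    // on the server, with no intermediate .NET string.
    byte[] utf8Value = reader.GetFieldValue<byte[]>(0);

    // Utf8JsonWriter has overloads that accept UTF-8 text directly,
    // so no UTF-16 round trip occurs.
    writer.WriteStringValue(utf8Value);
}
writer.WriteEndArray();
await writer.FlushAsync();
```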

Describe alternatives you've considered

Well, the alternative is to just not do it and accept the overhead produced

roji commented 3 years ago

The ask here is basically raw read access, i.e. to retrieve nvarchar columns via SqlDataReader.GetFieldValue&lt;byte[]&gt; (and possibly also SqlDataReader.GetBytes), so as to enable efficient string access without performing UTF-8 to UTF-16 conversions. FWIW, in Npgsql we ended up always allowing raw byte access to column data - this way users can always do what they want.
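For reference, this is the chunked GetBytes pattern that already exists for varbinary columns; the ask is to permit the same calls on varchar/nvarchar so the wire bytes surface without decoding (reader is a positioned SqlDataReader and destination any Stream, both assumed context):

```csharp
var buffer = new byte[8192];
long fieldOffset = 0;
long bytesRead;
// GetBytes copies a slice of the column's bytes into the caller's buffer
// and returns how many bytes were actually copied.
while ((bytesRead = reader.GetBytes(0, fieldOffset, buffer, 0, buffer.Length)) > 0)
{
    destination.Write(buffer, 0, (int)bytesRead);
    fieldOffset += bytesRead;
}
```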

roji commented 3 years ago

Note that this assumes SqlClient can get UTF8-encoded strings from SQL Server for varchar/nvarchar (which it would then expose directly as byte[]) - not sure if this is actually possible (thanks @vonzshik for pointing that out and looking into it).

aersam commented 3 years ago

Well, if nvarchar would return UTF-16 and it's documented, then it would be OK for me. My use case is really UTF-8 varchar fields.

Wraith2 commented 3 years ago

The ask here is basically raw read access

I believe that the internal format of SQL char data is UCS-2, which is close to UTF-16 but not the same - historic stuff. For some reason I also think that even UTF-8 data is sent in 16-bit chunks in the protocol, so while storing as UTF-8 may be beneficial, it'll need decoding.

It's worth exploring when we get a .NET 6 build.

If you're going to be using System.Text.Json and other string-free or low-allocation techniques, we could consider a UTF8Stream which returns you a byte-streamable result.

roji commented 3 years ago

I think that high-perf, low-allocation APIs typically operate on ReadOnlySpan&lt;byte&gt; rather than on a stream (e.g. Utf8JsonReader) - ideally you would be able to access a varchar/nvarchar column in whatever way you can already access raw binary data (GetBytes/GetFieldValue, or GetStream if you want a stream). Of course SqlClient could still be performing decoding behind the scenes, though if that's necessary that reduces the value of this to some extent.

Wraith2 commented 3 years ago

If you wanted to read a 2 GiB JSON string today, I would advise using GetTextReader instead of getting the string. In the same way, if we add UTF-8 support I think we should add a streamed version as well as the ReadOnlySpan&lt;byte&gt; or Utf8String based version.
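For context, the GetTextReader streaming pattern available today looks roughly like this (command is an assumed SqlCommand; SequentialAccess avoids buffering the whole row):

```csharp
using System.Data;

using var reader = command.ExecuteReader(CommandBehavior.SequentialAccess);
while (reader.Read())
{
    // Streams the character data without materializing one huge string.
    using var text = reader.GetTextReader(0);
    var chunk = new char[4096];
    int count;
    while ((count = text.Read(chunk, 0, chunk.Length)) > 0)
    {
        Consume(chunk, count); // Consume: placeholder for the caller's processing
    }
}
```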

Of course SqlClient could still be performing decoding behind the scenes, though if that's necessary that reduces the value of this to some extent.

Agreed. I think it is worth doing, if the user has utf8 data we should provide it in that form.

roji commented 3 years ago

If you wanted to read a 2 GiB JSON string today, I would advise using GetTextReader instead of getting the string. In the same way, if we add UTF-8 support I think we should add a streamed version as well as the ReadOnlySpan&lt;byte&gt; or Utf8String based version.

Definitely not pushing back - having GetStream work on varchar/nvarchar columns could be useful (though I'm not sure how users would work with that data).

Wraith2 commented 3 years ago

Reading around UTF-8 support to see if I had my facts correct, I found this extensive article, which is well worth reading: https://sqlquantumleap.com/2018/09/28/native-utf-8-support-in-sql-server-2019-savior-false-prophet-or-both/ In general, UTF-8 in SQL Server is not a great idea and will hurt performance in some scenarios. That doesn't mean we shouldn't support it for those who need it, just that people need to be careful to use it appropriately.

JRahnama commented 3 years ago

@aersamkull Hi, can you provide a use case of the requested feature as a repro? That would help us get a better understanding.

Thanks

aersam commented 3 years ago

Reading around UTF-8 support to see if I had my facts correct, I found this extensive article, which is well worth reading: https://sqlquantumleap.com/2018/09/28/native-utf-8-support-in-sql-server-2019-savior-false-prophet-or-both/ In general, UTF-8 in SQL Server is not a great idea and will hurt performance in some scenarios. That doesn't mean we shouldn't support it for those who need it, just that people need to be careful to use it appropriately.

I have worked with SQL Server for years, and what I've learned is: every feature has advantages and disadvantages, and you should always try it out. But it's good to have options!

As the article says, there are cases where it is good for performance and storage size. In my use case I have a web app that receives data from an external source. It's a lot of text and it's English. English is of course the best fit for UTF-8, but any Latin-script language should profit. Of course a Slavic language is probably already problematic. I ran a query showing I would only need 50.2% of the storage for some fields. But I could not go with plain varchar - you always have someone using a Unicode char.
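The rough magnitude of that saving is easy to check for mostly-ASCII text, since UTF-8 stores ASCII characters in one byte while UTF-16 always uses at least two:

```csharp
using System.Text;

string text = "The quick brown fox jumps over the lazy dog";
Console.WriteLine(Encoding.UTF8.GetByteCount(text));    // 43 bytes
Console.WriteLine(Encoding.Unicode.GetByteCount(text)); // 86 bytes (UTF-16)
```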

As English still rules the world today, I personally think this is a feature that many people can benefit from. You won't find that many chars above ASCII 127 in this thread, right?

Now, my technical use case:

My source encoding is UTF-8 and my target encoding is UTF-8. I can save some bandwidth between SQL Server and the web server, and I expect this to benefit users and web server memory, as there are lots of requests.

This is something a lot of our customers could profit from, especially when using SQL Azure (where I can't have the web server next to the SQL DB on the same physical server).

roji commented 3 years ago

Reading around UTF8 support to see if i had my facts correct I found this extensive article which is well worth reading. https://sqlquantumleap.com/2018/09/28/native-utf-8-support-in-sql-server-2019-savior-false-prophet-or-both/ In general UTF8 in sql server is not a great idea and will hurt performance in some scenarios. That doesn't means we shouldn't support it for those who will need it, just that people need to be careful to use it appropriately.

Very interesting read, thanks @Wraith2! Though that article sounds a bit overly negative... Specifically:

the secondary benefit (i.e. saving space) only happens under specific conditions otherwise can cause the data to expand in size,

It's very widely accepted that a huge number of scenarios involving text on the web are ASCII-only, and using UTF-16 for them is simply a waste. I'm not sure if the original intent of UTF-8 was compatibility or space savings (I don't really care that much), but the space-savings part is very, very real, and is also why UTF-8 is quite standard in web data transfers etc. A great resource on this is http://utf8everywhere.org, which also points out how the problems attributed to UTF-8 also exist in UTF-16 (which is itself a variable-length encoding, not a 2-byte one).
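A quick illustration of that variable-length point in C#: characters outside the Basic Multilingual Plane occupy two UTF-16 code units (a surrogate pair):

```csharp
string clef = "\U0001D11E"; // MUSICAL SYMBOL G CLEF, outside the BMP
Console.WriteLine(clef.Length);                                  // 2 UTF-16 code units
Console.WriteLine(System.Text.Encoding.UTF8.GetByteCount(clef)); // 4 bytes in UTF-8
```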

there is typically a slight performance hit when using UTF-8 collations,

Would be good to understand exactly what this means etc.

But now we're just talking about other stuff :) In the context of this issue, assuming I understand everything correctly, if someone chooses to have a varchar column with UTF-8 data, this would allow them to get the raw byte content directly and e.g. parse it into JSON without any sort of decoding/encoding anywhere.

Wraith2 commented 3 years ago

assuming I understand everything correctly, if someone chooses to have a varchar column with UTF-8 data, this would allow them to get the raw byte content directly and e.g. parse it into JSON without any sort of decoding/encoding anywhere.

Yes, the goal would be to give users UTF-8 bytes if they know that they're storing UTF-8 bytes. I can't make any promises on the encoding/decoding part, but the user experience should be that they get UTF-8 bytes.

roji commented 3 years ago

Sounds good.

I can't make any promises on the encoding/decoding part

Just to note that if this would involve various encoding/decoding steps behind the scenes, this feature would likely be a bad idea... The whole point here is to avoid extra conversions (perf) and to get raw data from the server directly into e.g. a JSON parser. If you need conversions to get UTF-8, you can likely just read a .NET string and parse that instead (though as always, we can benchmark and see).

Wraith2 commented 3 years ago

I had a look at the bytes the server sends, and it's good news: they're not 16-bit, it's actually sending the original bytes. This means we need to do some juggling to allow you to request a byte[] when the column metadata is a string type with the right encoding.

aersam commented 3 years ago

I think it would be nice to just use GetBytes and then have some GetEncoding method for the encoding.

Or add an overload to GetBytes with an out parameter for the Encoding (not ideal, as the encoding does not change for a column).

Wraith2 commented 3 years ago

I think it would be nice to just use GetBytes and then have some GetEncoding method for the encoding.

Hmmm, smells like scope creep. But now that I've looked, it does seem weird that there is no way to get the collation of a column or the derived encoding - even GetColumnSchema doesn't expose it. Even recent SO posts say that private reflection needs to be used (https://stackoverflow.com/questions/35190858/how-to-get-the-collation-of-string-fields-in-a-result-set-via-sqldatareader), so providing a way to get the encoding is probably worth a separate issue. For this issue, I'd say that if your goal is performance, you should only use GetBytes on columns you know are UTF-8 encoded, because using the metadata API isn't going to be good for performance.

roji commented 3 years ago

@Wraith2 very much agree. Developers can either know the encoding in advance (since they know the database), or they can discover a column's encoding from the system tables if they really want to. I wouldn't add any APIs for this optimization.
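For anyone who does want to discover it, a sketch of reading collations from the system catalog (connection is an assumed open SqlConnection, the table name is illustrative; a *_UTF8 collation suffix on SQL Server 2019+ indicates UTF-8 encoded varchar data):

```csharp
using var cmd = connection.CreateCommand();
cmd.CommandText = @"
    SELECT c.name, c.collation_name
    FROM sys.columns AS c
    WHERE c.object_id = OBJECT_ID(@table) AND c.collation_name IS NOT NULL";
cmd.Parameters.AddWithValue("@table", "dbo.Documents"); // illustrative table name
using var reader = cmd.ExecuteReader();
while (reader.Read())
{
    Console.WriteLine($"{reader.GetString(0)} => {reader.GetString(1)}");
}
```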

aersam commented 3 years ago

I'll add a new issue for the encoding/collation information thing. So allowing GetBytes to just return raw bytes for a varchar/nvarchar column is OK for me as a first step - whatever encoding it might be.

Wraith2 commented 3 years ago

Implementation note: this will require changing how we deal with strings. Currently, string fields are read into byte arrays, then decoded, and finally stored into the SqlBuffer for the column. To avoid a second decode step when bytes are requested, we would need to skip the decode step and store the raw bytes into the SqlBuffer, then on the first request for the string do the decode step and switch the SqlBuffer over to the new string representation. This change would also mean that unused string fields are not materialized, which may be a performance advantage for some users.
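A minimal sketch of that lazy-materialization idea (illustrative types only, not SqlClient's actual internals):

```csharp
using System.Text;

// Illustrative only: keeps the wire bytes and only pays for UTF-16 decoding
// if someone actually asks for the string.
sealed class LazyStringBuffer
{
    private byte[]? _rawBytes;
    private string? _value;
    private readonly Encoding _encoding;

    public LazyStringBuffer(byte[] rawBytes, Encoding encoding)
    {
        _rawBytes = rawBytes;
        _encoding = encoding;
    }

    // Raw access: no decode ever happens if only bytes are requested.
    public byte[] GetBytes() => _rawBytes ?? _encoding.GetBytes(_value!);

    // First string request decodes once, then switches the buffer over
    // to the string representation.
    public string GetString()
    {
        if (_value is null)
        {
            _value = _encoding.GetString(_rawBytes!);
            _rawBytes = null;
        }
        return _value;
    }
}
```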

mburbea commented 3 years ago

As a hack, when performance was critical and before SQL Server 2016, I used to manually convert varchar(8000) => varbinary(8000) for a string splitter I wrote in SQLCLR that was used in a bunch of ETLs, working around technical debt from previous developers.

From my testing with a varchar data source, it was a substantial performance improvement to convert from varchar => varbinary to avoid taking ASCII text and turning it into UCS-2. This hybrid model was pretty good because it avoided the marshalling and unmarshalling of text and having to pay that ASCII => UCS-2 => ASCII cost over and over again. Here is the splitter and some testing code. I can't recommend this today, of course, but this feature of reading text as binary would have been huge for me back then. https://gist.github.com/mburbea/5e142f846d0141c714a1#file-splitvarbinary-cs
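That workaround is still expressible today; a sketch, with illustrative table/column names and connection an assumed open SqlConnection:

```csharp
using System.Data;
using Microsoft.Data.SqlClient;

// Cast server-side so the varchar payload arrives as varbinary and the
// client never builds an intermediate UTF-16 string.
using var cmd = new SqlCommand(
    "SELECT CAST(Payload AS varbinary(8000)) FROM dbo.Items", connection);
using var reader = cmd.ExecuteReader(CommandBehavior.SequentialAccess);
while (reader.Read())
{
    byte[] raw = (byte[])reader[0]; // bytes in the column's code page (or UTF-8)
}
```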