aersam opened this issue 3 years ago
The ask here is basically raw read access, i.e. to retrieve nvarchar columns via `SqlDataReader.GetFieldValue<byte[]>` (and possibly also `SqlDataReader.GetBytes`), so as to enable efficient string access without performing UTF8->UTF16 conversions. FWIW in Npgsql we ended up always allowing raw byte access to column data - this way users can always do what they want.

Note that this assumes SqlClient can get UTF8-encoded strings from SQL Server for varchar/nvarchar (which it would then expose directly as `byte[]`) - not sure if this is actually possible (thanks @vonzshik for pointing that out and looking into it).
Well, if nvarchar would return UTF-16 and that's documented, then it would be ok for me. My use case is really UTF-8 varchar fields.
> The ask here is basically raw read access
I believe that the internal format of SQL Server char data is UCS-2, which is close to UTF-16 but not the same - historic stuff. For some reason I also think that even UTF-8 data is sent in 16-bit chunks in-protocol, so while storing as UTF-8 may be beneficial it'll still need decoding.
It's worth exploring when we get a .NET 6 build.
If you're going to be using System.Text.Json and other string-free or low-allocation techniques, we could consider a UTF8Stream which returns you a byte-streamable result.
I think that high-perf low-allocation APIs typically operate on `ReadOnlySpan<byte>` rather than on a stream (e.g. Utf8JsonReader) - ideally you would be able to access a varchar/nvarchar in whatever way you can already access raw binary data (GetBytes/GetFieldValue, GetStream if you want a stream). Of course SqlClient could still be performing decoding behind the scenes, though if that's necessary that reduces the value of this to some extent.
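To illustrate the span-based pattern being described, here is a minimal standalone sketch (not SqlClient code - the `payload` bytes stand in for what a raw varchar read would hypothetically return):

```csharp
using System;
using System.Text;
using System.Text.Json;

class SpanParsingDemo
{
    static void Main()
    {
        // Pretend these bytes came straight off the wire from a UTF-8 varchar column.
        ReadOnlySpan<byte> payload = Encoding.UTF8.GetBytes("{\"name\":\"aersam\",\"id\":42}");

        // Utf8JsonReader parses the UTF-8 bytes in place; no UTF-16 string is
        // materialized unless GetString() is explicitly called.
        var reader = new Utf8JsonReader(payload);
        int id = 0;
        while (reader.Read())
        {
            if (reader.TokenType == JsonTokenType.PropertyName
                && reader.ValueTextEquals("id"))
            {
                reader.Read();
                id = reader.GetInt32();
            }
        }
        Console.WriteLine(id); // prints 42
    }
}
```

With raw byte access, the span handed to `Utf8JsonReader` could come directly from the reader's column buffer instead of from an `Encoding.UTF8.GetBytes` round trip.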
If you wanted to read a 2GiB JSON string today I would advise using GetTextReader instead of getting the string. In the same way, if we add UTF-8 support I think we should add a streamed version as well as the ROS or Utf8String based version.
> Of course SqlClient could still be performing decoding behind the scenes, though if that's necessary that reduces the value of this to some extent.
Agreed. I think it is worth doing; if the user has UTF-8 data we should provide it in that form.
> If you wanted to read a 2GiB JSON string today I would advise using GetTextReader instead of getting the string. In the same way, if we add UTF-8 support I think we should add a streamed version as well as the ROS or Utf8String based version.
Definitely not pushing back - having GetStream work on varchar/nvarchar columns could be useful (though I'm not sure how users would work with that data).
Reading around UTF-8 support to see if I had my facts correct, I found this extensive article which is well worth reading: https://sqlquantumleap.com/2018/09/28/native-utf-8-support-in-sql-server-2019-savior-false-prophet-or-both/ In general UTF-8 in SQL Server is not a great idea and will hurt performance in some scenarios. That doesn't mean we shouldn't support it for those who need it, just that people need to be careful to use it appropriately.
@aersamkull Hi, can you provide a use case of the requested feature as a repro? That will help us to have a better understanding.
Thanks
> Reading around UTF-8 support to see if I had my facts correct, I found this extensive article which is well worth reading: https://sqlquantumleap.com/2018/09/28/native-utf-8-support-in-sql-server-2019-savior-false-prophet-or-both/ In general UTF-8 in SQL Server is not a great idea and will hurt performance in some scenarios. That doesn't mean we shouldn't support it for those who need it, just that people need to be careful to use it appropriately.
I have worked with SQL Server for years, and what I've learned is: every feature has advantages and disadvantages, and you should always try it out. But it's good to have options!
As the article says, there are cases where it is good for performance and storage size. In my use case I have a web app that receives data from an external source. It's a lot of text and it's English. English is of course the best fit for UTF-8, but any Latin language should benefit. A Slavic language is of course probably already problematic. A query showed that I would need only 50.2% of the storage for some fields. But I could not go with plain varchar - someone will always use a Unicode character.

As English still rules the world today, I personally think it is a feature many people can benefit from. You won't find that many chars above ASCII 127 in this thread, right?
Now, my technical use case: my source encoding is UTF-8 and my target encoding is UTF-8. I can save some bandwidth between SQL Server and the web server, and I expect this to benefit users and web server memory as there are lots of requests.

This is something a lot of our customers could benefit from, especially when using SQL Azure (where I can't have the web server next to the SQL DB on the same physical server).
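The storage/bandwidth claim is easy to check in isolation. A standalone sketch (the exact figures depend entirely on the text, as the article stresses):

```csharp
using System;
using System.Text;

class EncodingSizeDemo
{
    static void Main()
    {
        string english = "The quick brown fox jumps over the lazy dog";
        string cyrillic = "Быстрая коричневая лиса"; // Slavic text: 2 bytes per letter in UTF-8

        // For ASCII-only text, UTF-8 needs exactly half the bytes of UTF-16.
        Console.WriteLine($"English  UTF-8: {Encoding.UTF8.GetByteCount(english)}  UTF-16: {Encoding.Unicode.GetByteCount(english)}");

        // For Cyrillic, UTF-8 and UTF-16 both need 2 bytes per letter, so the
        // saving almost disappears (only the spaces stay at 1 byte in UTF-8).
        Console.WriteLine($"Cyrillic UTF-8: {Encoding.UTF8.GetByteCount(cyrillic)}  UTF-16: {Encoding.Unicode.GetByteCount(cyrillic)}");
    }
}
```

For the English sample this prints 43 vs 86 bytes - roughly the ~50% saving reported above - while the Cyrillic sample comes out nearly even (44 vs 46).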
> Reading around UTF-8 support to see if I had my facts correct, I found this extensive article which is well worth reading: https://sqlquantumleap.com/2018/09/28/native-utf-8-support-in-sql-server-2019-savior-false-prophet-or-both/ In general UTF-8 in SQL Server is not a great idea and will hurt performance in some scenarios. That doesn't mean we shouldn't support it for those who need it, just that people need to be careful to use it appropriately.
Very interesting read, thanks @Wraith2! Though that article sounds a bit overly negative... Specifically:
> the secondary benefit (i.e. saving space) only happens under specific conditions otherwise can cause the data to expand in size
It's very widely accepted that a huge number of scenarios involving text on the web involve ASCII-only characters, and using UTF-16 for these is simply a waste. I'm not sure if the original intent of UTF-8 was compatibility or space savings (I don't really care that much), but the space savings are very, very real, and are also why UTF-8 is quite standard in web data transfers etc. A great resource on this is http://utf8everywhere.org, which also points out how the problems attributed to UTF-8 also exist in UTF-16 (which is also a variable-length encoding, not a 2-bytes-per-character one).
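The "UTF-16 is also variable-length" point is worth making concrete, since it is often forgotten. A small standalone illustration with a character outside the Basic Multilingual Plane:

```csharp
using System;
using System.Text;

class SurrogateDemo
{
    static void Main()
    {
        // U+1D11E MUSICAL SYMBOL G CLEF is outside the Basic Multilingual Plane,
        // so UTF-16 must encode it as a surrogate pair: two 16-bit code units.
        string clef = "\U0001D11E";

        Console.WriteLine(clef.Length);                   // 2 UTF-16 code units, not 1
        Console.WriteLine(char.IsHighSurrogate(clef[0])); // True
        Console.WriteLine(Encoding.UTF8.GetByteCount(clef)); // 4 bytes in UTF-8 as well
    }
}
```

So any code that treats UTF-16 as "one `char` = one character" is already broken for such input; UTF-8 just makes the variable-length nature impossible to ignore.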
> there is typically a slight performance hit when using UTF-8 collations
Would be good to understand exactly what this means etc.
But now we're just talking about other stuff :) In the context of this issue, assuming I understand everything correctly, if someone chooses to have a varchar column with UTF-8 data, this would allow them to get the raw byte content directly and e.g. parse it into JSON without any sort of decoding/encoding anywhere.
> assuming I understand everything correctly, if someone chooses to have a varchar column with UTF-8 data, this would allow them to get the raw byte content directly and e.g. parse it into JSON without any sort of decoding/encoding anywhere.
Yes, the goal would be to give users UTF-8 bytes if they know that they're storing UTF-8 bytes. I can't make any promises on the encoding/decoding part, but the user experience should be that they get UTF-8 bytes.
Sounds good.
> I can't make any promises on the encoding/decoding part
Just to note that if this would involve various encoding/decoding behind the scenes, this feature would likely be a bad idea... The whole point here is to avoid extra conversions (perf), and to get raw data from the server directly into e.g. a JSON parser. If you need conversions to get UTF-8 out, you can likely just read a .NET string and parse that instead (though as always we can benchmark and see).
I had a look at the bytes the server sends and it's good news: they're not 16-bit, it's actually sending the original bytes. This means we need to do some juggling to allow you to request a `byte[]` when the column metadata is a string type with the right encoding.
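A standalone illustration of why that juggling matters: the raw single-byte payload only means something relative to the encoding implied by the column's collation. (A sketch only; `Encoding.Latin1` stands in here for a legacy code-page collation, and requires .NET 5+.)

```csharp
using System;
using System.Text;

class CollationBytesDemo
{
    static void Main()
    {
        string text = "héllo";

        // Under a legacy single-byte collation, 'é' is one byte on the wire...
        byte[] latin1 = Encoding.Latin1.GetBytes(text);

        // ...but under a UTF8 collation the same character takes two bytes.
        byte[] utf8 = Encoding.UTF8.GetBytes(text);

        Console.WriteLine(latin1.Length); // 5
        Console.WriteLine(utf8.Length);   // 6

        // Both buffers represent the same text, but handing either to the caller
        // as raw bytes is only useful if the caller knows which encoding applies.
        Console.WriteLine(Encoding.Latin1.GetString(latin1) == Encoding.UTF8.GetString(utf8)); // True
    }
}
```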
I think it would be nice to just use GetBytes and then have some GetEncoding method for the encoding.
Or add an overload to GetBytes with an out parameter for the encoding (not ideal, as the encoding does not change for a column).
> I think it would be nice to just use GetBytes and then have some GetEncoding method for the encoding
Hmmm, smells like scope creep. But now that I've looked, it does seem weird that there is no way to get the collation of a column or the derived encoding; even GetColumnSchema doesn't expose it. Even recent SO posts seem to say that private reflection needs to be used (https://stackoverflow.com/questions/35190858/how-to-get-the-collation-of-string-fields-in-a-result-set-via-sqldatareader), so providing a way to get the encoding is probably worth a separate issue. For this issue I'd say that if your goal is performance you should only use GetBytes on columns you know are UTF-8 encoded, because using the metadata API isn't going to be good for performance.
@Wraith2 very much agree. Developers can either know the encoding in advance (since they know the database), or they can discover a column's encoding from the system tables if they really want to. I wouldn't add any APIs for this optimization.
I'll add a new issue for the Encoding/Collation Information thing.
So allowing GetBytes to just return raw bytes for a varchar/nvarchar column is ok for me as a first step - whatever encoding it might be.
Implementation note: this will require changing how we deal with strings. Currently string fields are read into byte arrays, then decoded, and finally stored into the SqlBuffer for the column. To avoid a second decode step if bytes were requested, we would need to skip the decode step and store the raw bytes into the SqlBuffer, then on first request of the string do the decode step and switch the SqlBuffer over to the new string representation. This change would also mean that unused string fields are not materialized, which may be a performance advantage for some users.
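A rough standalone sketch of that deferred-decode idea (hypothetical types - the real SqlBuffer is considerably more involved):

```csharp
using System;
using System.Text;

// Hypothetical stand-in for a SqlBuffer column slot: keeps the raw wire bytes
// and only materializes a string on first string access, caching the result.
class LazyStringBuffer
{
    private byte[] _rawBytes;
    private string _decoded;
    private readonly Encoding _encoding;

    public LazyStringBuffer(byte[] rawBytes, Encoding encoding)
    {
        _rawBytes = rawBytes;
        _encoding = encoding;
    }

    // Raw access: no decode ever happens if only bytes are requested.
    public ReadOnlySpan<byte> GetBytes() =>
        _rawBytes ?? _encoding.GetBytes(_decoded);

    // String access: decode once, then switch the buffer over to the
    // string representation and drop the raw bytes.
    public string GetString()
    {
        if (_decoded == null)
        {
            _decoded = _encoding.GetString(_rawBytes);
            _rawBytes = null;
        }
        return _decoded;
    }
}

class Program
{
    static void Main()
    {
        var buf = new LazyStringBuffer(Encoding.UTF8.GetBytes("hello"), Encoding.UTF8);
        Console.WriteLine(buf.GetBytes().Length); // 5 - no string allocated yet
        Console.WriteLine(buf.GetString());       // hello - decoded on first request
    }
}
```

The point of the sketch is the state transition: a column that is never read as a string never pays for a decode or a string allocation.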
As a hack when performance was critical, and before SQL Server 2016, I used to manually convert varchar(8000) => varbinary(8000) for a string splitter I wrote in SQLCLR that was used in a bunch of ETLs, working around technical debt from previous developers.

From my testing with a varchar data source, it was a substantial performance improvement to convert from varchar => varbinary to avoid taking ASCII text and turning it into UCS-2. This hybrid model was pretty good because it avoided the marshalling and unmarshalling of text and having to pay that ascii => ucs2 => ascii cost over and over again. Here is the splitter and some testing code. I can't recommend this today of course, but this feature of reading text as binary would have been huge for me back then. https://gist.github.com/mburbea/5e142f846d0141c714a1#file-splitvarbinary-cs
See discussion in the .NET repo, specifically this comment.
Is your feature request related to a problem? Please describe.
There is Utf8JsonWriter, which allows us to write UTF-8 strings directly to the browser. If I get data from MS SQL Server that might be stored as UTF-8 already, it's always converted to a C# UTF-16 string and then back to UTF-8.
Describe the solution you'd like
I'd like to be able to get the `ReadOnlySpan<byte>` or `byte[]` of the UTF-8 value in order to directly use it in Utf8JsonWriter (or whatever API). It would be ok for me to just get the raw `byte[]` value in the encoding of the server, meaning different results for varchar, varchar collate *utf8, and nvarchar.
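What the requested API would enable, sketched standalone with a plain byte array standing in for the column value (today `valueFromDb` would have to come from an intermediate UTF-16 string):

```csharp
using System;
using System.Buffers;
using System.Text;
using System.Text.Json;

class Utf8PassThroughDemo
{
    static void Main()
    {
        // Pretend these UTF-8 bytes were read raw from a varchar COLLATE ..._UTF8 column.
        byte[] valueFromDb = Encoding.UTF8.GetBytes("stored as UTF-8");

        var output = new ArrayBufferWriter<byte>();
        using (var writer = new Utf8JsonWriter(output))
        {
            writer.WriteStartObject();
            writer.WritePropertyName("text");
            // WriteStringValue accepts UTF-8 bytes directly:
            // no UTF8 -> UTF16 -> UTF8 round trip anywhere.
            writer.WriteStringValue(valueFromDb);
            writer.WriteEndObject();
        }

        Console.WriteLine(Encoding.UTF8.GetString(output.WrittenSpan));
    }
}
```

The consuming side (`Utf8JsonWriter`) already speaks UTF-8 bytes natively; the missing piece is only the producing side in SqlClient.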
Describe alternatives you've considered
Well, the alternative is to just not do it and accept the overhead produced