dotnet / dotnet-api-docs

.NET API reference documentation (.NET 5+, .NET Core, .NET Framework)
https://docs.microsoft.com/dotnet/api/
Other
725 stars 1.56k forks source link

Methods which take `StringComparison` are not sufficiently clear about how hard they are to use correctly #8973

Open Smaug123 opened 1 year ago

Smaug123 commented 1 year ago

Methods which take StringComparison are not sufficiently clear about how hard they are to use correctly

I claim that it is easy to read e.g. https://learn.microsoft.com/en-us/dotnet/api/system.string.endswith?view=net-8.0 and come away thinking that "I'll be fine, I'll just stick with the invariant culture because I don't care about anything above code point 127". This is amply discussed in Best Practices for Using Strings, and the docs do vaguely suggest that perhaps you might want to read it; but I believe what is actually required in the Notes to Callers is "you must read Best Practices for Using Strings, in full, before using any method that can take a StringComparison".

In particular, for example, the following documentation (which is where one would naturally look to find the "non-printing characters are omitted in most configurations" footgun) is entirely unhelpful except insofar as it suggests that you might want to read the document whose purpose is to instil the correct sense of unease:

As explained in Best Practices for Using Strings, we recommend that you avoid calling string comparison methods that substitute default values and instead call methods that require parameters to be explicitly specified. To determine whether a string ends with a particular substring by using the string comparison rules of the current culture, call the EndsWith(String, StringComparison) method overload with a value of CurrentCulture for its comparisonType parameter.

I read this paragraph entirely as "string comparison is culture-specific, so be careful about which culture you're in", and not as the much more important "… and also bear in mind that you probably do not know what any given culture's string comparison rules are, even if you think you do".

I believe every method which has an overload which takes a StringComparison should contain a Note to Callers that warns the user that they must read the Best Practices. If I were writing the docs, they would say the following:

You must read Best Practices in full before using this function, even if you think you currently know what a culture-sensitive comparison is. Read it all; do not just read the section about the StringComparison you currently intend using. (You may skip this step if you can already articulate why the invariant culture is almost never the correct culture to use for string comparison and related operations.)

Target framework

True of .NET 5 through at least .NET 8.

dotnet-issue-labeler[bot] commented 1 year ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/area-system-globalization See info in area-owners.md if you want to be subscribed.

Issue Details
**Methods which take `StringComparison` are not sufficiently clear about how hard they are to use correctly** I claim that it is easy to read e.g. https://learn.microsoft.com/en-us/dotnet/api/system.string.endswith?view=net-8.0 and come away thinking that "I'll be fine, I'll just stick with the invariant culture because I don't care about anything above code point 127". This is amply discussed in _Best Practices for Using Strings_, and the docs do vaguely suggest that perhaps you might want to read it; but I believe what is actually required in the Notes to Callers is "you *must* read _Best Practices for Using Strings_, *in full*, before using any method that can take a StringComparison". In particular, for example, the following documentation (which is where one would naturally look to find the "non-printing characters are omitted in most configurations" footgun) is entirely unhelpful except insofar as it suggests that you might want to read the document whose purpose is to instil the correct sense of unease: > As explained in [Best Practices for Using Strings](https://learn.microsoft.com/en-us/dotnet/standard/base-types/best-practices-strings), we recommend that you avoid calling string comparison methods that substitute default values and instead call methods that require parameters to be explicitly specified. To determine whether a string ends with a particular substring by using the string comparison rules of the current culture, call the [EndsWith(String, StringComparison)](https://learn.microsoft.com/en-us/dotnet/api/system.string.endswith?view=net-8.0#system-string-endswith(system-string-system-stringcomparison)) method overload with a value of [CurrentCulture](https://learn.microsoft.com/en-us/dotnet/api/system.stringcomparison?view=net-8.0#system-stringcomparison-currentculture) for its comparisonType parameter. I read this paragraph entirely as "string comparison is culture-specific, so be careful about which culture you're in", and *not* as the much more important "… and also bear in mind that you probably do not know what any given culture's string comparison rules are, even if you think you do". I believe *every* method which has an overload which takes a `StringComparison` should contain a Note to Callers that warns the user that they must read the Best Practices. If I were writing the docs, they would say the following: > You must read Best Practices *in full* before using this function, even if you think you currently know what a culture-sensitive comparison is. Read it all; do not just read the section about the `StringComparison` you currently intend using. (You may skip this step if you can already articulate why the invariant culture is almost never the correct culture to use for string comparison and related operations.) **Target framework** True of .NET 5 through at least .NET 8. - [x] .NET Core - [ ? ] .NET Framework - [x] .NET Standard
Author: Smaug123
Assignees: -
Labels: `untriaged`, `Pri3`, `area-System.Globalization`, `:watch: Not Triaged`
Milestone: -
MSDN-WhiteKnight commented 1 year ago

The suggested wording is useless as it consists solely of weasel words that do not add anything for the reader. API docs are supposed to be pretty consise. We should just state it is recommended to pass comparison explicitly and link to article that provides more details. It's not possible to understand all rules of culture-sensitive comparisons and i dont' think we should encourage it. Instead it's better for readers to focus on understanding scenarios where it's safe to use ordinal comparison (the majority of cases) and just stick with it.

The linked docs seems fine actually, maybe it should put more emphasis on Ordinal instead of CurrentCulture as it's least suprising.

Smaug123 commented 1 year ago

I entirely agree that we shouldn't encourage understanding all rules of culture-sensitive comparisons! My assertion is that the current docs are insufficiently clear that "you must use ordinal comparison unless you know exactly what you are doing". Note, for example, that the portion of the docs I quoted do not even mention the word "ordinal", and your implied suggestion ("it is recommended to pass comparison explicitly") also does not mention the word "ordinal".

There's a reason I said "If I were writing the docs…", by the way, which is that I have very strong opinions on documentation which are likely incompatible with many other people's. I gave that example so as to gesture to what I would have preferred the documentation to say; I intended it to be entirely unspecified how the documentation ends up saying it.

(Out of interest, could you define "weasel word" for me? I am not sure we agree on what the phrase means. To me, inherited from Wikipedia, weasel words are phrases which avoid saying anything but do so in such a way that they look like they're saying something. A sentence which starts "You must", or indeed any sentence in the imperative mood, will struggle very hard to be a weasel phrase under my understanding.)

Smaug123 commented 1 year ago

I would implore you to try and forget all your current understanding of the pitfalls of .NET strings, and come at this from the perspective of a user who is used to how every other language does it, i.e. non-culture-aware. Such people likely do not understand the (significant!) ramifications of the single sentence "This method performs a word (case-sensitive and culture-sensitive) comparison using the current culture.", and there's currently nothing on https://learn.microsoft.com/en-us/dotnet/api/system.string.endswith?view=net-8.0 which even suggests to such a user that they will make a mistake by default. (The important fact, "culture-sensitive", far from being emphasised, is in parentheses!) The only such clause is "we recommend that you avoid calling string comparison methods that substitute default values", and that could perfectly plausibly be intended as "… because it's generally good practice to be explicit rather than relying on implicit behaviour"; there is nothing in the current documentation to suggest that you will probably be writing actual bugs by default, unless you click through and read most of Best Practices. This is why I suggested being much more explicit that the user must read Best Practices.

tarekgh commented 1 year ago

It may be worth just adding a link to the doc https://learn.microsoft.com/en-us/dotnet/csharp/how-to/compare-strings when mentioning something like This method performs a word (case-sensitive and culture-sensitive or ordinal comparisons. Usually, users who are new or not familiar with the APIs will explore this link and get a better idea about the differences and the expectation.

Smaug123 commented 1 year ago

Sure, I think something like "See https://learn.microsoft.com/en-gb/dotnet/csharp/how-to/compare-strings for examples of culture sensitivity in string manipulation." would help a little. However, it is also pretty misleading, in the sense that this particular article starts with the heading "Default ordinal comparisons" and talks first about String.Equals and friends; that is, we would be linking to documentation which first discusses the only string comparison methods which do the obvious and expected thing by default.

I'm just trying to provide suggestions that help people either do the correct thing first time with the minimum of reading, or at the very least warn them (again with minimal reading) that they are probably about to do the wrong thing unless they do the reading. Again, nowhere in https://learn.microsoft.com/en-gb/dotnet/csharp/how-to/compare-strings does it mention that the default for most operations is the rather obscure "current culture" mode (except for array sorting, where it does mention it); it's not even implied by any of the examples. (That particular page also heavily uses InvariantCulture, which is specifically noted to be almost always incorrect in Best Practices; this further confuses the issue.)

It is still the case, for every page except Best Practices I've seen so far, that the statement "You will probably be introducing actual behavioural bugs if you use the defaults" is either not present, or is deeply buried and mentioned offhand. Moreover, all the exhortations I have seen so far to read Best Practices have been at best with the sentiment "consider reading this page", or "it is recommended to…"; nowhere does the documentation appear to say that it is required reading on pain of behavioural bugs. I still believe that even if we add a link to "comparing strings in C#", people will easily get the wrong impression.

tarekgh commented 1 year ago

The link can just point to the right section like https://learn.microsoft.com/en-gb/dotnet/csharp/how-to/compare-strings#linguistic-comparisons.

Smaug123 commented 1 year ago

Basically my point is: there is quite a lot of documentation, and yet over and over and over and over and over and over and over, people get this wrong. The current documentation is clearly missing something, and I assert that what it is missing is the statement "You will get this wrong unless you read Best Practices".

Smaug123 commented 1 year ago

If we link to the Linguistic Comparisons section, could we at least change its first sentence from

Strings can also be ordered using linguistic rules for the current culture.

to

Many string comparison methods default to using linguistic rules for the current culture to order their inputs (in contrast to String.Equals above), unless overridden.

Again, I'm just grasping for anything that warns the user that the default behaviour will do unexpected things (rather than merely saying "It's not recommended to use the default behaviour", which could mean anything, or omitting to mention it entirely).

tarekgh commented 1 year ago

I am not opposing adding or tweaking any sentence if the proposed change is correct.

tarekgh commented 1 year ago

Feel free to submit a PR requesting change for that sentence.

MSDN-WhiteKnight commented 1 year ago

Note, for example, that the portion of the docs I quoted do not even mention the word "ordinal", and your implied suggestion ("it is recommended to pass comparison explicitly") also does not mention the word "ordinal".

That's intended, ordinal is not suitable for all scenarios. When strings are for UI, comparisons should be culture aware to avoid another set of bugs where invisible chars (like zero-width spaces) accidently creep into input field and unexpectedly affect comparison.

there is nothing in the current documentation to suggest that you will probably be writing actual bugs by default, unless you click through and read most of Best Practices.

The docs recommend to pass the comparison mode explicitly. Once readers follow this recommendation, they are forced to think which mode they should pick. So they read more about them in the linked article, choose the right mode for their scenario, and therefore less likely to run into bugs. The fact they need to read best practices is sort of implied, otherwise it would not be linked.

(Out of interest, could you define "weasel word" for me? I am not sure we agree on what the phrase means.

I just meant it as "overly generic words that convey no useful information". Offtopic might be more accurate then weasel words. Regardless of how it's called, my point is that phrases like "You must read these docs in full" are useless, because they only needlessly "lecture" user what to do, but don't provide any technical information on documented APIs. So it's better to leave them off to keep docs concise. If there's related document, just "For more information on ..., see ..." is enough. It could be put in "Important" or "Note" block for more emphasis if it's really important for avoiding bugs.

Smaug123 commented 1 year ago

The trouble I find in general with the MSDN docs is that they don't really tell you what is important versus what is merely useful to know: they uniformly "recommend", they don't use the scale from "implore" to "suggest offhand". Of course, this stylistic decision can't be changed now, but we can work around it with appropriate wording. (The choice of that appropriate wording is of course not something I can or should make unilaterally; all I can do is identify how I, a user, misinterpreted what was there, and state what words would have taught me the correct thing.)

There's an inherent tension between your stated desires to "keep docs concise" and the requirement that users understand what is a really very complex part of the API. I do understand this, and I understand that you prefer to keep API docs short. What I consider to be absolutely necessary is an indication of the relative importance of the part of the documentation that says "for further information, read this background article". No distinction is currently drawn between the crucial parts and the almost-always-irrelevant parts. As long as this distinction is not made, the user is forced to read and digest every single piece of documentation with equal attention before they can safely use the documented features. Users do not, in fact, do this, and the result is dozens of bug reports against the runtime which are closed as "by design". Domains which are inherently complex sometimes simply cannot be represented concisely.

Just to check our base assumptions: do you agree that it is a problem with the documentation if people routinely read it and come away from it without understanding what it was necessary for them to understand? To me, such documentation is unfit for purpose regardless of the intentions with which it was created: the only possible benchmark for success of documentation is the extent to which its readers learn from it, and that is a question which can only be determined empirically by writing docs as best we can and seeing how real-world users respond to it. I assert that this particular experiment has come back with conclusive results in the form of the many specious bugs raised against the runtime, so we should revisit these docs.

MSDN-WhiteKnight commented 1 year ago

The trouble I find in general with the MSDN docs is that they don't really tell you what is important versus what is merely useful to know

There's actually a feature for it: tags like Important, Warning etc. There's example where they are used in API docs: https://learn.microsoft.com/en-us/dotnet/api/system.runtime.compilerservices.unsafe?view=net-7.0#remarks

There's an inherent tension between your stated desires to "keep docs concise" and the requirement that users understand what is a really very complex part of the API.

There's no tension. We keep API docs concise and link into conceptual docs that explain more complex details. This is already implemented for the linked article.

I assert that this particular experiment has come back with conclusive results in the form of the many specious bugs raised against the runtime, so we should revisit these docs.

I don't think all of these bugs you linked are due to insufficient docs. Some look like the author didn't read docs at all. Others look like author did read the docs, but disagree with some Unicode collation rules. Not enough for a conculsion to revise docs. But maybe we should think how to make this "Best practices for using strings" more discoverable, for example, link to it from more string-related articles.

Smaug123 commented 1 year ago

I've scoped this more tightly and raised https://github.com/dotnet/dotnet-api-docs/pull/8989 . I continue to believe that "the reader didn't read docs" is usually a problem with the docs and not with the reader, unless they literally didn't open the page, but I think I have a way of indicating "this is probably not what you expect" in very few words and without telling the reader to do anything new.

Smaug123 commented 1 year ago

Another piece of evidence that the documentation is still insufficient, by the way: someone who knew how and why StartsWith is hard to use correctly (because he'd just implemented an analyser to detect usages of StartsWith without an explicit culture) still used the invariant culture inappropriately when fixing the problem, in the issue linked above. It remains the case that you must read about 1/2 of Best Practices for Comparing Strings before you use these functions, and the documentation still does not tell you to do so (it merely quotes it as a source for one piece of advice it gives right at the end), and it does not tell you which 1/2 to read.