apache / lucenenet

Apache Lucene.NET
https://lucenenet.apache.org/
Apache License 2.0
2.21k stars 634 forks source link

Null pointers/nullable #737

Open nikcio opened 1 year ago

nikcio commented 1 year ago

I can see that most of the issues marked as "bugs" in SonarCloud are Null pointers should not be dereferenced which risks throwing a NullReferenceException

Issues: https://sonarcloud.io/project/issues?resolved=false&rules=csharpsquid%3AS2259&types=BUG&id=apache_lucenenet

As an example I looked into this issue: https://sonarcloud.io/project/issues?issues=AYRH0TRT_qq9ReJdi4pi&open=AYRH0TRT_qq9ReJdi4pi&id=apache_lucenenet

Here I tracked the newest version of the code down here: https://github.com/apache/lucene/blob/7c130d2f07e00fd8725cf7e22cc268dd4331fbbe/lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/ConvTable.java#L59

Which is similar to the version we have which should therefore mean that the same issue is still present in the Java version too? Therefore I'm wondering what's the best approach for fixing these errors or if they should be fixed at all. It's very possible that it's next to impossible to encounter this error in a real world scenario. (I haven't looked too much into how it's used).

And related to this, would it make sense to enable the Nullable feature across the project to give a informative public API for the developers using Lucene.NET? Here we could mark variables and parameters that can be null at some point. This should be possible to do step by step by adding the #nullable enable that I can see you've used in some files.

NightOwl888 commented 1 year ago

And related to this, would it make sense to enable the Nullable feature across the project to give a informative public API for the developers using Lucene.NET? Here we could mark variables and parameters that can be null at some point. This should be possible to do step by step by adding the #nullable enable that I can see you've used in some files.

Well, you've come to a similar conclusion that I have - nullable warnings, nullable reference types, guard clauses, and exception documentation comments are all intimately linked together. It is best if these are all dealt with together. Recently OfflineSorter was updated to include these in https://github.com/apache/lucenenet/pull/733/commits/c0ba34db214c7ca48c0914d2d80bc3b09abe12b9. The whole Lucene.Net.Spatial module was also done in https://github.com/apache/lucenenet/pull/619.

But it is a huge task because it requires careful analysis. If we average 30 minutes per file, it is about ~1500 hours of work.

Also, although this can technically be enabled one file at a time, it doesn't end up working out that way because anything a file is dependent upon that is newly #nullable enable is suddenly going to show warnings that would need to be addressed after the job was already "done" for a that file. It works out best to start at the base of the dependency tree and then work out to all of the submodules that depend on them.

Of course, we also have some classes that are higher in priority than others. Certainly, all of the main APIs used in the demos are the most important ones to enable nullable reference types on. There are several APIs where null is allowed as a parameter, so this would make it easier to understand that it is a valid value just by looking at Intellisense.

All of that being said, we can break this up into chunks to make it easier to manage, for sure.

Which is similar to the version we have which should therefore mean that the same issue is still present in the Java version too? Therefore I'm wondering what's the best approach for fixing these errors or if they should be fixed at all.

Asserts vs Guard Clauses

It is common in Java to simply let exceptions happen and then deal with them later. This is why most of the guard clauses are missing that would stop these null reference warnings showing up in the scan. Guard clauses are a thing in Java, but they are much less common than in .NET.

In Lucene, there are a ton of things that I would call "guard clauses" that were made into asserts (example). Since asserts can be enabled or disabled in Java, this sort of makes sense - the performance of checking whether values are valid is mitigated in production when asserts are disabled. However, in conversations I have had with @rclabo, we both agree that it doesn't make much sense to expect .NET users to have to enable guard clauses. They should always be there.

IMO, we should also add additional guard clauses that check for null even if they didn't exist in Lucene, which is a better approach to getting rid of these warnings than suppressing them. An ArgumentNullException or InvalidOperationException is an intentional design decision. A NullReferenceException is a bug.

NOTE: At least having an analyzer already pre-built that identifies these is one less custom analyzer we need to create to locate these issues, so that part of the job is done.

One more thing to consider is what impact changing the exception type might have on code that uses the API. We've done a review of exceptions in https://github.com/apache/lucenenet/pull/476/files, and I am fairly confident there won't be much impact with changing from AssertionError to ArgumentException or ArgumentNullException. ArgumentOutOfRangeException may take some additional care because in Java there are 3 different exceptions that map to 2 in .NET. So, we will need pretty thorough testing on each section that is changed to ensure we aren't catching errors where we shouldn't be. The extension methods used in catch blocks take care of much of it, but they are not 100% thorough because the mapping isn't perfect, not every catch block has been converted, and sometimes we simply needed a design change to make it function.

NOTE: Some of the IO components are set up to always throw ArgumentOutOfRangeException instead of ArgumentException when checking index and length to ensure they are within the bounds of the array (as is typical in .NET) because in Java these would throw IndexOutOfBoundsException.

Inheritance

Another thing to consider is how heavily Lucene depends on inheritance and the fact that when you make a #nullable decision on the API of the base class, it is enforced for every class that inherits it. That can go against the grain of Lucene's design where a base class has both subclasses that do not accept null and subclasses that do accept null. I haven't looked into this too thoroughly - there might be some way around it that I am unaware of.

An option under consideration for this particular case is simply to make them all allow null, but have them instantiate a new dictionary when null is passed to ensure it doesn't throw. This is an API design change, but would make the API easier to work with.

This really ought to be a IDictionary<TKey, TValue> rather than IDictionary, also, but that is another project.

rclabo commented 1 year ago

Man I don't know guys. That's a mountain of work and I'm not sure it adds a lot of value. It'd be one thing if we were experiencing lots of stability issues related to exceptions thrown due to dereferencing null references but that's just not the case. The Nullable feature is a nice add-on to the C# language but just because it's been added to the language doesn't mean that we need to implement it in LuceneNET. It would be really sad to see the LuceneNET production release delayed further to support the Nullable feature in my opinion.

nikcio commented 1 year ago

Yeah after the explanation above I think the same. It would be great to have at some point but it shouldn't be blocking any releases. And with an rough estimate of ~1500 as stated above it properly isn't something that would realistic to complete in a sensible timeframe without more time and funding added to the project. But I think it's great to have an issue tracking it so any progress can be referenced and so it can be used as a reference for new commers to the project.

rclabo commented 1 year ago

@nikcio I'd agree with that.

NightOwl888 commented 1 year ago

I don't think you guys appreciate the power that the scan gave us. It turned ~1500 hours of unspecified work with a vague notion of "let's fix the main APIs first" into a very targeted few dozen hours of work to hit the most problematic areas first. All while avoiding having to change exception behavior in a breaking way after the release.

Sure, this was near the bottom in terms of priority - and still is - but having a hard target to hit bumps it up a bit in my opinion. Clean up the squeaky wheels that the scan gave us, and then fix the rest at some point after the release (most notably, the APIs that allow null that the scan isn't showing us).

rclabo commented 1 year ago

@NightOwl888 Ok, I hear ya. But consider this. When was the last time you saw Lucene.NET throw an exception due to a null reference? Personally I've been running it for a couple years now and NEVER encountered such an exception. Just sayin.

NightOwl888 commented 1 year ago

Well, that is because you know what you are doin' :).

But those mere mortals who allow a null to pass into one of our APIs might be chasing their tail for awhile because the null gets passed down through 5 or so more APIs before it hits the one that throws. A guard clause makes the stack trace much more explicit as to where the problem lies. And who knows how long they will chase their tail if they don't even have a vague notion that they can turn on asserts.

What has been most enlightening about the scan is that it is pointing us to very specific areas that have a small problem, and in several instances it led to finding a much bigger problem in the same area. Even though the scan didn't pick it up the bigger problem as an issue.