AngleSharp / AngleSharp.Diffing

A library that makes it possible to compare two AngleSharp node lists and get a list of differences between them.

MIT License

37 stars 6 forks source link

Implement custom diffs. #35

Closed SebastianStehle closed 1 year ago

SebastianStehle commented 1 year ago

New Feature Proposal

Description

At the moment the diffs are not very meaningful. If you implement a custom diff logic, you cannot tell later what kind of diff it actually is. So to use the diffs you need a lot of switches and you have to repeat the logic of the filters.

e.g. we use this code: https://github.com/SebastianStehle/mjml-net/blob/5bbb6f3c9e97a33a8657a50d9921cb779d92fba5/Tests/Internal/AssertHelpers.cs#L134

Furthermore the texts for the diffs are only available via DebuggerDisplay attribute.

What I would like to see is:

More types for diffs (e.g. each case you should be its own class.
Implement "ToString()" instead of debugger display.
Override the diff in the filters.

e.g. the FilterStrategy could be changed to:

public delegate (IDiff? Diff, FilterDecision) FilterStrategy<TSource>(in TSource source, FilterDecision currentDecision)

Because the version is lower than 1.0 I do not see a problem to just change that, but it would not be a problem to add a new filter strategy and to implement the old logic using the new logic.

egil commented 1 year ago

A few thoughts:

what types IDiff do you think is currently missing? We have the the following types currently: https://github.com/AngleSharp/AngleSharp.Diffing/tree/devel/src/AngleSharp.Diffing/Core/Diffs. Perhaps we need additional properties on IDiff?
I'm not against adding a ToString that writes out a generic description of a diff. However, users are likely to want to control the output anyway, perhaps because they want it in a different language, or they want it to have a different structure (markdown/json/xml), so I do not see much value. One diff description is not going to be universally applicable.
I am not sure how you would override diff in filter. Filtering happens before comparison, see https://github.com/AngleSharp/AngleSharp.Diffing/wiki/Difference-Engine-Internals

I'm not opposed to changing things, but we need to think it through thoroughly before commiting to a redesign.

SebastianStehle commented 1 year ago

what types IDiff do you think is currently missing? We have the the following types currently: https://github.com/AngleSharp/AngleSharp.Diffing/tree/devel/src/AngleSharp.Diffing/Core/Diffs. Perhaps we need additional properties on IDiff?

I do not understand every line of code yet, but what I see from NodeDiff is:

DifferentNodeType
DifferentNodeClosing
DifferentStylesheet (from StyleSheetTextNodeComparer?)
DifferentNodeText

on Attributes

DifferentClasses
DifferentBooleanAttribute
DifferentStyle

I think I have this code for diff formatting from you and I am not sure why there are 4 cases for NodeDiff.

Whether a property or type is the right way, I don't know. But the benefit with extra types is the easy ToString() implementation (no if's)

I'm not against adding a ToString that writes out a generic description of a diff. However, users are likely to want to control the output anyway, perhaps because they want it in a different language, or they want it to have a different structure (markdown/json/xml), so I do not see much value. One diff description is not going to be universally applicable.

Of course, but if you are writing a technical tool you might just be fine with the ToString() representation. It works for the debugger and it would make my life easier.

I am not sure how you would override diff in filter. Filtering happens before comparison, see https://github.com/AngleSharp/AngleSharp.Diffing/wiki/Difference-Engine-Internals

I mean comparers. I have just mixed the terms.

egil commented 1 year ago

Right now, the diff types are generic, they represent generic cases, something (node/attr) is there that should not be, something (node/attr) is missing, and something (node/attr) is different.

The API allow you to add custom rules (matchers/filters/compares) that work with those. That makes the current solution extensible and users are able to extend within those bounds.

It does not make sense for us to add new 1st party specialized IDiff types without also changing the design of the whole comparer part of the pipeline to also allow 3rd parties to include their own IDiff types and have those work with the pipeline.

So, a design is needed.

Wrt. making it easier to iterate through a diff list, we could consider adding a DiffKind property to IDiff, it may provide an alternative to switch over the diff type. But then again, it that would not really improve things.

Another approach is to implement a IDiff visitor (aka the visitor pattern). Then you replace switches with a visitor.

Either way, when a diff is reported, e.g. in the case of different closing tags, that should be possible to visualize to the end user. If we can get the info during comparison, it should be visible during printing of a diff, i.e. through the element.SourceReference, and accessing the raw control/test source.

SebastianStehle commented 1 year ago

Yes, exactly and the generic approach is my problem with the current design.

An easy solution could be to change the comparer strategy, similar to what I mentioned above.

For example to:

record CompareStatus<TDiff>(TDiff? Diff, CompareResult Result) where TDiff : IDiff;

public delegate CompareStatus<TDiff> CompareStrategy<TComparison, TDiff>(in TComparison comparison, CompareStatus<TDiff> currentDecision) where TDiff : IDiff;

or to

record struct CompareStatus<TDiff>(TDiff? Diff, CompareResult Result) where TDiff : IDiff;

public delegate CompareStatus<TDiff> CompareStrategy<TComparison, TDiff>(in TComparison comparison, CompareStatus<TDiff> currentDecision) where TDiff : IDiff;

The only difference is whether a diff can be null or not. If it can be null the status could also be a struct. I am not happy with the term "Status" yet. If null is not valid, a "NullDiff" or "NoDiff" type is needed.

Of course we could introduce some constants for convenience.

record struct CompareStatus<TDiff>(TDiff? Diff, CompareResult Result) where TDiff : IDiff
{
   public readonly CompareStatus<TDiff> Same = new CompareStatus<TDiff>(null, CompareResult.Same);
}

or introduce a concrete type for NodeComparer

// Testing another name.
record struct NodeCompareDecision(NodeDiff? Diff, CompareResult Result) where TDiff : IDiff
{
   public readonly NodeCompareDecision Same = new NodeCompareDecision(null, CompareResult.Same);
}

The implementation would not change that much:

private IEnumerable<IDiff> CompareElement(in Comparison comparison)
        {
            var result = new List<IDiff>();

            var (diff, compareRes) = _diffingStrategy.Compare(comparison); // See HERE
            if (compareRes.HasFlag(CompareResult.Different))
            {
                result.Add(diff ?? new NodeDiff(comparison)); // See Here
            }

            if (!compareRes.HasFlag(CompareResult.Skip))
            {
                if (!compareRes.HasFlag(CompareResult.SkipAttributes))
                    result.AddRange(CompareElementAttributes(comparison));
                if (!compareRes.HasFlag(CompareResult.SkipChildren))
                    result.AddRange(CompareChildNodes(comparison));
            }

            return result;
        }

By enforcing that the diff needs to be NodeDiff or AttributeDiff, we also ensure that important properties are always set.

Another approach is to implement a IDiff visitor (aka the visitor pattern). Then you replace switches with a visitor.

I think it is not needed. The visitor pattern comes in handy, when it also iterates through a deep object hierarchy, but not that important here. I do not have a problem with the switch, I just do not want to repeat logic from the comparer. And if ToString() would be implemented, I would not even need the switch anymore (but I can only speak for myself here).

Btw: Perhaps Comparison should be renamed to NodeComparison?

egil commented 1 year ago

@SebastianStehle could the generalized IDiff types not have a custom ToString() method, or perhaps GetDiffText() method, that will do the work of printing out the difference between nodes/attributes?

Also, a visitor pattern approach could also be worthwhile as an alternative to the ToString override. We could provide a default visitor that visits all diffs in an IEnumerable<IDiff> and prints out whatever text we think is a good default for each diff case, but that would allow 3rd parties to provide their own visitor for localization etc..

Embedding "presentation" logic (ToString) that suites one particular use case inside the diff types does not feel right to me, even if that use case is likely to be a common one. Separation of concerns and all that. There is a reason why MVC/MVVM/MVP patterns etc. are a thing after all.

I am not opposed to having the comparers return a readonly record struct CompareResult<TDiff>(TDiff? Diff, ComparisonStatus Status) where TDiff : IDiff, class (would probably rename the enum CompareResult to ComparisonStatus). Moving the responsibility for picking the diff type to each comparer does allow them to embed more detail about the diff, which is not a bad thing. And the overhead is probably negotiable since when there is a diff, a diff type will always be created eventually.

We just have to remember that there are more people than you and I use this, so if we can avoid breaking changes and still solve the problem, then that is preferred.

SebastianStehle commented 1 year ago

Also, a visitor pattern approach could also be worthwhile as an alternative to the ToString override. We could provide a default visitor that visits all diffs in an IEnumerable and prints out whatever text we think is a good default for each diff case, but that would allow 3rd parties to provide their own visitor for localization etc..

It is up to you. I do not see that much value in that. If you also want to support the debugger you end up with ToString() again or duplicate texts and then the only advantage of a visitor is to expose which diff are actually available.

Edit: I think a lot of types have ToString() implementations that are not suitable for all cases, e.g. Enums.

egil commented 1 year ago

It is up to you. I do not see that much value in that. If you also want to support the debugger you end up with ToString() again or duplicate texts and then the only advantage of a visitor is to expose which diff are actually available.

I am more prone to just using switch/pattern matching in C# instead of the visitor pattern myself too. Still like to keep the "presentation" of a diff external to the diff type itself, at least for the built-in diff types. With your proposed CompareResult struct change, users are in theory able to just provide their own comparers and diff types and do what they find best.

To ease printing out diffs, the documentation could provide a default implementation of a method with a pattern/switch case over the built-in diff types that users could pick out.

Wrt. to custom/specialized diff types, I think they should be placed in the same folder as the comparer, next to the comparer itself, and only keep the generalized ones in the "/core/diffs/" folder. The comparer specific diffs should then inherit from either of the generalized diffs (AttrDiff, NodeDiff, UnexpectedAttrDiff, MissingAttrDiff, UnexpectedNodeDiff, MissingAttrDiff). That will enable users that switch over the diff types to just switch over those generalized types by default, and perhaps add one or two specialized diff types to their switch if they need to do so, as with your closing tag example.

SebastianStehle commented 1 year ago

Perhaps you just want to comment in the PR I have created?

To ease printing out diffs, the documentation could provide a default implementation of a method with a pattern/switch case over the built-in diff types that users could pick out.

I think the ToString() is a good default and way better then just returning "NodeDiff" or so. It works in the debugger and makes it easy to actually discover the meaning if you dump it to Console.Out or so. It is the default and most objects work like this. I do not see any disadvantages here.

If you use want to show the diffs to end users you have very likely a lot more requirements. For example, you want to convert it to Html and perhaps you also need line number and column.

Wrt. to custom/specialized diff types, I think they should be placed in the same folder as the comparer, next to the comparer itself, and only keep the generalized ones in the "/core/diffs/" folder.

I was thinking about this as well. BUT as a library user I like it when I do not have to add dozens of imports. E.g. I hate "Exceptions"-namespaces. The strategies namespaces contain implementation details and usually I do not need to know that. I think it would actually make more sense to move the diffs to the root namespace.

egil commented 1 year ago

Perhaps you just want to comment in the PR I have created?

Ill start looking at it now and provide some input there.

To ease printing out diffs, the documentation could provide a default implementation of a method with a pattern/switch case over the built-in diff types that users could pick out.

I think the ToString() is a good default and way better then just returning "NodeDiff" or so. It works in the debugger and makes it easy to actually discover the meaning if you dump it to Console.Out or so. It is the default and most objects work like this. I do not see any disadvantages here.

If you use want to show the diffs to end users you have very likely a lot more requirements. For example, you want to convert it to Html and perhaps you also need line number and column.

Do not disagree with that. Its always nice if it is easy for users to discover how to use an API without having to read docs.

Wrt. to custom/specialized diff types, I think they should be placed in the same folder as the comparer, next to the comparer itself, and only keep the generalized ones in the "/core/diffs/" folder.

I was thinking about this as well. BUT as a library user I like it when I do not have to add dozens of imports. E.g. I hate "Exceptions"-namespaces. The strategies namespaces contain implementation details and usually I do not need to know that. I think it would actually make more sense to move the diffs to the root namespace.

Since creating this library 3-4 years ago, I am now quite against the "namespace matching folder" that seem to be the default in .NET, at least for libraries. Everything that a normal user is expected to work with directly should just be in the root namespace, and other more specialized features could be under more specific namespaces.

egil commented 1 year ago

Closed by #36

egil commented 1 year ago

The implementation is available at https://www.nuget.org/packages/AngleSharp.Diffing/0.18.2-alpha-89. Give it a spin and let me know if things don't work as expected.