NuGet / Insights

Gather insights about public NuGet.org package data
Apache License 2.0
24 stars 7 forks source link

Consider using the Verify package for snapshot testing #100

Open joelverhagen opened 9 months ago

joelverhagen commented 9 months ago

I independently discovered snapshot testing and wrote hacky thing myself 😅.

https://github.com/VerifyTests/Verify - this might be better to use

joelverhagen commented 9 months ago

I like the workflow of Verify. It cleanly separates "received" and "verified" versions of the test output. There is rich tooling integration, such as a .NET tool to accept, reject, or skip unverified ("received") changes in the test output. I also like how it prescribes changes to .editorconfig and .gitattributes to make source control and editors work better with the verified files.

I also like the naming conventions of the snapshot files because they appear as child items of the test class.

Some things to figure out:

  1. Whether to CSV blobs directly (big strings) or as deserialized ICsvRecord instances. The former is a more strict approach but has a nasty diff (an existing problem with what I've implemented in Insights already). The latter loses some resolution on CSV serialization changes.
  2. Currently a lot of the assertions on snapshot content are in helper methods. These helper methods fail to detect the proper directory name because Verify uses [CallerFilePath] which needs to be the source file with the test itself, not the helper method source file.
  3. A lot of tests have multiple blobs that are verified. This is supported by Verify but it takes some extra work. UseTextForParameters can be used to have multiple test data files per test class. DisableRequireUniquePrefix can be used to re-verify some content against the snapshot a second time in the test. I haven't found drawbacks of these overrides but there must be a reason they are not default, especially DisableRequireUniquePrefix.
  4. If you have multiple Verify calls in one method, they should all be awaited at once when Task.WhenAll to all of the "received" files to be generated. If the previous assertions were serial, perhaps in a for loop, this will need to change to a loop producing tasks and then a Task.WhenAll after the loop. This is straight forward but still a lot of refactoring.
joelverhagen commented 9 months ago

Concern 1 is not a big deal IMO. There can be a couple strict tests about CSV serialization and the rest can be highly readable entity-based assertions.

Concern 2 is not a big deal at all. In my POC this was easy and also forced me to refactor test helpers making them clean in the end. Snapshot naming based off of [CallerFilePath] is superior to the manual naming we have now.

Concern 3 seems fine. I just DisableRequireUniquePrefix for every Verify invocation and it seems to have only upside and no downsides that I've seen.

Concern 4 is annoying but also not a big problem. It's easily mitigated with AutoVerify.

joelverhagen commented 9 months ago

I disagree with some of the defaults in the library such as launching the diff tool automatically (annoying for many test files). Mitigated with DiffRunner.Disabled = true.

I think the custom serialization of not-quite JSON is not right for the project. I dislike new serialization schemes so normal JSON can be used with VerifierSettings.UseStrictJson().

I dislike the default DateTimeOffset serialization but it appears to be configurable with TreatAsString.

Using eol=lf in .gitattributes felt right at first but this causes problems for string values that have CR LF since they are converted to LF thus changing the subject of the test. This can be fixed with binary or -text in the .gitattributes I think.

During the migration, I will assert on the entire CSV string instead of the deserialized CSV record POCOs so I can see the snapshots haven't changed. Later, we can switch to asserting on array of CSV POCO instead of a big nasty CSV string. This will make future diffs more readable.

joelverhagen commented 9 months ago

Since there are dozens of snapshot files for some test classes, I think I will put all of the test files in a TestData subdirectory much like today. This will keep the source tree a bit more organized IMO.

SimonCropp commented 9 months ago

Since there are dozens of snapshot files for some test classes, I think I will put all of the test files in a TestData subdirectory much like today. This will keep the source tree a bit more organized IMO.

My preference is to keep snapshots as close to the class that generates them. Note that Verify helps keep this use case clean by nesting snapshots under the file that generates them

image

SimonCropp commented 9 months ago

Whether to CSV blobs directly (big strings) or as deserialized ICsvRecord instances. The former is a more strict approach but has a nasty diff (an existing problem with what I've implemented in Insights already). The latter loses some resolution on CSV serialization changes.

there is the concept of Converters https://github.com/VerifyTests/Verify/blob/main/docs/converter.md so u can take any type, and convert it into as many parts as you want. so in the case of ICsvRecord, you could both write the whole csv and serialize the structure

SimonCropp commented 9 months ago

A lot of tests have multiple blobs that are verified. This is supported by Verify but it takes some extra work. UseTextForParameters can be used to have multiple test data files per test class. DisableRequireUniquePrefix can be used to re-verify some content against the snapshot a second time in the test. I haven't found drawbacks of these overrides but there must be a reason they are not default, especially DisableRequireUniquePrefix.

can u point me to one of these tests, and i can possible propose an alternative that does not require DisableRequireUniquePrefix

SimonCropp commented 9 months ago

if you start with a PR that adds Verify, i can help when u hit issues

joelverhagen commented 9 months ago

Since there are dozens of snapshot files for some test classes, I think I will put all of the test files in a TestData subdirectory much like today. This will keep the source tree a bit more organized IMO.

My preference is to keep snapshots as close to the class that generates them. Note that Verify helps keep this use case clean by nesting snapshots under the file that generates them

Haha, I keep going back and forth on which I like better. I think as long as there are not too many test classes or test methods in one directory then it's okay to keep the test files next to the test file. I'm just worried about a directory with 100s of files making the GitHub browser, ls, or Windows Files Explorer very crowded.

Whether to CSV blobs directly (big strings) or as deserialized ICsvRecord instances. The former is a more strict approach but has a nasty diff (an existing problem with what I've implemented in Insights already). The latter loses some resolution on CSV serialization changes.

there is the concept of Converters https://github.com/VerifyTests/Verify/blob/main/docs/converter.md so u can take any type, and convert it into as many parts as you want. so in the case of ICsvRecord, you could both write the whole csv and serialize the structure

This makes sense. I see Verify really as a tilt towards improving dev experience over what Insights has today. The test data diff we have today is pretty much impossible to read. Like there are CSV files with serialized JSON in some of the cells. This is impossible to read even in Beyond Compare which has CSV diff.

I prototyped a WriteOnlyJsonConverter which detected these cases and expanded CSV into and array of objects and the JSON CSV fields into potentially deep objects making the snapshot very readable (deep array of objects instead of an opaque CSV string). This is where I want to end up eventually after I've moved all of the tests over. I think a wise step 1 is to move all of the tests using the existing test data shape to avoid regressions.

I don't know if I want to keep the whole CSV or not... I like the idea of having that strict verification for all tests but it does mean any PR diffs that change or add CSVs have some pieces that are opaque (the CSV string diff) and other parts that are highly readable (the serialized object diff). This is probably the best of both worlds and we just as a team need to know that some parts of PR diffs will be hard to read and to just focus on the readable parts knowing that the unreadable parts are just there for strictness.

Still thinking about it. But I will definitely look into the Converter feature.

A lot of tests have multiple blobs that are verified. This is supported by Verify but it takes some extra work. UseTextForParameters can be used to have multiple test data files per test class. DisableRequireUniquePrefix can be used to re-verify some content against the snapshot a second time in the test. I haven't found drawbacks of these overrides but there must be a reason they are not default, especially DisableRequireUniquePrefix.

can u point me to one of these tests, and i can possible propose an alternative that does not require DisableRequireUniquePrefix

This is the existing test before introducing Verify. It asserts both the shape of some Azure Table Storage entities (intermediate state for NuGet Insights) and CSV (the final output of NuGet Insights) https://github.com/NuGet/Insights/blob/2c7e4636594a6fff21d7360e7cf4a083bc734e9c/test/Worker.Logic.Test/CatalogScan/Drivers/PackageCertificateToCsv/PackageCertificateToCsvIntegrationTest.cs#L19-L54

Description of the code under test: https://github.com/NuGet/Insights/blob/main/docs/drivers/PackageCertificateToCsv.md

Existing test data: https://github.com/NuGet/Insights/tree/main/test/Worker.Logic.Test/TestData/PackageCertificateToCsv

The CSVs are split by record type (T1, T2 -- referring to type parameters to the test suite and driver: one piece of code produces two CSVs per package it processes in this case). Each CSV type is also partitioned into buckets (compact_0.csv - compact_2.csv) for scalability reasons. Most tests set the bucket count to 3 to get partitioning testing but in PROD we have the bucket count to 1000 leaving CSVs that are several MB to several hundred MB instead of GBs.

if you start with a PR that adds Verify, i can help when u hit issues

Thank you for the offer! You've already been very helpful in this thread. Here's what I have so far: https://github.com/NuGet/Insights/pull/106 (draft).

It has the PackageCertificateToCsvIntegrationTest suite mentioned above moved over so you can see the before and after.

I currently forced Verify to encode test data the same way it already is but put the snapshot files in a new place. I am focusing on the changes to C# right now not the snapshot shape.

SimonCropp commented 9 months ago

Haha, I keep going back and forth on which I like better. I think as long as there are not too many test classes or test methods in one directory then it's okay to keep the test files next to the test file. I'm just worried about a directory with 100s of files making the GitHub browser, ls, or Windows Files Explorer very crowded.

yeah. in that scenario i move the test and its snapshots into their own directory to keep them together