Open dotnet-bot opened 2 years ago
The dotnet/arcade-services repository has been very stable and received a very low number of changes during the .NET 6 and .NET 7 cycles. However, with the new .NET 8 charter, it has become a point of interest for many efforts such as improved mirroring or prodcon (VMR). Unfortunately, there is some debt accrued which makes it risky to start making changes in the repo again. Specifically, these points of interest are:
DarcLib – A library that contains a lot of useful code for managing git repositories (local, AzDO, GitHub). This code is re-used between the release repo, darc CLI and the Maestro service. It is now also being re-used by the VMR effort.
Maestro – A service that has a lot of complex interactions with external world (GitHub, AzDO) with which there are associated quite complex E2E scenario tests.
Maestro Scenario Tests – A valuable set of E2E tests that validate our ability to integrate with other services (GitHub, AzDO). They test quite complex flows involving creation, updates and merging of dependency flow PRs.
Bottom line is that it’s a repo that will be a shared ground for Redmond / Prague teams for some time going forward. Both teams need to be able to understand its status, service it and be comfortable making changes. These future workstreams might also give us a window of opportunity to justify fixing long-standing issues.
The debt in DarcLib stems from strong coupling of seemingly unrelated things (e.g., git management and Build Asset Registry clients) and poor separation of concerns which makes it hard to re-use. The code also has low test coverage stemming from a poor testability of the large components (e.g., IRemote). Poor testability makes some simple fixes very hard to verify by unit tests (example issue when the whole team tried to fix PR description link during their FR and failed to finish on time).
Together with some debt in DarcLib, non-exceptional exceptions make local dev debugging, reading logs and investigating staging/production failures difficult. Any unexpected problems in dependency flow needs someone from @dnceng (and even then, it’s mostly Matt M and Ricardo) to look at logs. We had automation that opened issues, but those weren’t customer friendly, and we should surface problems such that customers can self-serve. Furthermore, since DarcLib is not unit-testable/covered, small fixes / changes require running the service often. It is time consuming to sometimes fix simple things.
The tests are slow and unreliable in staging. This has caused the official pipeline to have a pass rate of 30% in the last 14 days. The Maestro scenario tests caused 90% of the failures, with all recent failures being explained by flakiness.
The tests also take a long time to run, as they execute sequentially and take about 80 minutes in total. This unreliability/flakiness also costs extra cycles whenever we release Maestro (even when we don’t really make updates) as it requires investigation into why the staging build is not green. Our monitoring has opened 80 issues since April regarding failed builds which needed to be investigated, only to determine that the root cause for most of them were the flaky tests. The investigation requires tribal knowledge of the service and the tests themselves which mandates specific people (Ricardo) to determine whether a release is safe. Maestro hasn’t seen a lot of changes other than small bug fixes and edge cases, but with the upcoming work we need to make sure these tests are still valuable.
Easier DarcLib code re-use to leverage code for talking to GH/AzDO from newer projects like Build Analysis, VMR or mirroring.
Higher DarcLib unit-test coverage would increase turnaround for small fixes / changes and could be handled even non-dnceng teams => easier to pick up and fix bugs, higher velocity for adding new features.
Better democratization and less overhead during deployments since it shouldn’t take 5+ reruns of main builds and some tribal knowledge to make sure we’re ready to roll out. Better devops practices, no deploy on red, ...
Aligning the repository with our team's guidance for testing (https://github.com/dotnet/arcade/blob/main/Documentation/Validation/README.md). Specifically, this means:
Attempt to increase the code coverage of unit tests to as close to 80% as possible (Currently at 30.3%)
Adding pre-deployment checks to check necessary infrastructure is in place (Azure resources and storage accounts, etc)
Making the post-deployment tests exclusively validate end to end scenarios, along with checking the integration with external dependencies (Github, Azure DevOps, Azure)
Eliminate overlapping end to end tests. As a result of the low code coverage for unit tests, the repo currently has 21 scenario tests, some of which only differ in small ways. As the coverage of unit tests improves it should be possible to reduce the number of scenario tests we need to run.
Adjusting the repository structure to align with the guidance.
Exception-less Maestro service logs unless there are actual problems
Stable tests that provide confidence for deployments, the goal should be that the end to end tests should only fail if there's a problem with the code under test, or when there are unrecoverable problems in one of the external dependencies.
Migrated from https://github.com/dotnet/core-eng/issues/10280
@ChadNedzlek wrote:
From core-eng/10210:
Right now the "arcade-services" repository publishes 5 big groups of things:
All 5 of these have wildly different servicing and distribution needs, having all of them in one repository is causing us to keep pulling it's arcade usage in different directions, and recently we sort of broke it with all the pulling, and now arcade is more or less dead in the water.
We need to re-evaluate the design of these 5 sets of things, and make sure they are in the right repository and flowing correctly.
Right now there are several large problems:
Monster Objects/Assemblies
DarcLib.dll has a Monster Objects, or, rather, Monster Interface, in "IRemote". It's bound a lot of our DLL's into very odd dependency spirals. At the very least, it needs to be split up. And if there is any shared logic, it needs it's own types, so that the client and server can share the logic with types that make sense. This has to happen first, because right now it's bound everything together. Once it's broken up, so we can think about each piece needs individually, we can make progress.
Testability
The IRemote class makes most testing very costly, because so many parts of the system reference this god object... to write any sort of test, you need an implementation of this, which is a huge chunk of functionality. Any test that calls code that references this interface needs detailed knowledge about which interface methods methods it's going to call, because it's not feasible to mock all 50+ of them, and most of them may (or may not, depending on implementation) make network calls, so it's not possible to use the real object either. This makes the test expensive to write, because you need to write a new mock for every test, and you need to inspect every code path to see what methods you need to mock. It also makes the test fragile, because anyone could add a new call at any point, and the test will just fail. Which means they are high cost, low value tests, which are the worst kind. But it's all that can be done in the current state.
Distribution
There are several interesting dependencies that are hidden inside IRemote. It has a dependency on the EF database itself in some places because of some implementations. It has a dependency on the generated client (which then has a dependency on the server existing to provide the web methods). It has dependencies on file formats through the "Versions" stuff. These are all bound together in one assembly, which means trying to decide how to service this assembly is difficult, because all of these dependencies have different lifecycles (file versions need to be able to read backward, the database needs to stay in lock step with the EF, and the web versions can stay locked on one version forever).
Distribution
The service and the service shared DLL's have wildly different release and support needs than the build task, which are different than darc.exe which is different than darclib. Arcade can't really handle things in a single repository having different release patterns, so it's not sustainable for all these to live here. It feels like the build tasks need to go to arcade, since that's where the rest of our build tasks live. Darc.exe and DarcLib.dll are strange beasts, and need serious consideration about what support and versioning for them really means. My gut says darc.exe goes to arcade... it's a build tool. And once the Monster Assembly is broken up, it might be easier to understand what it's support and distribution means, which things make more sense in arcade, and what makes sense in arcade-services.
arcade dependency
Once we solve those problems, and the service is just a service, it's possible the arcade dependency doesn't make sense anymore. Arcade is tightly focused on shipping signed and verified binaries with complicated release channels. The service isn't that. It's a single channel with no special dependencies that can easily be handled with of the shelf tools (dotnet build, dotnet test, nuget publish, nuget restore), and the arcade complication is just making that more complicated.
Recently Triaged Issues
All issues in this section should be triaged by the v-team into one of their business objectives or features.