dotnet / arcade-services

Arcade Engineering Services
MIT License
61 stars 74 forks source link

Pile of tech-debt for arcade-services #2481

Open dotnet-bot opened 2 years ago

dotnet-bot commented 2 years ago

Migrated from https://github.com/dotnet/core-eng/issues/10280

@ChadNedzlek wrote:

From core-eng/10210:

Right now the "arcade-services" repository publishes 5 big groups of things:

All 5 of these have wildly different servicing and distribution needs, having all of them in one repository is causing us to keep pulling it's arcade usage in different directions, and recently we sort of broke it with all the pulling, and now arcade is more or less dead in the water.

We need to re-evaluate the design of these 5 sets of things, and make sure they are in the right repository and flowing correctly.

Right now there are several large problems:

Monster Objects/Assemblies

DarcLib.dll has a Monster Objects, or, rather, Monster Interface, in "IRemote". It's bound a lot of our DLL's into very odd dependency spirals. At the very least, it needs to be split up. And if there is any shared logic, it needs it's own types, so that the client and server can share the logic with types that make sense. This has to happen first, because right now it's bound everything together. Once it's broken up, so we can think about each piece needs individually, we can make progress.

Testability

The IRemote class makes most testing very costly, because so many parts of the system reference this god object... to write any sort of test, you need an implementation of this, which is a huge chunk of functionality. Any test that calls code that references this interface needs detailed knowledge about which interface methods methods it's going to call, because it's not feasible to mock all 50+ of them, and most of them may (or may not, depending on implementation) make network calls, so it's not possible to use the real object either. This makes the test expensive to write, because you need to write a new mock for every test, and you need to inspect every code path to see what methods you need to mock. It also makes the test fragile, because anyone could add a new call at any point, and the test will just fail. Which means they are high cost, low value tests, which are the worst kind. But it's all that can be done in the current state.

Distribution

There are several interesting dependencies that are hidden inside IRemote. It has a dependency on the EF database itself in some places because of some implementations. It has a dependency on the generated client (which then has a dependency on the server existing to provide the web methods). It has dependencies on file formats through the "Versions" stuff. These are all bound together in one assembly, which means trying to decide how to service this assembly is difficult, because all of these dependencies have different lifecycles (file versions need to be able to read backward, the database needs to stay in lock step with the EF, and the web versions can stay locked on one version forever).

Distribution

The service and the service shared DLL's have wildly different release and support needs than the build task, which are different than darc.exe which is different than darclib. Arcade can't really handle things in a single repository having different release patterns, so it's not sustainable for all these to live here. It feels like the build tasks need to go to arcade, since that's where the rest of our build tasks live. Darc.exe and DarcLib.dll are strange beasts, and need serious consideration about what support and versioning for them really means. My gut says darc.exe goes to arcade... it's a build tool. And once the Monster Assembly is broken up, it might be easier to understand what it's support and distribution means, which things make more sense in arcade, and what makes sense in arcade-services.

arcade dependency

Once we solve those problems, and the service is just a service, it's possible the arcade dependency doesn't make sense anymore. Arcade is tightly focused on shipping signed and verified binaries with complicated release channels. The service isn't that. It's a single channel with no special dependencies that can easily be handled with of the shelf tools (dotnet build, dotnet test, nuget publish, nuget restore), and the arcade complication is just making that more complicated.

Recently Triaged Issues

All issues in this section should be triaged by the v-team into one of their business objectives or features.

riarenas commented 2 years ago

Context and motivation

The dotnet/arcade-services repository has been very stable and received a very low number of changes during the .NET 6 and .NET 7 cycles. However, with the new .NET 8 charter, it has become a point of interest for many efforts such as improved mirroring or prodcon (VMR). Unfortunately, there is some debt accrued which makes it risky to start making changes in the repo again. Specifically, these points of interest are:

Bottom line is that it’s a repo that will be a shared ground for Redmond / Prague teams for some time going forward. Both teams need to be able to understand its status, service it and be comfortable making changes. These future workstreams might also give us a window of opportunity to justify fixing long-standing issues.

Existing debt

DarcLib

The debt in DarcLib stems from strong coupling of seemingly unrelated things (e.g., git management and Build Asset Registry clients) and poor separation of concerns which makes it hard to re-use. The code also has low test coverage stemming from a poor testability of the large components (e.g., IRemote). Poor testability makes some simple fixes very hard to verify by unit tests (example issue when the whole team tried to fix PR description link during their FR and failed to finish on time).

Maestro

Together with some debt in DarcLib, non-exceptional exceptions make local dev debugging, reading logs and investigating staging/production failures difficult. Any unexpected problems in dependency flow needs someone from @dnceng (and even then, it’s mostly Matt M and Ricardo) to look at logs. We had automation that opened issues, but those weren’t customer friendly, and we should surface problems such that customers can self-serve. Furthermore, since DarcLib is not unit-testable/covered, small fixes / changes require running the service often. It is time consuming to sometimes fix simple things.

Maestro Scenario Tests

The tests are slow and unreliable in staging. This has caused the official pipeline to have a pass rate of 30% in the last 14 days. The Maestro scenario tests caused 90% of the failures, with all recent failures being explained by flakiness.

The tests also take a long time to run, as they execute sequentially and take about 80 minutes in total. This unreliability/flakiness also costs extra cycles whenever we release Maestro (even when we don’t really make updates) as it requires investigation into why the staging build is not green. Our monitoring has opened 80 issues since April regarding failed builds which needed to be investigated, only to determine that the root cause for most of them were the flaky tests. The investigation requires tribal knowledge of the service and the tests themselves which mandates specific people (Ricardo) to determine whether a release is safe. Maestro hasn’t seen a lot of changes other than small bug fixes and edge cases, but with the upcoming work we need to make sure these tests are still valuable.

Value proposition

Measuring success

Cost estimation