fforres commented 2 years ago

Hey folks, with the shipping of https://github.com/DataDog/browser-sdk/pull/1601 the RUM SDK got a ton better for maintaining good observability and reliability in federated environments 🙏 I very much appreciate the work there.

Regarding that, I do have a question/use-case I want to run by you folks.

The context

While I've seen the same situation in some other big projects, I'm mostly biased by my company's case. For a bit of context: we are using React, ATM a single codebase, an SPA, 200+ contributors, and we are working towards a better federated story for it.

For the context of this thread, a "feature" can be considered an isolated component, providing some type of user experience, owned by a single team.

`browser-rum` is a great way to gather real user data on a per-page basis and afterwards analyze users' behaviour in an application.

First case

Let's say a specific page is composed of "2 features", and those features are owned by 2 different teams.

If we want to track actions/errors/resources/etc but visualize them in the context of those features, things get a bit more complex. browser-rum does expose a set of building blocks that allows us, with some data massaging in datadogRum.init's beforeSend hook, to enrich the context of events that were triggered/executed on specific features.

In our case, I'm dynamically creating react contexts, and sending component information down the pipeline, in order to attach some "feature context" to events in the beforeSend hook.

And while the solution "works" in the sense that we are able to visualize RUM events with contextual information attached, it does not allow for good datadog integrations.
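A minimal sketch of the kind of beforeSend enrichment described above, for readers who want something concrete. The `RumEventLike` and `FeatureInfo` shapes, the `FeatureResolver` type, and `makeBeforeSend` are all hypothetical names for illustration (the real event types come from the SDK), and how the resolver finds the owning feature, for example through React contexts, is left open.

```typescript
// Hypothetical, simplified event shape; the real types ship with the SDK.
interface RumEventLike {
  type: string;
  context?: Record<string, unknown>;
}

// Hypothetical description of a feature and its owning team.
interface FeatureInfo {
  name: string;
  team: string;
}

// Hypothetical lookup: returns the feature that produced the event, if known.
type FeatureResolver = (event: RumEventLike) => FeatureInfo | undefined;

// Builds a beforeSend-style callback that copies feature ownership into
// event.context. It returns true so the event is kept, never discarded.
function makeBeforeSend(resolve: FeatureResolver) {
  return (event: RumEventLike): boolean => {
    const feature = resolve(event);
    if (feature) {
      // Preserve whatever context is already attached to the event.
      event.context = { ...event.context, feature };
    }
    return true;
  };
}
```

The returned callback would be passed as the `beforeSend` option of `datadogRum.init`. Events the resolver cannot attribute are sent unchanged.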

Second Case

Another example is our "navigation header". There's a lot that goes on in ours: queries, contexts, a/b tests, feature flags, tracking, etc. The header itself is shown on almost all of our pages, so, for example, any error in it will show up on multiple pages. And while we are using the same React-Context abstractions we created to inject contextual information, it is very hard for the responsible frontend engineers to properly leverage DataDog (especially compared to how much backend engineers can leverage it).

The final part 😅

So the situation is that more and more, especially for teams on a federated frontend architecture, the concept of "the url" is less useful for consolidating application information.

Lately I've gotten questions thrown my way like:

- Can we leverage services for showing our errors?
- How can I set up SLOs for my frontend features?
- My feature is used in 2 apps (think 2 different subdomains), how can I consolidate their information?

The actual questions

So, not to be the buzz killer with the PR enabling service version update. It's awesome and we'll look into leveraging it right away :p

To the actual questions

- First of all... is our approach wrong? Should it be different/better?
- Is there a preferred way to map events to a specific service that does not depend on a view?
- Are there suggestions/recommendations on how to structure reliability/observability for federated frontend applications?
  - Like a "datadog seal of approval for X or Y approach", or "Do's and Dont's of RUM + federation"
  - Will you folks consider that? 🙏
- Have you folks considered framework-specific implementations that could make this easier for folks?
- And finally... Is your roadmap public? 😅 Curious about what you folks are planning next

fforres commented 2 years ago

I know it's a lot of questions and text 😅 Hoping this could at least serve as some sort of docs.

In any case, my thinking initially was that updating the service in the beforeSend's event argument would be enough. (It is marked as readonly though 😅 https://github.com/DataDog/browser-sdk/blob/7ab238958ad748e145b936cbb24a2d034ae046ba/packages/rum-core/src/rumEvent.types.ts#L685-L688)

But I might be missing something.

Thanks for the time and appreciate the work 🙏

Re-thinking a bit about this. Do you folks mean "service" as another "RUM application"? Then maybe my assumptions about the approach are wrong.

Either way... the issue of the url being a not-that-useful abstraction for information consolidation might remain 😅

amortemousque commented 2 years ago

Hello @fforres

Thanks for your feedback!

#1601 is here to help visualise RUM events happening in different areas of ownership, based on views.

We know that this approach has some limitations and does not allow isolating events based on components. This is something we have in mind, but it comes with some tough technical challenges. One of them is being able to link resources, actions, errors and long tasks to a component.

Can you say more about how you create and link a "feature context" to events using beforeSend?

About your other questions:

First of all.. is our approach wrong? Should it be different/better? Is there a preferred way to map events to a specific service, that do not depend on a view?

Since we have no official support for isolating events based on component, using beforeSend to enrich events is a valid option.

Are there suggestions/recommendations on how to structure reliability/observability for federated frontend applications? Like a “datadog seal of approval for X or Y approach”, or “Do’s and Dont’s of RUM + federation”

Documentation is on its way.

Have you folks considered framework-specific implementations that could make this easier for folks?

This is something we have in mind, but it is not yet on the roadmap.

I hope this will answer most of your questions 🙂

felipetoffolo-toast commented 2 years ago

I have a similar issue, and I tried to override the service in beforeSend too. That's what brought me here.

That approach could work for us since, based on the file where the error occurred, we would be able to identify the "service".
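As a rough illustration of identifying the "service" from the file where an error occurred, here is a sketch that matches stack frames against path fragments. `OwnershipRule`, `resolveServiceFromStack`, and the example paths are hypothetical, and it assumes the stack trace is available as a plain string with one frame per line.

```typescript
// Hypothetical rule: a stack frame whose text contains `pathFragment`
// is attributed to `service`.
interface OwnershipRule {
  pathFragment: string;
  service: string;
}

// Returns the service owning the topmost stack frame that matches a rule,
// or undefined when the stack is missing or no frame matches.
function resolveServiceFromStack(
  stack: string | undefined,
  rules: OwnershipRule[],
): string | undefined {
  if (!stack) return undefined;
  for (const line of stack.split("\n")) {
    const rule = rules.find((r) => line.includes(r.pathFragment));
    if (rule) return rule.service;
  }
  return undefined;
}
```

Walking the frames top-down attributes the error to the code that threw it rather than to the shell that called into it. Whether that is the right owner is a judgment call for each setup.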

felipetoffolo-toast commented 2 years ago

Another possibility would be to be able to set the service when sending an error manually. In that way, I could filter the error inside of beforeSend, and send it manually with the new service.

BenoitZugmeyer commented 2 years ago

Hey @fforres and @felipetoffolo-toast,

We have mixed feelings about letting users edit the service, because this value is used a bit more broadly than users might expect, and editing it at the wrong time can potentially break the integration with other Datadog features.

In order to better understand your needs, we'd like a bit more information on your use case.

- Are your "services" running on the same page completely different codebases? Or are they built and deployed at the same time (sharing the same version, maybe the same JS bundle)?
- When using the event context attribute as a workaround, you state that “it does not allow for good datadog integrations.” Do you have specific examples in mind?

Do you folks mean "service" as another "RUM application" ?

Almost, yes. It's a bit more granular than that, as a RUM application can have multiple services. Services should have different codebases. If what you are trying to achieve is only applying team ownership to components within the same codebase, "service" is not the right tool, and the event context should be better suited.

felipetoffolo-toast commented 2 years ago

Hey @BenoitZugmeyer

In our case, we have different "services" on the same page. We have a micro frontend structure. So a single page can have multiple codebases in there, with independent builds and deploys.

Trying to add a service in the context did not work for me; I was not able to filter errors in the dashboards that way.

My current approach as a test is to actually capture the error event myself, execute startView with the correct service name and then addError. That's why I mentioned that being able to override the service when calling addError would be helpful until we get a better solution.
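A sketch of that test approach, written against a small stand-in interface so it does not depend on the SDK itself. `RumClientLike` and `reportErrorForService` are hypothetical names; the two methods mirror `startView` (with the `service` option introduced by the PR linked at the top of this thread) and `addError`. Keep in mind that starting a view ends the current one, so this changes how the session is split into views.

```typescript
// Stand-in for the subset of the RUM client used here (assumed shape).
interface RumClientLike {
  startView(options: { name: string; service?: string; version?: string }): void;
  addError(error: unknown, context?: object): void;
}

// Hypothetical helper: starts a view attributed to the owning service,
// then reports the error so that it lands in that view.
function reportErrorForService(
  rum: RumClientLike,
  error: unknown,
  target: { service: string; viewName: string },
): void {
  rum.startView({ name: target.viewName, service: target.service });
  rum.addError(error);
}
```

In an application, `datadogRum` would be passed as the first argument, assuming the installed SDK version supports the `service` option on `startView`.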

If you want to better understand our use case, we use https://single-spa.js.org/

DataDog / browser-sdk

Thoughts on allowing service override at beforeSend #1615
