elastic / apm-server

https://www.elastic.co/guide/en/apm/guide/current/index.html

[RUM] traceId randomness is not guaranteed when page is crawled by googlebot #3922

Open vigneshshanmugam opened 4 years ago

vigneshshanmugam commented 4 years ago

There is not much we can do from the Agent side to guarantee the generation of random IDs, as

  1. The random generator utilities like Math.random and window.crypto.getRandomValues are overridden by the bots and will always return the same values (see the sketch after this list).
  2. Even relying on the current time via Date.now and Performance.now poses issues, as timestamps were clamped by 9-10 seconds during the last test on Google Search Console.

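For context, here is a minimal sketch of how an agent typically derives a trace ID from the browser RNG (the helper below is illustrative, not the RUM agent's actual internals). When a crawler stubs crypto.getRandomValues or Math.random with fixed output, every page load yields the same ID:

// Illustrative sketch only, not the agent's actual code: derive a 16-byte
// trace id from the browser RNG.
function generateTraceId() {
  const bytes = new Uint8Array(16);
  if (window.crypto && window.crypto.getRandomValues) {
    // Overridden by some crawlers to return a constant buffer.
    window.crypto.getRandomValues(bytes);
  } else {
    for (let i = 0; i < bytes.length; i++) {
      // Same problem if Math.random is stubbed with a fixed seed.
      bytes[i] = Math.floor(Math.random() * 256);
    }
  }
  return Array.from(bytes, b => b.toString(16).padStart(2, '0')).join('');
}
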
While discussing this issue, we came to a general agreement on identifying the transactions coming from bots and providing a way to discard these transactions if users find them less helpful. One idea was to add a field to the transaction schema to identify that it comes from a bot.

transaction.isBot = true
// or
transaction.origin = "bot"

APM UI will exclude these transactions by default to fix the performance issues in Service Maps. /cc @graphaelli

graphaelli commented 4 years ago

What is involved in the detection - is this something that can be done [optionally] in an ingest node processor?

vigneshshanmugam commented 4 years ago

What is involved in the detection - is this something that can be done [optionally] in an ingest node processor?

Not sure if the question was addressed to me. Just my thought process: can we do this as part of the user agent processor, since the bot detection will be done at that point by looking at the user agent?
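
Roughly something like the following, purely as a sketch (the pipeline id, field names, and the naive Googlebot check are illustrative assumptions, not how the server works today):

// Hypothetical sketch: tag RUM transactions from obvious crawlers in an
// ingest pipeline. Pipeline id, field names and the naive substring check
// are illustrative assumptions.
const { Client } = require('@elastic/elasticsearch');
const client = new Client({ node: 'http://localhost:9200' });

async function createBotTaggingPipeline() {
  await client.ingest.putPipeline({
    id: 'apm_rum_bot_detection',
    body: {
      description: 'Tag RUM transactions originating from known crawlers',
      processors: [
        // Parse the raw UA string into structured fields.
        { user_agent: { field: 'user_agent.original', ignore_missing: true } },
        {
          // Naive example rule; a real implementation would use a curated regex list.
          set: {
            if: "ctx.user_agent?.original != null && ctx.user_agent.original.toLowerCase().contains('googlebot')",
            field: 'transaction.origin',
            value: 'bot'
          }
        }
      ]
    }
  });
}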

axw commented 4 years ago

If we're going to force-disable sampling for bots, I think we'll need to do that in the RUM agent. Otherwise sampling decisions will propagate downstream. We could do that with tail-sampling, but the earlier the better.

axw commented 4 years ago

Expanding on the above a bit. It seems to me it's actively unhelpful to have bots start traces if they always use the same trace ID, as this propagates downstream. We would end up with multiple distributed traces with the same trace ID, which is wrong and will break many built-in assumptions.

@graphaelli made a good point that we probably do still want to capture traces on the backend, as it may be that bots are affecting system performance.

My suggestion is that we identify bots in the RUM agent, and disable instrumentation: don't trace, don't inject traceparent. Then we'll start the distributed trace on the backend. WDYT?
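
Roughly what I have in mind, as a sketch only (the bot regex is deliberately minimal, the service details are placeholders, and I'm assuming the agent's existing active config flag, or something like it, as the off switch):

// Rough sketch, not a final design: skip tracing entirely when the page is
// rendered by a known script-executing crawler. The regex is illustrative.
import { init as initApm } from '@elastic/apm-rum';

const BOT_UA = /googlebot|bingbot|headlesschrome|lighthouse/i; // illustrative list

const apm = initApm({
  serviceName: 'my-frontend',          // placeholder service details
  serverUrl: 'http://localhost:8200',
  // When the agent is inactive it records no transactions and injects no
  // traceparent headers, so backend services start their own traces.
  active: !BOT_UA.test(navigator.userAgent)
});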

vigneshshanmugam commented 4 years ago

Agreed on the distributed traces part, it's not so useful as the ID is the same all the time.

My suggestion is that we identify bots in the RUM agent, and disable instrumentation: don't trace, don't inject traceparent. Then we'll start the distributed trace on the backend. WDYT?

Taking a step back, the reason for moving the issue to the Server was threefold and bigger than just bot traffic in general.

  1. Maintaining a list of bot names and IP addresses in the RUM agent is not ideal and would increase the agent code size, and relying on a simple RegExp (/GoogleBot/) to exclude bots won't work in most cases as the UA can be spoofed easily. Bots list - https://udger.com/resources/ua-list/crawlers

  2. When RUM adds support for monitoring sessions, we will be left with the same issue and might end up receiving the same session over and over, so unique visitor counts cannot be calculated correctly.

  3. Easy onboarding experience for RUM - most client-side monitoring tools provide a way to exclude bot traffic at the destination endpoint and also provide ways to drop sessions originating from bots, ignored IP addresses, etc.

Having the solution on the server side provides ways in which the UI can be fine-tuned by adding an Inbound Filter or similar section that allows users to drop/use these events. It would also work for anomaly detection.

I brought up these points based on personal experience. Maybe @drewpost can also share some insights here.

axw commented 4 years ago

Maintaining a list of bot names and IP addresses in the RUM agent is not ideal and would increase the agent code size, and relying on a simple RegExp (/GoogleBot/) to exclude bots won't work in most cases as the UA can be spoofed easily. Bots list - https://udger.com/resources/ua-list/crawlers

OK, but how many of those execute scripts? I was only thinking of the specific issue you raised around googlebot having a fake or fixed seed random number generator.

Easy onboarding experience for RUM - most client-side monitoring tools provide a way to exclude bot traffic at the destination endpoint and also provide ways to drop sessions originating from bots, ignored IP addresses, etc.

Having the solution on the server side provides ways in which the UI can be fine-tuned by adding an Inbound Filter or similar section that allows users to drop/use these events.

Would you ever want bot traffic to show up in RUM? I presume not, since they're not Real Users.

I think I see where you're coming from now. However, I still think there needs to be some special handling in the agent for specific bots that execute scripts. Otherwise, as described above, the fake random number generator will cause invalid trace context to flow down the distributed trace. That's not something we can do efficiently in the server; once the trace context has left the agent, it becomes much more difficult, and will cause performance issues in the backend infrastructure (e.g. due to 100% sampling).

So I'll change my proposal a bit:

  1. Identify bots which execute scripts (which cause reused trace IDs) in the RUM agent, and disable instrumentation: don't trace, don't inject traceparent. Then we'll start the distributed trace on the backend.
  2. The server will do more general bot detection for RUM transactions, and mark those transactions in some way.

vigneshshanmugam commented 4 years ago

OK, but how many of those execute scripts? I was only thinking of the specific issue you raised around googlebot having a fake or fixed seed random number generator.

I certainly don't have answers for that one. I just posted it for reference that we cannot do much on the RUM side to eliminate bot traffic completely. Googlebot was just one of the bots we found in the sample data during the performance investigation done by the UI team.

Would you ever want bot traffic to show up in RUM? I presume not, since they're not Real Users.

It really depends. For identifying unique session/visitor counts, obviously not. But for measuring page impressions, it might be useful.

Identify bots which execute scripts (which cause reused trace IDs) in the RUM agent, and disable instrumentation: don't trace, don't inject traceparent. Then we'll start the distributed trace on the backend.

We could certainly disable the agent for known cases using a simple regex.

The server will do more general bot detection for RUM transactions, and mark those transactions in some way.

Not sure if we need to do this only for bots; maybe for crawlers in general (headless browsers included).

Thanks Andrew, I liked the proposal.

vigneshshanmugam commented 4 years ago

The server will do more general bot detection for RUM transactions, and mark those transactions in some way.

I just stumbled upon this project today, https://github.com/etienne-martin/device-detector-js, which has a list of bot regexes that we could use for detection in the server, https://raw.githubusercontent.com/etienne-martin/device-detector-js/master/fixtures/regexes/bots.json, for the first phase.
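
For illustration, a minimal sketch of how those fixtures could be consumed (assuming each entry carries regex and name fields as in the upstream matomo/device-detector format, and noting that some PCRE-style patterns may need adaptation for JS RegExp):

// Minimal sketch: match an incoming User-Agent against the bots.json fixtures.
// Assumes each entry has "regex" and "name" fields (matomo/device-detector
// format); some PCRE-style patterns may need tweaking for JS RegExp.
const fs = require('fs');

const botPatterns = JSON.parse(fs.readFileSync('bots.json', 'utf8'))
  .map(entry => ({ name: entry.name, regex: new RegExp(entry.regex, 'i') }));

function detectBot(userAgent) {
  const match = botPatterns.find(bot => bot.regex.test(userAgent));
  return match ? { isBot: true, name: match.name } : { isBot: false };
}

// e.g. detectBot('Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)')
// should report a bot, given the Googlebot entry in the fixtures.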

vigneshshanmugam commented 4 years ago

Closed by mistake. My bad.