aslushnikov / tracium

A blessed Chrome Trace parser.
Apache License 2.0
72 stars 7 forks source link

How to best handle "child cannot end after parent" errors? #8

Open soulgalore opened 5 years ago

soulgalore commented 5 years ago

Hi @aslushnikov when I test cnet.com sometimes my trace logs is broken with the child cannot end after parent error. I've pushed an example trace here: https://github.com/soulgalore/tracium/blob/failing-cnet/examples/cnet.json

I think this started to happen 1-2 Chrome releases ago. I guess it should be fixed upstream but I wonder if Tracium should just swallow the error and continue?

Best Peter

mbalabash commented 5 years ago

Hello I have same error (Fatal trace logic error - child cannot end after parent). But only when i run Tracium on CI server (Travis CI). On my personal laptop it doesn't happen.

patrickhulce commented 5 years ago

Our thinking was that if this has happened, all our assumptions about trace processing might not hold so it's better to alert the consumer than provide potentially false numbers.

Anything extra about the type of activity you're recording you can share?

mbalabash commented 5 years ago

Hi @patrickhulce Thanks for quick response.

I have next html file:

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Estimo Template</title>
  </head>
  <body>
    <script src="https://unpkg.com/react@16/umd/react.development.js"></script>
    <h1>Estimo</h1>
  </body>
</html>

And next code to capture Chrome trace:

...
await page.tracing.start({ path: traceFile })
try {
  await page.goto(getUrlToHtmlFile(lib.html), { timeout: options.timeout })
} catch (err) {
  handleSessionError(err, browser)
}
await page.tracing.stop()
...

Then i'm trying to parse it with Tracium:

async function generateTasksReport(pathToTraceFile) {
  const content = JSON.parse(await readFile(pathToTraceFile))
  const tasks = tracium.computeMainThreadTasks(content, {
    flatten: true,
  })
  return tasks
}

On my local machine it works perfectly. But when i run it on CI it throws an error:

  /home/travis/build/mbalabash/estimo/node_modules/tracium/lib/main-thread-tasks.js:155
  Rejected promise returned by test. Reason:
  Error {
    message: 'Fatal trace logic error - child cannot end after parent',
  }
  Function._computeRecursiveSelfTime (node_modules/tracium/lib/main-thread-tasks.js:155:13)
  node_modules/tracium/lib/main-thread-tasks.js:159:37
  Function._computeRecursiveSelfTime (node_modules/tracium/lib/main-thread-tasks.js:159:8)
  node_modules/tracium/lib/main-thread-tasks.js:159:37
  Function._computeRecursiveSelfTime (node_modules/tracium/lib/main-thread-tasks.js:159:8)
  Function.getMainThreadTasks (node_modules/tracium/lib/main-thread-tasks.js:244:23)
  Function.compute (node_modules/tracium/lib/main-thread-tasks.js:272:28)
  Object.computeMainThreadTasks (node_modules/tracium/index.js:13:36)

This may be due to the fact that script refers to remote url?

soulgalore commented 5 years ago

From my side I think it's content related somehow? It happens sometimes when I test cnet.com, I test the same way all the time, but sometimes the trace fails. Since it's CNET I was thinking there's one of the ads or something that is acting up somehow?

patrickhulce commented 5 years ago

Hm interesting, thanks for the repro case @mbalabash.

@soulgalore are your failures also on a linux box or CI environment only?

soulgalore commented 5 years ago

@patrickhulce I could reproduce today on my Mac, else we are running in Docker on Linux (Ubuntu) where we also get it. I've also got reports from users that got the same error but on other pages, but they haven't shared their URLs. We used to use our own parsing lib but wanna move over to Tracium asap. And we got the same problem with events in the wrong order.

soulgalore commented 5 years ago

Had a go and tried to reproduce with puppeteer (I use Chromedriver normally). I didn't get the same error but the same behaviour that one out of three/four runs fails. The one failing are waiting a long time on some content (I'm running without any throttling) and the runs that are ok finish within a couple of seconds, but when it fails it takes 15-20s (maybe). And when it fails I get LHError: NO_TRACING_STARTED. So I think there's some content that somehow creates the problem?

const puppeteer = require('puppeteer');
const tracium = require('tracium');
const fs = require('fs');
const util = require('util');
const readFile = util.promisify(fs.readFile);

puppeteer.launch({ headless: false })
    .then(async (browser) => {
            const page = await browser.newPage();
            await page.setViewport({ width: 1024, height: 800 });
            await page.tracing.start({ path: 'trace.json' });
            await page.goto('http://www.cnet.com', { waitUntil: 'load', timeout: 0 });
            await page.tracing.stop();
            browser.close();
            const content = JSON.parse(await readFile('trace.json'));
            const tasks = tracium.computeMainThreadTasks(content, {
                    flatten: true,
             });
})
patrickhulce commented 5 years ago

Thanks @soulgalore! NO_TRACING_STARTED is a different problem that's particularly prevalent in m74+. We've got a fix for it in LH master but it hasn't been ported over here yet.

I'll have to dig in to those repro cases you've provided when I've got time. I haven't seen those cases locally before, but I'll try with your settings :)

mbalabash commented 5 years ago

Thanks! @patrickhulce I have small update for my case.

This may be due to the fact that script refers to remote url?

No. I got same error when used local scripts.

soulgalore commented 5 years ago

Yeah, I understand that it's another error but I thought I saw the same kind of behaviour. I've seen this on another URL (https://www.aftonbladet.se), it's a Swedish newspaper with a lot of ads. Let me see if I can find a way to reproduce.

mbalabash commented 5 years ago

Hi @patrickhulce Do you have any updates or some ideas?

patrickhulce commented 5 years ago

So far the only case where I've seen this happen where the trace isn't totally useless/non-sensical is when the ending trace event of the child that is B/E is missing and it was nested under an X.

Available options I see:

  1. Anytime we are missing an end event for a B/E pair, we just end it at the last logical place that would've made sense (the end of the parent for example). This is a bit aggressive and could lead to some misleading conclusions, but might be good enough.
  2. We omit any toplevel tasks (and all their children) that contained this nonsensical scenario.
  3. We only support the above case if it's the last event in the trace.
  4. We continue to fatally throw for a case of invalid data.

I tend to support option 4 in Lighthouse, but I see the value of 1/2 in the tracium case where other consumers might be ready to be more forgiving (maybe it's even a setting to pass in failLoudly: true or something?)

soulgalore commented 5 years ago

failLoudly: true

If that's ok for you all that would be great. I've been testing Tracium in different scenarios the last days and have gotten others errors too. I wonder if it could be because of that I use Chromedriver and somehow some events is dropped? I also do other scenarios, like testing a SPA.

I've got cases where one event don't get a duration/selfTime so that the https://github.com/aslushnikov/tracium/blob/master/lib/main-thread-tasks.js#L258 check fails.

Also got traces where I get end event without a start as in https://github.com/aslushnikov/tracium/blob/master/lib/main-thread-tasks.js#L111

Also using the debug log in those places would help.

patrickhulce commented 5 years ago

I wonder if it could be because of that I use Chromedriver and somehow some events is dropped? I also do other scenarios, like testing a SPA.

Interesting, yeah our trace processing up until tracium has always been cold load of page with raw protocol so it's possible these other scenarios increase the likelihood of these otherwise unlikely events from LH perspective.

@soulgalore if you could get us copies of any of these traces you're talking about I'd love to add some test cases and make sure we can handle them appropriately :)

soulgalore commented 5 years ago

I can collect a couple traces today/tomorrow. For SPA I'm injecting a nav start event for it to work. However I think the problem is on my side since I use the Chromedriver. Checking the docs it says "...will continue tracing until Chrome closes.". Without checking the code, I guess when we get the trace, the driver has not stopped the trace and it doesn't wait on Tracing.tracingComplete, so we probably don't get all the events. I'll try to switch to use CDP directly instead, that will minimise the risk (and work) for me :)

soulgalore commented 5 years ago

I made a quick POC and it works much better using CDP directly, sorry my fault, I'll try to push it to our test servers later today and then I will know for sure.

soulgalore commented 5 years ago

Ok been running these for a while and it works much better, so it was on my side! I've got one use case where Tracium sometimes breaks using a SPA. The use case is like this: Hit a menu item to open up a menu, wait for it to appear, start a trace, click on the new menu item and measure that request and load.

Sometimes (not always) I run into Fatal trace logic error - unexpected end event with an event that looks like this:

{ pid: 4034,
  tid: 4034,
  ts: 17974008372878,
  ph: 'E',
  cat: 'blink,devtools.timeline',
  name: 'HitTest',
  args:
   { endData:
      { x: 683,
        y: 282,
        rectilinear: true,
        move: true,
        nodeId: 95,
        nodeName: 'CANVAS class=\'flot-overlay\'' } },
  tts: 2163034 }

Maybe I can fix that on my side, not sure though. I can post the full trace log if it helps!

patrickhulce commented 5 years ago

Thanks @soulgalore that's good to hear many of the issues sorted themselves out!

wait for it to appear, start a trace, click on the new menu item and measure that request and load.

Ah I can see this scenario if we start a trace in the middle of some event and we only get the end part of it which isn't the case in normal LH operations.

I think turning this into a flag to drop these invalid cases (maybe stick the events into an array so the user can fail on their own if there were lots of them?) seems like the best way to go.

patrickhulce commented 5 years ago

Oh and a copy of that full trace would be awesome for test case prepping if you're willing :)

soulgalore commented 5 years ago

Sharing is caring: https://gist.github.com/soulgalore/6b5d13a485334afe009eab3866379e5f

mbalabash commented 5 years ago

Hello @patrickhulce @aslushnikov

I added docker support for reproducting Travis CI errors in estimo.

And it seems like strange chrome / puppeteer behavior. Test results are unstable. Same test for same container may pass few times and fails few times with error: Fatal trace logic error - child cannot end after parent

I run puppeteer like this:

const browser = await puppeteer.launch({ headless, args: ['--no-sandbox', '--disable-setuid-sandbox'] })

All code for work with puppeteer here.

Docker config

I'm still do stuff like this (to analyze only js parse/execution time):

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Estimo Template</title>
  </head>
  <body>
    <h1>Estimo</h1>
    <script src="https://unpkg.com/react@16/umd/react.development.js"></script>
  </body>
</html>

And i got errors only in Docker / CI (on my host machine all works perfectly). Unstable tests confusing me a little. Do you have any ideas?

Thanks

patrickhulce commented 5 years ago

Ah thanks so much for the docker repro @mbalabash! There's been movement on getting tracium publishable from within LH itself, so once that is done I'll work on getting a version published that can better handle these cases.

aslushnikov commented 5 years ago

I pushed the 0.2.1 version that sorts out some of the errors here that relate to trace event sorting. Unfortunately, these don't help with the original cnet trace there - I'll have to look into it separately.

soulgalore commented 5 years ago

Thanks @aslushnikov! Right now I'm using a modified version of Tracium, where we don't throw the errors. I've been continuously running Tracium and got a couple of more errors (that happens sometimes, not all the time). I get them all on a single navigation on "normal" pages.

I can revert and run them on another server just to collect the trace log. But I wonder if you don't also get them in Lighthouse? We run the same trace config and since I skipping enabling trace using Chromedriver and instead using CDP I'm 99.9% sure the trace log that we collect is correct in the sense it's what we expect out of Chrome.

patrickhulce commented 5 years ago

I've been digging more into these and there's a major root issue we've identified that Chrome's invariants for trace event ordering have drifted slightly from what was expected when the original code was written ~2 years ago. To be fair to our Chromium colleagues, there was a document that notes order could be different in the future, but it was difficult to find these bugs without any examples of how it could be different. Now that we have them it'll just take time to sort through :)

I'm working on some LH PRs that should hopefully help a lot with this class of error.