Open soulgalore opened 5 years ago
Hello
I have same error (Fatal trace logic error - child cannot end after parent
). But only when i run Tracium
on CI server (Travis CI).
On my personal laptop it doesn't happen.
Our thinking was that if this has happened, all our assumptions about trace processing might not hold so it's better to alert the consumer than provide potentially false numbers.
Anything extra about the type of activity you're recording you can share?
Hi @patrickhulce Thanks for quick response.
I have next html file:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Estimo Template</title>
</head>
<body>
<script src="https://unpkg.com/react@16/umd/react.development.js"></script>
<h1>Estimo</h1>
</body>
</html>
And next code to capture Chrome trace:
...
await page.tracing.start({ path: traceFile })
try {
await page.goto(getUrlToHtmlFile(lib.html), { timeout: options.timeout })
} catch (err) {
handleSessionError(err, browser)
}
await page.tracing.stop()
...
Then i'm trying to parse it with Tracium
:
async function generateTasksReport(pathToTraceFile) {
const content = JSON.parse(await readFile(pathToTraceFile))
const tasks = tracium.computeMainThreadTasks(content, {
flatten: true,
})
return tasks
}
On my local machine it works perfectly. But when i run it on CI it throws an error:
/home/travis/build/mbalabash/estimo/node_modules/tracium/lib/main-thread-tasks.js:155
Rejected promise returned by test. Reason:
Error {
message: 'Fatal trace logic error - child cannot end after parent',
}
Function._computeRecursiveSelfTime (node_modules/tracium/lib/main-thread-tasks.js:155:13)
node_modules/tracium/lib/main-thread-tasks.js:159:37
Function._computeRecursiveSelfTime (node_modules/tracium/lib/main-thread-tasks.js:159:8)
node_modules/tracium/lib/main-thread-tasks.js:159:37
Function._computeRecursiveSelfTime (node_modules/tracium/lib/main-thread-tasks.js:159:8)
Function.getMainThreadTasks (node_modules/tracium/lib/main-thread-tasks.js:244:23)
Function.compute (node_modules/tracium/lib/main-thread-tasks.js:272:28)
Object.computeMainThreadTasks (node_modules/tracium/index.js:13:36)
This may be due to the fact that script
refers to remote url?
From my side I think it's content related somehow? It happens sometimes when I test cnet.com, I test the same way all the time, but sometimes the trace fails. Since it's CNET I was thinking there's one of the ads or something that is acting up somehow?
Hm interesting, thanks for the repro case @mbalabash.
@soulgalore are your failures also on a linux box or CI environment only?
@patrickhulce I could reproduce today on my Mac, else we are running in Docker on Linux (Ubuntu) where we also get it. I've also got reports from users that got the same error but on other pages, but they haven't shared their URLs. We used to use our own parsing lib but wanna move over to Tracium asap. And we got the same problem with events in the wrong order.
Had a go and tried to reproduce with puppeteer (I use Chromedriver normally). I didn't get the same error but the same behaviour that one out of three/four runs fails. The one failing are waiting a long time on some content (I'm running without any throttling) and the runs that are ok finish within a couple of seconds, but when it fails it takes 15-20s (maybe). And when it fails I get LHError: NO_TRACING_STARTED. So I think there's some content that somehow creates the problem?
const puppeteer = require('puppeteer');
const tracium = require('tracium');
const fs = require('fs');
const util = require('util');
const readFile = util.promisify(fs.readFile);
puppeteer.launch({ headless: false })
.then(async (browser) => {
const page = await browser.newPage();
await page.setViewport({ width: 1024, height: 800 });
await page.tracing.start({ path: 'trace.json' });
await page.goto('http://www.cnet.com', { waitUntil: 'load', timeout: 0 });
await page.tracing.stop();
browser.close();
const content = JSON.parse(await readFile('trace.json'));
const tasks = tracium.computeMainThreadTasks(content, {
flatten: true,
});
})
Thanks @soulgalore! NO_TRACING_STARTED is a different problem that's particularly prevalent in m74+. We've got a fix for it in LH master but it hasn't been ported over here yet.
I'll have to dig in to those repro cases you've provided when I've got time. I haven't seen those cases locally before, but I'll try with your settings :)
Thanks! @patrickhulce I have small update for my case.
This may be due to the fact that script refers to remote url?
No. I got same error when used local scripts.
Yeah, I understand that it's another error but I thought I saw the same kind of behaviour. I've seen this on another URL (https://www.aftonbladet.se), it's a Swedish newspaper with a lot of ads. Let me see if I can find a way to reproduce.
Hi @patrickhulce Do you have any updates or some ideas?
So far the only case where I've seen this happen where the trace isn't totally useless/non-sensical is when the ending trace event of the child that is B/E is missing and it was nested under an X.
Available options I see:
I tend to support option 4 in Lighthouse, but I see the value of 1/2 in the tracium case where other consumers might be ready to be more forgiving (maybe it's even a setting to pass in failLoudly: true
or something?)
failLoudly: true
If that's ok for you all that would be great. I've been testing Tracium in different scenarios the last days and have gotten others errors too. I wonder if it could be because of that I use Chromedriver and somehow some events is dropped? I also do other scenarios, like testing a SPA.
I've got cases where one event don't get a duration/selfTime so that the https://github.com/aslushnikov/tracium/blob/master/lib/main-thread-tasks.js#L258 check fails.
Also got traces where I get end event without a start as in https://github.com/aslushnikov/tracium/blob/master/lib/main-thread-tasks.js#L111
Also using the debug log in those places would help.
I wonder if it could be because of that I use Chromedriver and somehow some events is dropped? I also do other scenarios, like testing a SPA.
Interesting, yeah our trace processing up until tracium has always been cold load of page with raw protocol so it's possible these other scenarios increase the likelihood of these otherwise unlikely events from LH perspective.
@soulgalore if you could get us copies of any of these traces you're talking about I'd love to add some test cases and make sure we can handle them appropriately :)
I can collect a couple traces today/tomorrow. For SPA I'm injecting a nav start event for it to work. However I think the problem is on my side since I use the Chromedriver. Checking the docs it says "...will continue tracing until Chrome closes.". Without checking the code, I guess when we get the trace, the driver has not stopped the trace and it doesn't wait on Tracing.tracingComplete, so we probably don't get all the events. I'll try to switch to use CDP directly instead, that will minimise the risk (and work) for me :)
I made a quick POC and it works much better using CDP directly, sorry my fault, I'll try to push it to our test servers later today and then I will know for sure.
Ok been running these for a while and it works much better, so it was on my side! I've got one use case where Tracium sometimes breaks using a SPA. The use case is like this: Hit a menu item to open up a menu, wait for it to appear, start a trace, click on the new menu item and measure that request and load.
Sometimes (not always) I run into Fatal trace logic error - unexpected end event with an event that looks like this:
{ pid: 4034,
tid: 4034,
ts: 17974008372878,
ph: 'E',
cat: 'blink,devtools.timeline',
name: 'HitTest',
args:
{ endData:
{ x: 683,
y: 282,
rectilinear: true,
move: true,
nodeId: 95,
nodeName: 'CANVAS class=\'flot-overlay\'' } },
tts: 2163034 }
Maybe I can fix that on my side, not sure though. I can post the full trace log if it helps!
Thanks @soulgalore that's good to hear many of the issues sorted themselves out!
wait for it to appear, start a trace, click on the new menu item and measure that request and load.
Ah I can see this scenario if we start a trace in the middle of some event and we only get the end part of it which isn't the case in normal LH operations.
I think turning this into a flag to drop these invalid cases (maybe stick the events into an array so the user can fail on their own if there were lots of them?) seems like the best way to go.
Oh and a copy of that full trace would be awesome for test case prepping if you're willing :)
Sharing is caring: https://gist.github.com/soulgalore/6b5d13a485334afe009eab3866379e5f
Hello @patrickhulce @aslushnikov
I added docker support for reproducting Travis CI errors in estimo.
And it seems like strange chrome / puppeteer behavior.
Test results are unstable. Same test for same container may pass few times and fails few times with error: Fatal trace logic error - child cannot end after parent
I run puppeteer like this:
const browser = await puppeteer.launch({ headless, args: ['--no-sandbox', '--disable-setuid-sandbox'] })
All code for work with puppeteer here.
I'm still do stuff like this (to analyze only js parse/execution time):
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Estimo Template</title>
</head>
<body>
<h1>Estimo</h1>
<script src="https://unpkg.com/react@16/umd/react.development.js"></script>
</body>
</html>
And i got errors only in Docker / CI (on my host machine all works perfectly). Unstable tests confusing me a little. Do you have any ideas?
Thanks
Ah thanks so much for the docker repro @mbalabash! There's been movement on getting tracium publishable from within LH itself, so once that is done I'll work on getting a version published that can better handle these cases.
I pushed the 0.2.1 version that sorts out some of the errors here that relate to trace event sorting. Unfortunately, these don't help with the original cnet trace there - I'll have to look into it separately.
Thanks @aslushnikov! Right now I'm using a modified version of Tracium, where we don't throw the errors. I've been continuously running Tracium and got a couple of more errors (that happens sometimes, not all the time). I get them all on a single navigation on "normal" pages.
I can revert and run them on another server just to collect the trace log. But I wonder if you don't also get them in Lighthouse? We run the same trace config and since I skipping enabling trace using Chromedriver and instead using CDP I'm 99.9% sure the trace log that we collect is correct in the sense it's what we expect out of Chrome.
I've been digging more into these and there's a major root issue we've identified that Chrome's invariants for trace event ordering have drifted slightly from what was expected when the original code was written ~2 years ago. To be fair to our Chromium colleagues, there was a document that notes order could be different in the future, but it was difficult to find these bugs without any examples of how it could be different. Now that we have them it'll just take time to sort through :)
I'm working on some LH PRs that should hopefully help a lot with this class of error.
Hi @aslushnikov when I test cnet.com sometimes my trace logs is broken with the child cannot end after parent error. I've pushed an example trace here: https://github.com/soulgalore/tracium/blob/failing-cnet/examples/cnet.json
I think this started to happen 1-2 Chrome releases ago. I guess it should be fixed upstream but I wonder if Tracium should just swallow the error and continue?
Best Peter