KnapsackPro / knapsack_pro-ruby

Knapsack Pro gem splits tests across parallel CI nodes and makes sure that tests run in optimal time
https://knapsackpro.com
MIT License
132 stars 27 forks source link

Simplecov doesnt work with Knapsack Pro 7.4 and minitest #257

Closed rafaeelaudibert closed 4 months ago

rafaeelaudibert commented 5 months ago

Hey!

I've been trying to update our Knapsack pro gem from version 6.0 to 7.4 but Im having problems with our minitest tests. We use SimpleCov alongside KnapsackPro but the output files are not being generated. There's no coverage folder anywhere after running the tests.

This did work with Knapsack 6.0.

It's worth noticing that minitest is properly generating its XML output, and we're merging them as we did before. The only change is that simplecov is not generating output anymore. I've confirmed Simplecov is being properly initialized, and we have the following snippet in place - we had it in place with 6.0 as well.

KnapsackPro::Hooks::Queue.before_queue do |_queue_id|
  SimpleCov.command_name("api_ci_node_#{KnapsackPro::Config::Env.ci_node_index}")
end

Any idea on what could be causing this?

3v0k4 commented 5 months ago

Hey, @rafaeelaudibert. I'm looking into it and will get back to you soon.

3v0k4 commented 5 months ago

I'm sorry for the troubles, Rafael. Unfortunately, I cannot reproduce the issue.

I can see a line my terminal output that says:

Coverage report generated for minitest_ci_node_0 to /path/to/project/coverage. 140 / 146 LOC (95.89%) covered

and the coverage folder is created and populated with the report.

Here are a few things that you could check:

  1. Are there any other code changes besides updating the knapsack_pro version?
  2. Does the test suite run rm -rf coverage or anything similar as part of running the tests? Maybe in a hook?
  3. Do you see the Coverage report generated for ... line in your CI node terminal output?
  4. Does Simplecov generate the report when you call Minitest directly (without Knapsack Pro)?
  5. Does running Knapsack Pro with KNAPSACK_PRO_LOG_LEVEL=debug log anything meaningful?
  6. Could you try to identify which specific version of Knapsack Pro in between 6.0.0 and 7.4.0 breaks Simplecov for you?

If the above doesn't help, could you please share the following? Feel free to send it to support@knapsackpro.com if there's anything sensitive.

Or if it's easier, I'd be happy to debug a sample application if you can share a repo.

ArturT commented 5 months ago

@rafaeelaudibert Do you use other queue hooks than KnapsackPro::Hooks::Queue.before_queue?

The only change for Minitest was a fix in 6.0.4.

The 7.x release does not introduce changes to Minitest.

Could you answer Riccardo's questions? That would help us narrow down the root issue. Thank you.

connorshea commented 5 months ago

at_exit is what emits the coverage (https://github.com/simplecov-ruby/simplecov?tab=readme-ov-file#customizing-exit-behaviour), would that have anything to do with the problem, if it's being removed?

I'm having some problems getting simplecov to work on my setup, but I've only tried 7.5.1 and haven't gotten simplecov to work with Knapsack thus far. I tried downgrading to 6.0.3 to see if that changed anything, but that caused it to fail with a VCR/WebMock-related error. I can provide more information once I dig into the problem a bit further.

connorshea commented 5 months ago

Ok, downgrading to 6.0.3 (after I fixed the VCR issue) does cause it to start emitting coverage files again! So I think the change in 6.0.4 needs to be rethought. Unfortunately, now it looks like one of the jobs is getting stuck due to the issue that 6.0.4 intended to fix :s Nevermind about that part, it does end up finishing on 6.0.3. It just got stuck for an extra few minutes with the last job in the set. So this fully works in 6.0.3, although Minitest seems to get stuck for a bit after the end of queue mode for the final job.

connorshea commented 5 months ago

Out of curiosity, does it matter if simplecov is required before knapsack_pro?

EDIT: It does not

connorshea commented 5 months ago

Are there any other code changes besides updating the knapsack_pro version? Nope, 6.0.3 works and 6.0.4 doesn't.

Does the test suite run rm -rf coverage or anything similar as part of running the tests? Maybe in a hook? Nope

Do you see the Coverage report generated for ... line in your CI node terminal output? Nope

Does Simplecov generate the report when you call Minitest directly (without Knapsack Pro)? Yep

Does running Knapsack Pro with KNAPSACK_PRO_LOG_LEVEL=debug log anything meaningful? I will try this next, but I assume it won't. I'm not seeing anything meaningful in the logs here.

Could you try to identify which specific version of Knapsack Pro in between 6.0.0 and 7.4.0 breaks Simplecov for you? It is definitely 6.0.4 that breaks the coverage output. No files are emitted, presumably because SimpleCov.result.format! is not called.

connorshea commented 5 months ago

On 7.5.1, if I add this to my test_helper.rb file, I get coverage reports again:

KnapsackPro::Hooks::Queue.after_queue do
  SimpleCov.result.format!
end

I think this fully fixes both the coverage report not generating and the problem that was fixed by 6.0.4?

EDIT: Sometimes the tests seem to get stuck at the very end of the process. I'm not sure why.

We have 20 parallel jobs, 18 or 19 of them all finish within a few seconds of each other, and then sometimes the last 1 or 2 just get stuck. Other times they all pass fine at around the exact same time, and nothing gets stuck.

connorshea commented 5 months ago

The issue mentioned above seems to have something to do with the last job/jobs taking a lot longer than the other ones. Something about the logic for the queue mode seems to break or trigger a race condition when there's still one more job running tests while all the others have completed. At least, that seems to be the cause. And it's the only one I've come up with that really makes sense, given that the issue is inconsistent.

Unfortunately, I don't have much to offer beyond that.

ArturT commented 5 months ago

Thanks @connorshea for detailed research. This is helpful.

On 7.5.1, if I add this to my test_helper.rb file, I get coverage reports again:

KnapsackPro::Hooks::Queue.after_queue do
  SimpleCov.result.format!
end

I think this fully fixes both the coverage report not generating and the problem that was fixed by 6.0.4?

It seems reasonable. Good findings.

EDIT: Sometimes the tests seem to get stuck at the very end of the process. I'm not sure why.

We have 20 parallel jobs, 18 or 19 of them all finish within a few seconds of each other, and then sometimes the last 1 or 2 just get stuck. Other times they all pass fine at around the exact same time, and nothing gets stuck.

You can debug the free with the following instructions: https://docs.knapsackpro.com/ruby/troubleshooting/#knapsack-pro-hangs

Add the following script to test_helper.rb. Thanks to that, you will see the backtrace showing what code line causes the hanging problem. Backtrace is visible only if the Ruby process got kill with USR1 signal. Learn more from this GitHub issue.

# source https://github.com/rspec/rspec-rails/issues/1353#issuecomment-93173691
puts "Process pid: #{Process.pid}"

trap 'USR1' do
  threads = Thread.list

  puts
  puts "=" * 80
  puts "Received USR1 signal; printing all #{threads.count} thread backtraces."

  threads.each do |thr|
    description = thr == Thread.main ? "Main thread" : thr.inspect
    puts
    puts "#{description} backtrace: "
    puts thr.backtrace.join("\n")
  end

  puts "=" * 80
end

If the CI provider does not kill the frozen process, you can do it by running the below command in the terminal (when you are ssh-ed into the CI node):

kill -USR1 <process pid>

Alternatively, you can use the timeout program to send the USR1 signal after Knapsack runs too long.

timeout --signal=USR1 600 bundle exec rake "knapsack_pro:queue:rspec[--format d]"

Notice that 600 is in seconds. It sends the signal after 10 minutes of running the command. You may want to adjust that number to ensure the USR1 signal is sent after the process is stuck and not before.

ArturT commented 5 months ago

@rafaeelaudibert Does adding the following help in your case?

KnapsackPro::Hooks::Queue.after_queue do
  SimpleCov.result.format!
end
ArturT commented 5 months ago

I've updated the changelog for the 6.0.4 version to cover breaking changes affecting simplecov users. https://github.com/KnapsackPro/knapsack_pro-ruby/blob/master/CHANGELOG.md#604

I also updated our simplecov docs: https://docs.knapsackpro.com/ruby/simplecov/

and CodeClimate + simplecov docs: https://docs.knapsackpro.com/ruby/code-climate/

ArturT commented 5 months ago

@connorshea I think the issue with hanging tests is unrelated to the original simplecov issue.

We could keep working on resolving hanging tests in the private support email thread that's been going on.

I propose we close this GitHub issue as the original problem is solved.

connorshea commented 5 months ago

I've updated the changelog for the 6.0.4 version to cover breaking changes affecting simplecov users. https://github.com/KnapsackPro/knapsack_pro-ruby/blob/update-changelog-simplecov/CHANGELOG.md

I also updated our simplecov docs: https://docs.knapsackpro.com/ruby/simplecov/

and CodeClimate + simplecov docs: https://docs.knapsackpro.com/ruby/code-climate/

That first link is broken now since the branch was merged

@connorshea I think the issue with hanging tests is unrelated to the original simplecov issue.

We could keep working on resolving hanging tests in the private support email thread that's been going on.

I propose we close this GitHub issue as the original problem is solved.

That's fair! I'm fine with us closing this one unless Rafael needs any further help.

connorshea commented 5 months ago

I'll try the debugging steps you mentioned a bit later, I think switching to Regular Mode and forgoing code coverage has solved the issue we were having for now, regardless.

ArturT commented 4 months ago

That first link is broken now since the branch was merged

@connorshea Good catch!

I've updated my comment and the link.