dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Microsoft.ML.TorchSharp.Tests.QATests.TestSimpleQA followed by process killed / return 137 #6978

Open ericstj opened 5 months ago

ericstj commented 5 months ago

Build Information

Build: https://dev.azure.com/dnceng-public/public/_build/results?buildId=530980&view=results
Build error leg or test failing: Microsoft.ML.TorchSharp.Tests Work Item
Pull Request: https://github.com/dotnet/machinelearning/pull/6976

Error Message

Fill the error message using step by step known issues guidance.

{
  "ErrorMessage": [ "Starting test: Microsoft.ML.TorchSharp.Tests.QATests.TestSimpleQA", "+ export _commandExitCode=137" ],
  "ErrorPattern": "",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}


Describe the bug

This test is failing in CI somewhat regularly. The error pattern looks like the following:

Starting test: Microsoft.ML.TorchSharp.Tests.QATests.TestSimpleQA
Killed
+ export _commandExitCode=137

Here are a few instances:

https://helixre107v0xd1eu3ibi6ka.blob.core.windows.net/dotnet-machinelearning-refs-pull-6974-merge-f61a125156aa4af1bd/Microsoft.ML.TorchSharp.Tests/1/console.83a6fa6c.log?helixlogtype=result
https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-machinelearning-refs-pull-6976-merge-0a13c2cd41724c3483/Microsoft.ML.TorchSharp.Tests/1/console.ff57f777.log?helixlogtype=result

I can't currently capture this failure in a known issue because there is no unique line logged. I've seen this failure numerous times - always when TestSimpleQA is running.
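For readers unfamiliar with the exit code: POSIX shells conventionally report 128 + N when a process is ended by signal N, so 137 corresponds to signal 9 (SIGKILL). That is consistent with, though not proof of, the Linux out-of-memory killer ending the test process. A minimal sketch of the decoding follows; the helper name `describe_exit_code` is hypothetical and not part of the test infrastructure.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret a shell-style exit code (128 + N means 'ended by signal N')."""
    if code > 128:
        signum = code - 128
        try:
            # Resolves to a name such as SIGKILL on platforms that define it.
            name = signal.Signals(signum).name
        except ValueError:
            # Platform does not define this signal number (e.g. Windows).
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"


if __name__ == "__main__":
    print(describe_exit_code(137))
```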

Report

| Build | Definition | Test | Pull Request |
| --- | --- | --- | --- |
| 712523 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | dotnet/machinelearning#7179 |
| 702472 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | dotnet/machinelearning#7165 |

Summary

| 24-Hour Hit Count | 7-Day Hit Count | 1-Month Count |
| --- | --- | --- |
| 0 | 0 | 2 |

Known issue validation

Build: :mag_right:
Result validation: :warning: Build internal information not found. This may happen if your build is too old. Please use a build that is no older than two weeks. If the problem persists, contact .NET Engineering Services Team and share this issue.
Validation performed at: 2/14/2024 10:25:46 PM UTC

ericstj commented 5 months ago

@michaelgsharp made a good observation offline - we're seeing memory usage go up quite a bit as the tests progress.

Finished test: Microsoft.ML.TorchSharp.Tests.TextClassificationTests.TestSentenceSimilarity with memory usage 2,077,020,160.00 and max memory usage 2,370,473,984.00

That's roughly 2 GB of memory still in use after the previous test (TestSentenceSimilarity) completed.

ericstj commented 5 months ago

Wow - the memory usage of this test is very high. Here's what I see from a local passing run on Windows.

  Discovering: Microsoft.ML.TorchSharp.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  Microsoft.ML.TorchSharp.Tests (found 12 test cases)
  Starting:    Microsoft.ML.TorchSharp.Tests (parallel test collections = on [20 threads], stop on fail = off)
Starting test: Microsoft.ML.TorchSharp.Tests.NerTests.TestSimpleNer
Finished test: Microsoft.ML.TorchSharp.Tests.NerTests.TestSimpleNer with memory usage 751,607,808.00 and max memory usage 751,607,808.00
Starting test: Microsoft.ML.TorchSharp.Tests.NerTests.TestSimpleNerOptions
    Microsoft.ML.TorchSharp.Tests.NerTests.TestNERLargeFileGpu [SKIP]
      Needs to be on a comp with GPU or will take a LONG time.
Finished test: Microsoft.ML.TorchSharp.Tests.NerTests.TestSimpleNerOptions with memory usage 895,778,816.00 and max memory usage 895,778,816.00
Starting test: Microsoft.ML.TorchSharp.Tests.ObjectDetectionTests.SimpleObjDetectionTest
total : 171, filtered: 0, filter ratio: 0.00%
Finished test: Microsoft.ML.TorchSharp.Tests.ObjectDetectionTests.SimpleObjDetectionTest with memory usage 1,142,628,352.00 and max memory usage 1,155,977,216.00
Starting test: Microsoft.ML.TorchSharp.Tests.TextClassificationTests.TestSingleSentence3Classes
Finished test: Microsoft.ML.TorchSharp.Tests.TextClassificationTests.TestSingleSentence3Classes with memory usage 1,111,171,072.00 and max memory usage 1,155,977,216.00
Starting test: Microsoft.ML.TorchSharp.Tests.TextClassificationTests.TestDoubleSentence2Classes
Finished test: Microsoft.ML.TorchSharp.Tests.TextClassificationTests.TestDoubleSentence2Classes with memory usage 1,352,704,000.00 and max memory usage 1,352,818,688.00
Starting test: Microsoft.ML.TorchSharp.Tests.TextClassificationTests.TestSingleSentence2Classes
Finished test: Microsoft.ML.TorchSharp.Tests.TextClassificationTests.TestSingleSentence2Classes with memory usage 1,365,450,752.00 and max memory usage 1,366,872,064.00
Starting test: Microsoft.ML.TorchSharp.Tests.TextClassificationTests.TestSentenceSimilarity
Finished test: Microsoft.ML.TorchSharp.Tests.TextClassificationTests.TestSentenceSimilarity with memory usage 1,362,817,024.00 and max memory usage 1,368,600,576.00
    Microsoft.ML.TorchSharp.Tests.TextClassificationTests.TestSentenceSimilarityLargeFileGpu [SKIP]
      Needs to be on a comp with GPU or will take a LONG time.
    Microsoft.ML.TorchSharp.Tests.TextClassificationTests.TestTextClassificationWithBigDataOnGpu [SKIP]
      Condition(s) not met: "EnableRunningGpuTest"
Starting test: Microsoft.ML.TorchSharp.Tests.QATests.TestSimpleQA
Finished test: Microsoft.ML.TorchSharp.Tests.QATests.TestSimpleQA with memory usage 4,675,801,088.00 and max memory usage 5,540,958,208.00
    Microsoft.ML.TorchSharp.Tests.QATests.TestQALargeFileGpu [SKIP]
      Needs to be on a comp with GPU or will take a LONG time.
  Finished:    Microsoft.ML.TorchSharp.Tests

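So we may have a leak (memory in use still grows from test to test, from about 0.75 GB after TestSimpleNer to about 1.36 GB after TestSentenceSimilarity), but TestSimpleQA itself also uses a ton of memory: about 4.7 GB when it finishes, with a peak of about 5.5 GB.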
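To make the growth easier to see across runs, the "Finished test" lines can be parsed and compared test by test. The following is a rough sketch, not part of the repository: the function names `parse_memory_lines` and `growth_report` are hypothetical, and the regular expression assumes the exact log format shown above, with values in bytes.

```python
import re

# Matches lines such as:
# Finished test: <name> with memory usage 1,362,817,024.00 and max memory usage 1,368,600,576.00
LINE_RE = re.compile(
    r"Finished test: (?P<name>\S+) with memory usage (?P<used>[\d,]+)\.\d+ "
    r"and max memory usage (?P<peak>[\d,]+)\.\d+"
)


def parse_memory_lines(log_text):
    """Return (test name, bytes in use, peak bytes) for each 'Finished test' line."""
    rows = []
    for match in LINE_RE.finditer(log_text):
        used = int(match["used"].replace(",", ""))
        peak = int(match["peak"].replace(",", ""))
        rows.append((match["name"], used, peak))
    return rows


def growth_report(rows):
    """Pair each test with the change in memory in use since the previous test."""
    report = []
    previous = None
    for name, used, _peak in rows:
        delta = None if previous is None else used - previous
        report.append((name.rsplit(".", 1)[-1], used, delta))
        previous = used
    return report


# Two lines copied from the local Windows run quoted above.
SAMPLE = """\
Finished test: Microsoft.ML.TorchSharp.Tests.TextClassificationTests.TestSentenceSimilarity with memory usage 1,362,817,024.00 and max memory usage 1,368,600,576.00
Finished test: Microsoft.ML.TorchSharp.Tests.QATests.TestSimpleQA with memory usage 4,675,801,088.00 and max memory usage 5,540,958,208.00
"""

if __name__ == "__main__":
    for name, used, delta in growth_report(parse_memory_lines(SAMPLE)):
        change = "n/a" if delta is None else f"{delta / 2**30:+.2f} GiB"
        print(f"{name}: {used / 2**30:.2f} GiB in use ({change})")
```

Running this over a full console log would show which tests add the most retained memory, which may help separate a gradual leak from the one-off cost of TestSimpleQA.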