dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.98k stars 4.66k forks source link

SymLinksMayExistIndependentlyOfTarget failure #65119

Open danmoseley opened 2 years ago

danmoseley commented 2 years ago

Frequency:

Day Run Notes

4/1-7/22 7/12 | Official run | Win10 7/6 | PR #71263 | Win10 6/3 | PR #69885 | Win10

let failedTests = (methodName : string, includePR : bool, messageSubstr: string, includePassedOnRerun : bool) {
cluster('engsrvprod.kusto.windows.net').database('engineeringdata').AzureDevOpsTests
    //| where TestName startswith "System.Net" 
    | where TestName contains methodName
    | where includePassedOnRerun or (Outcome == 'Failed')
    | extend startOfTestName = indexof_regex(TestName, @"[^.]+$")
    | extend Method = substring(TestName, startOfTestName)
    | extend Type = substring(TestName, 0, startOfTestName - 1)
    | project-away startOfTestName
    | where Method == methodName
    | where Message contains messageSubstr
    | distinct JobId, WorkItemId, Message, StackTrace, Method, Type, Arguments, Outcome
    | join kind=inner (cluster('engsrvprod.kusto.windows.net').database('engineeringdata').Jobs
        | where ((Branch == 'refs/heads/main') or (Branch == 'refs/heads/master') or (includePR and (Source startswith "pr/")))
        | where Type startswith "test/functional/cli/"
            and not(Properties contains "runtime-staging")
        | summarize arg_max(Finished, Properties, Type, Branch, Source, Started, QueueName) by JobId
        | project-rename JobType = Type) on JobId
    | extend PropertiesJson = parse_json(Properties)
    | extend OS = replace_regex(tostring(PropertiesJson.operatingSystem), @'\((.*)\).*|([^\(].*)', @'\1\2')
    | extend Runtime = iif(PropertiesJson.runtimeFlavor == "mono", "Mono", iif(PropertiesJson.DefinitionName contains "coreclr", "CoreCLR", ""))
    | extend TargetBranch = extractjson("$.['System.PullRequest.TargetBranch']", Properties)
    | extend Architecture = PropertiesJson.architecture
    | extend Scenario = iif(isempty(PropertiesJson.scenario), "--", PropertiesJson.scenario)
    //| extend DefinitionName = PropertiesJson.DefinitionName
    | project-away PropertiesJson
};
failedTests('SymLinksMayExistIndependentlyOfTarget', true, '', true);

Runfo query for last 60 days (incl. PRs) with less hits than Kusto (from unknown reason)

ystem.IO.Tests.File_Exists.SymLinksMayExistIndependentlyOfTarget [FAIL]
      linkPath should no longer exist
      Expected: False
      Actual:   True
      Stack Trace:
        /_/src/libraries/System.IO.FileSystem/tests/File/Exists.cs(138,0): at System.IO.Tests.File_Exists.SymLinksMayExistIndependentlyOfTarget()

'System.IO.FileSystem.Tests' from job 3683fd05-6451-4dba-848b-64f9d459eac3 workitem 46af4fdd-dcb6-4ead-9081-44d83202df31 (windows.10.amd64.open.rt) executed on machine a009Z5S

hit in #64677

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/area-system-io See info in area-owners.md if you want to be subscribed.

Issue Details
``` ystem.IO.Tests.File_Exists.SymLinksMayExistIndependentlyOfTarget [FAIL] linkPath should no longer exist Expected: False Actual: True Stack Trace: /_/src/libraries/System.IO.FileSystem/tests/File/Exists.cs(138,0): at System.IO.Tests.File_Exists.SymLinksMayExistIndependentlyOfTarget() ``` 'System.IO.FileSystem.Tests' from job 3683fd05-6451-4dba-848b-64f9d459eac3 workitem 46af4fdd-dcb6-4ead-9081-44d83202df31 (windows.10.amd64.open.rt) executed on machine a009Z5S hit in #64677
Author: danmoseley
Assignees: -
Labels: `area-System.IO`
Milestone: -
danmoseley commented 2 years ago
            // Now delete the symlink.
            File.Delete(linkPath);
            Assert.False(File.Exists(linkPath), "linkPath should no longer exist");

odd

jozkee commented 2 years ago

I was thinking on name collision but that can't be the case for this test, this is how path and linkPath look like:

  path: C:\Users\david\AppData\Local\Temp\File_Exists_rvyyngfh.y1s\SymLinksMayExistIndependentlyOfTarget_121_9d65b352
  linkPath: C:\Users\david\AppData\Local\Temp\File_Exists_rvyyngfh.y1s\GetRandomLinkName_84_57004724.link

  path: C:\Users\david\AppData\Local\Temp\PathFile_Exists_vlbp50lx.sjy\SymLinksMayExistIndependentlyOfTarget_121_fe1a95b8
  linkPath: C:\Users\david\AppData\Local\Temp\PathFile_Exists_vlbp50lx.sjy\GetRandomLinkName_84_6cbea16a.link

Even though GetRandomLinkName_84 persist, we also use GetType() + Path.GetRandomFileName() and that should avoid the collision.

danmoseley commented 2 years ago

The way Windows handles file deletes has been a bit unintuitive to me in the past, and I believe changed at one point. @JeremyKuhne may have thoughts.

danmoseley commented 2 years ago

Another possibility perhaps is that the delete failed for some reason and we ate the error somehow

jozkee commented 2 years ago

FWIW the only error we eat on Delete is File Not Found. https://github.com/dotnet/runtime/blob/ccd67b0c449e11cb6fbfea75507b6d01c2182b2d/src/libraries/System.Private.CoreLib/src/System/IO/FileSystem.Windows.cs#L99-L106

JeremyKuhne commented 2 years ago

The way Windows handles file deletes has been a bit unintuitive to me in the past, and I believe changed at one point

For NTFS the name would not be freed for use until all file handles were closed. It changed in 1903- we have tests that check this behavior:

https://github.com/dotnet/runtime/blob/57bfe474518ab5b7cfe6bf7424a79ce3af9d6657/src/libraries/System.IO.FileSystem/tests/FileStream/ctor_str_fm_fa_fs.delete.cs#L115-L119

danmoseley commented 2 years ago

Maybe this is pre-1903 and it was a virus checker or something.

jozkee commented 2 years ago

For NTFS the name would not be freed for use until all file handles were closed.

@JeremyKuhne does this imply that File.Exists(path) will return true if not all handles to that file are yet closed? Cause that's what is happening here, File.Exists() is returning true right after the file was deleted.

@danmoseley CI says windows.10.amd64.open.rt, matching that to the list in https://helix.dot.net/ (ctrl+f) shows:

Operating System Windows Server 2016-Datacenter Architecture AMD64

I don't know how to check if that is pre-1903 cc @dotnet/area-infrastructure-libraries

JeremyKuhne commented 2 years ago

will return true if not all handles to that file are yet closed?

I believe so. And 2016 is way before 1903 (2019 March).

jozkee commented 2 years ago

@JeremyKuhne, thanks. Will close this issue as it is non actionable for us IMO.

danmoseley commented 2 years ago

@MattGal does anything like a virus checker run on these Server 2016 machines? Something that could grab random handles and potentially defeat deleting things.

MattGal commented 2 years ago

@MattGal does anything like a virus checker run on these Server 2016 machines? Something that could grab random handles and potentially defeat deleting things.

Any SKU that's effectively Windows 8 or newer has anti-virus stuff that you can't defeat even if it's a throwaway machine only used for testing. More recently, we've started allowing (because we absolutely must and it's unfortunately not up for debate... would that it were) Azure Security Pack and Geneva Monitoring onto the machines. These are services that can try to reconfigure antivirus as well as perform their own scans on the machine. When you combine this with enough executions, the behavior you're suspecting is pretty darn likely to be the case. See https://github.com/dotnet/core-eng/issues/15161 for more details.

One thing to note though, is that people who would run this test locally are also likely to have some form of antivirus turned on, and hardening the tests to both understand and tolerate this is likely the only long term solution that keeps us off automatic-scanning radar.

danmoseley commented 2 years ago

Thanks that makes it clear. OK if this happens again I'll put some retry around it. Right now there's zero evidence that this line is more likely to need a retry than any of the other places.

jtschuster commented 2 years ago

Hit again in rolling build 1873896 on net7.0-windows-Release-x86-CoreCLR_checked-Windows.10.Amd64.Open Log

danmoseley commented 2 years ago

@Jozkee what do you recommend here? this is blocking-clean-ci so we either need to fix, workaround or disable.

jozkee commented 2 years ago

A retry should be the best. This kind of failure is not exclusive to this unit test. The same happens on this other one that exercises the same assert:

https://dataexplorer.azure.com/clusters/engsrvprod/databases/engineeringdata?query=H4sIAAAAAAAAA22PsU7DYAyE9zzFqUth6guQraoUBAtUYv6JT4lJaqPfLgWJhydqJMTAZN3d9w0+MvKJcZ4z0HzjzdUwqUmrZqx48Tp1yVPA7Td08h96769XarmdLPtlZCUOahojBXctzC83t81uV0zwyIgyEL1bFrXAhh+0hKgsXML7/lxRAvx8Z5+UTbNqObr8sfacmWrD89fpQW3aO8PyWvJY6sBcvfVFtC22h6Lz9gfcOQWS9wAAAA==

JobId WorkItemId EventId JobName WorkItemName WorkItemFriendlyName Type Method ArgumentHash Arguments Result Duration Exception Message StackTrace Traits Reason Attempt AzurePipelinesTestRunId AzurePipelinesTestResultId JobId1 WorkItemId1 JobName1 Name FriendlyName Status PassCount FailCount SkipCount WarnCount Queued Started Finished ExitCode ConsoleUri MachineName WorkItemType Uri PassOnRetryCount Attempt1 QueueName LogUploadSeconds LogSizeBytes JobId2 Name1 Source Type1 Build Creator System IsExternal PullRequestId Queued1 Started1 Finished1 TestsPass TestsFail TestsSkip ItemsPass ItemsWarning ItemsBadExit ItemsFail ItemsError ItemsTimeout InitialItems TotalItems JobList Properties QueueName1 ItemsNotRun TestsPassedOnRetry ItemsPassedOnRetry Attempt2 ItemsDeadLetter QueueAlias DockerTag TeamProject Repository Branch
19906508 974759625 de30fe19-f93c-4d00-9023-25a9e2979c01 0070981a-3424-4773-b1c7-2675326f385c System.IO.FileSystem.Tests System.IO.Tests.DirectoryInfo_Delete DeletingSymLinkDoesntDeleteTarget Fail 0.077 linkPath should no longer exist Expected: False Actual: True at System.IO.Tests.Directory_Deletestr.DeletingSymLinkDoesntDeleteTarget() in //src/libraries/System.IO.FileSystem/tests/Directory/Delete.cs:line 121 at System.RuntimeMethodHandle.InvokeMethod(Object target, Void* arguments, Signature sig, Boolean isConstructor) at System.Reflection.MethodInvoker.Invoke(Object obj, IntPtr args, BindingFlags invokeAttr) in /_/src/libraries/System.Private.CoreLib/src/System/Reflection/MethodInvoker.cs:line 81 48061842 171875 19906508 974759625 de30fe19-f93c-4d00-9023-25a9e2979c01 0070981a-3424-4773-b1c7-2675326f385c System.IO.FileSystem.Tests Pass 0 0 0 0 2022-06-02T23:24:30.597Z 2022-06-02T23:41:31.987Z 2022-06-02T23:44:00.819Z 0 https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-70180-merge-de30fe19f93c4d0090/System.IO.FileSystem.Tests/1/console.77514c80.log?helixlogtype=result a0005EE other 0 1 windows.10.amd64.open.rt 0 0 19906508 de30fe19-f93c-4d00-9023-25a9e2979c01 pr/public/dotnet/runtime/refs/pull/70180/merge test/functional/cli/innerloop/ 27 true 2022-06-02T23:24:30.488Z 2022-06-02T23:24:30.519Z 2022-06-02T23:49:23.744Z 96092 1 1259 278 0 0 0 0 0 277 278 https://helixde107v0xdeko0k025g8.blob.core.windows.net/helix-job-2efdb516-920f-4186-8036-0b6aeb199ff777aff45fee040748a/job-list-ed8ed7fa-1ef2-483f-9d39-55ac6b530110.json?helixlogtype=result {"operatingSystem":"Windows.10.Amd64.Open","System.JobId":"215a953c-1e4b-5cc0-95e2-5484aabbe659","configuration":"Debug","runtimeFlavor":"coreclr","DefinitionId":"686","System.JobName":"__default","architecture":"x86","DefinitionName":"runtime","System.StageAttempt":"1","BuildNumber":"20220602.105","System.PullRequest.TargetBranch":"main","Reason":"PullRequest","BuildId":"1803799","System.PhaseAttempt":"1","AzurePipelinesTestRunId":"48061842","Project":"public","System.JobAttempt":"1","CollectionUri":"https://dev.azure.com/dnceng/","System.PhaseName":"libraries_test_run_checked_coreclr_windows_x86_Debug","System.StageName":"__default"} windows.10.amd64.open.rt 0 0 0 0 windows.10.amd64.open public dotnet/runtime refs/pull/70180/merge
karelz commented 2 years ago

Only 1 hit in last 60 days (as of 7/22). Only 3 hits since 4/1 in Kusto -- see frequency in top post. Removing blocking-clean-ci label.