axboe / fio

Flexible I/O Tester
GNU General Public License v2.0

Running on Windows with more than 1 processor group #527

Closed adityagulavani closed 6 years ago

adityagulavani commented 6 years ago

Is there a way to build/run Fio for Windows so that it supports more than 1 processor group (i.e. a system with more than 64 logical cores)?

On Linux, I am facing no issues at all scaling the workload across the 2 sockets. But on Windows, since it is using processor groups for scheduling the process, the workload is running on a single socket at any given time. This is preventing me from finishing the thread scaling study.

How do I run Fio on Windows with numjobs>64?

sitsofe commented 6 years ago

@adityagulavani

Looking at https://msdn.microsoft.com/en-us/library/windows/desktop/dd405503(v=vs.85).aspx , it looks like you'll probably have to develop a patch for this one - https://github.com/axboe/fio/blob/52fd65f47e7ba1ba346c53a4f31eb8b4f2024e92/os/os-windows.h#L175 presently knows nothing about processor groups.

One approach might be to check that no bits outside of the same 64-bit range have been set together (and bail out if they have). You could then teach fio to set the processor group to (mask CPU / 64) and the processor within that group of 64 to (mask CPU % 64). Those values could then be used with SetThreadGroupAffinity. Thoughts?
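For illustration, here is a minimal C sketch of that mapping (my own example, not fio code), assuming each processor group occupies its own 64-bit slice of the CPU mask and that the build targets Windows 7 / Server 2008 R2 or newer; the function name is made up:

#define _WIN32_WINNT 0x0601   /* SetThreadGroupAffinity needs Win7/2008 R2+ */
#include <windows.h>

/* Pin a thread to a "flat" CPU number where CPUs 0-63 live in group 0,
 * 64-127 in group 1, and so on. Returns 0 on success, -1 on failure. */
static int pin_thread_to_cpu(HANDLE thread, unsigned int cpu)
{
    GROUP_AFFINITY ga = { 0 };                 /* Reserved[] must stay zero */

    ga.Group = (WORD)(cpu / 64);               /* which processor group */
    ga.Mask  = (KAFFINITY)1ULL << (cpu % 64);  /* CPU within that group */

    return SetThreadGroupAffinity(thread, &ga, NULL) ? 0 : -1;
}

For example, pin_thread_to_cpu(GetCurrentThread(), 70) would land on CPU 6 of processor group 1.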

sitsofe commented 6 years ago

Also note that using SetThreadGroupAffinity will make the fio binary only usable on Windows 7/Windows 2008 R2 and above.

axboe commented 6 years ago

@sitsofe Probably we can make that configurable by using some configure argument to build for older versions. It seems much more important to support > 64 CPUs, as that's nothing special these days.
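A compile-time guard of the sort being suggested might look roughly like the sketch below (illustrative only, not fio's actual configure machinery; the function name is invented, and SetThreadAffinityMask is the older, single-group API):

#include <windows.h>

static int fio_win_set_cpu(HANDLE thread, unsigned int cpu)
{
#if defined(_WIN32_WINNT) && _WIN32_WINNT >= 0x0601
    /* Windows 7 / Server 2008 R2 and newer: any processor group is reachable */
    GROUP_AFFINITY ga = { 0 };

    ga.Group = (WORD)(cpu / 64);
    ga.Mask = (KAFFINITY)1ULL << (cpu % 64);
    return SetThreadGroupAffinity(thread, &ga, NULL) ? 0 : -1;
#else
    /* Older Windows: only a single group of up to 64 CPUs is addressable */
    if (cpu >= 64)
        return -1;
    return SetThreadAffinityMask(thread, (DWORD_PTR)1 << cpu) ? 0 : -1;
#endif
}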

sitsofe commented 6 years ago

It looks like going beyond one processor group will only "work" if the user is willing to do some sort of manual assignment of threads. Here's my understanding:

I don't think there's anything sensible that fio can safely do with multiple processor groups by itself, so the user will have to set some options if they want to use more than one processor group. If a processor_group=<int> option were added then, aside from the monotonic clock test only using one processor group, all the cpu* options could be treated as relative to the specified processor group, and a warning printed telling the user to use processor groups if they try to set a CPU above 64.

adityagulavani commented 6 years ago

if the user tries to set a CPU above 64

Should it be:

if the user tries to set a CPU above the number of cores in a processor group for that system?

sitsofe commented 6 years ago

@adityagulavani OK, I've taken a stab at this over on https://github.com/sitsofe/fio/tree/proc_group . It works by concatenating the processor groups together from fio's perspective. This was done because it's not easy to pass sideband information to the CPU affinity functions (and there are options like gtod_cpu, log_compression_cpus etc. which would need to grow some way of handling processor groups).

Usage: Imagine you have two processor groups set out like so:

  1. Processor group zero: 40 processors
  2. Processor group one: 32 processors

You would set cpus_allowed=0-39 to use all of the first processor group and cpus_allowed=40-71 to access the CPUs in the second processor group. A thread/job can only be in one processor group, so cpus_allowed=38-41 would not be allowed in this example. Obviously, different jobs/threads can be in different processor groups. If no CPU setting is done then everything should land in a single processor group.
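To make the concatenation concrete, here is a small standalone C sketch (my own illustration, not the actual patch) that translates a flat fio-style CPU number into a (group, CPU-within-group) pair using the group sizes Windows reports; the function name is invented:

#define _WIN32_WINNT 0x0601
#include <windows.h>
#include <stdio.h>

/* Returns 0 on success, -1 if flat_cpu is beyond the last group. */
static int flat_cpu_to_group(DWORD flat_cpu, WORD *group, DWORD *cpu_in_group)
{
    WORD ngroups = GetActiveProcessorGroupCount();
    WORD g;

    for (g = 0; g < ngroups; g++) {
        DWORD count = GetActiveProcessorCount(g);

        if (flat_cpu < count) {
            *group = g;
            *cpu_in_group = flat_cpu;
            return 0;
        }
        flat_cpu -= count;
    }
    return -1;
}

int main(void)
{
    WORD group;
    DWORD cpu;

    /* On the 40 + 32 example layout this would print group 1, CPU 13;
     * on a real machine it uses whatever group sizes Windows reports. */
    if (!flat_cpu_to_group(53, &group, &cpu))
        printf("flat CPU 53 -> group %u, CPU %lu\n", group, cpu);
    return 0;
}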

The patch is raw as I no longer have access to a Windows server, let alone a server with more than 64 CPUs, but it would be good to get some feedback. Run with --debug=process for extra debugging output.

adityagulavani commented 6 years ago

@sitsofe Sorry for the delay. I'll update this thread on Monday 2/20.

sitsofe commented 6 years ago

@adityagulavani were you able to test anything?

adityagulavani commented 6 years ago

Hi,

I'm sorry for the delay again. The system had been running other workloads.

I tested the binaries and it looks like we still have some issues with the cpus_allowed flag. I have a dual-socket system with 52 HT threads per socket, so each processor group has 52 cores.

Compiled with Cygwin64 (as mentioned on the fio git page).

When I run it as fio.exe --name=job --filename=\\.\H: --rw=randread --bs=4k --direct=1 --invalidate=1 --ioengine=windowsaio --iodepth=1024 --numjobs=64, it runs all threads on one processor group as expected, but I also see this warning/error printed 64 times:

CPU mask contains CPUs from different processor groups - aborting
clock setaffinity failed: No error

When I add --cpus_allowed=0-51 (or 0-N with N < 52, for that matter), not all the cores in the first processor group are used: the first 12 cores run at 100% and the rest idle.

When I set --cpus_allowed to anything at or above 52, I get the error below and fio aborts:

CPU mask contains CPUs from different processor groups - aborting
clock setaffinity failed: No error

There might be a bug in calculating the cpu mask or assigning thread affinity.

sitsofe commented 6 years ago

@adityagulavani Thanks! Back to the drawing board...

sitsofe commented 6 years ago

@adityagulavani Any chance you could attach the output when running your job with --debug=process --numjobs=1 --cpus_allowed=53 ?

adityagulavani commented 6 years ago

Done.

The command is in the file: fio_debug.log

sitsofe commented 6 years ago

@adityagulavani OK I've pushed a number of updates with more debugging. Any better?

adityagulavani commented 6 years ago

Perfect, this works neatly.

I've tested it: when a single job sets cpus_allowed across both processor groups, fio throws an error.

I had a couple more questions.

  1. Is there a way to schedule a single job with numjobs greater than the number of cores in a processor group? Would some more code changes allow a single job to run on both (multiple) processor groups, perhaps by spawning threads alternately across the processor groups?

  2. Have we tested the performance impact of fio running a single job (--numjobs=N) vs. two jobs (--numjobs=N/2) or multiple jobs (--numjobs=N/M)?

Edit: grammar

sitsofe commented 6 years ago

OK, so ends the "fun" part. Cleaning that commit up in such a way that it can be merged may take longer...

  1. cpus_allowed_policy=split (see http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-cpus-allowed-policy ) with a cpus_allowed that spans multiple processor groups may do what you want; see the example after this list.

  2. If it's the same fio run, I can't imagine it will make any difference whether you use numjobs or multiple "job sections", but I'll leave someone else to prove that :-)
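To make point 1 concrete: on the 52 + 52 core system described earlier, an invocation along the lines of the one below (an untested sketch reusing the options from the earlier command) should spread 104 single-CPU jobs across both processor groups:

fio.exe --name=job --filename=\\.\H: --ioengine=windowsaio --rw=randread --bs=4k --direct=1 --iodepth=1024 --numjobs=104 --cpus_allowed=0-103 --cpus_allowed_policy=split

With cpus_allowed_policy=split each job is allocated one unique CPU from the set, so cpus_allowed may span processor groups even though no individual job ever does.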

adityagulavani commented 6 years ago

Yep, works perfectly as expected.

Thanks a lot @sitsofe @axboe

sitsofe commented 6 years ago

@adityagulavani

I have a cleaned up branch over on https://github.com/sitsofe/fio/tree/proc_group . Could you check that it still works for you with the tests you did previously and could you cast your eye over some of the documentation changes (e.g. https://github.com/sitsofe/fio/commit/b570e037a5bb8abf780b75e896a8dee336c2dd74 ) and let me know if everything looks OK?

adityagulavani commented 6 years ago

Hi @sitsofe ,

I'll need some time to get my hands on the system along with my setup on it. It's currently running a different set of workloads. Will try to update you as soon as possible. Hopefully by next week.

Thanks,

sitsofe commented 6 years ago

@adityagulavani Just a quick check to see if you're going to have a chance to check on this in the next day. If not I'll just submit and you can check afterwards ;-)

adityagulavani commented 6 years ago

@sitsofe: I have access on Thursday and I'm planning to run it then; you can submit this if that is too late. I'll definitely update this thread by Thursday EOD. Thanks much :)

adityagulavani commented 6 years ago

Hi @sitsofe,

The code works fine just as before. I tested the latest code with the runs I had previously done:

  1. numjobs greater than the number of cores in a processor group, with cpus_allowed set to the cores of each processor group in turn (multiple runs)
  2. numjobs equal to the total cores on the system, with cpus_allowed spanning both groups and cpus_allowed_policy=split
  3. numjobs equal to the cores in each processor group, with multiple jobs (each job having a different cpus_allowed parameter)
  4. random tests with cpus_allowed set somewhere in the middle of the core distribution.

The workload is able to scale across both processor groups and is able to saturate all the cores.

It would be interesting to test this on a system that has an asymmetrical distribution of cores across the processor groups. I don't have such a system, though I hope the code will handle that condition as well.

The documentation is great as well (I've left an inline comment in the documentation at the end of .. option:: cpus_allowed=str).

Thanks a lot :)

axboe commented 6 years ago

Thanks for re-testing. Sitsofe, I went over your branch about a week ago; great work. I've pulled it in.

sitsofe commented 6 years ago

@adityagulavani @axboe Thanks for pushing this through (I'm just back from holiday so it's nice to hear it's mostly sorted :-) )

It would be interesting to test this on a system that has an asymmetrical distribution of cores across the processor groups. I don't have such a system, though I hope the code will handle that condition as well.

I don't have such a system either, but there's nothing in the code that should assume equally sized groups. Theoretically you can force Windows to create processor groups it wouldn't normally make by using BCDEdit, but I haven't gone that far myself.
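(For anyone who wants to try that: the relevant BCDEdit option is groupsize, e.g. bcdedit.exe /set groupsize 2 followed by a reboot caps each processor group at two logical processors, so even a small test machine ends up with multiple groups; untested in this thread.)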

@adityagulavani I'll see what I can do with respect to your comment - perhaps something like:

When using cpus_allowed_policy=split CPUs can be from different processor groups because each job is only allocated one of the specified CPUs.

jliang000 commented 3 years ago

Hi, we are running fio on Windows systems with more than 64 processors. With the latest fio version 3.27 we still get an error message like: "fio_getaffinity: pid 1792 is associated with 2 process groups." The I/O proceeds fine afterwards. Should we change the log to a warning instead of stderr output, or does it really change fio's expectations for the I/O test on Windows? Thanks

axboe commented 3 years ago

We should probably just use the latest group, if there is more than one. Does it work if you apply the below?

diff --git a/os/windows/cpu-affinity.c b/os/windows/cpu-affinity.c
index 7601970fc7c2..49c0aa869d0f 100644
--- a/os/windows/cpu-affinity.c
+++ b/os/windows/cpu-affinity.c
@@ -253,7 +253,7 @@ int fio_getaffinity(int pid, os_cpu_mask_t *mask)
            __func__, pid, GetLastError());
        goto err;
    }
-   if (group_count > 1) {
+   if (0 && group_count > 1) {
        log_err("%s: pid %d is associated with %d process groups\n",
            __func__, pid, group_count);
        goto err;

sitsofe commented 3 years ago

Hmm, this error message means that fio found it was a multi-group process while trying to work out the process affinity (i.e. the process itself is in a different processor group to at least one of its threads). From https://docs.microsoft.com/en-us/windows/win32/procthread/processor-groups :

If a thread is assigned to a different group than the process, the process's affinity is updated to include the thread's affinity and the process becomes a multi-group process.

You can also see a warning over on https://github.com/axboe/fio/blob/6202c70d8d5cbdd3fb4bc23b96f691cbd25a327e/os/windows/cpu-affinity.c#L217-L219 because I presumed Windows would start off a process in only one group. From the Processor Groups docs:

By default, an application is constrained to a single group, which should provide ample processing capability for the typical application. The operating system initially assigns each process to a single group in a round-robin manner across the groups in the system. A process begins its execution assigned to one group. The first thread of a process initially runs in the group to which the process is assigned. Each newly created thread is assigned to the same group as the thread that created it.

@jliang000 what options/job parameters are you running with?

@axboe to do what you're describing you would have to unset the processors in the mask that belong to groups other than the latest one before the mask is returned. I'm still curious as to how the process became multi-group before any thread affinities were set - do you know if fio_getaffinity() is sometimes called after fio_setaffinity()?
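For reference, a minimal standalone sketch of the "keep only the latest group" idea (my own illustration, not fio's os_cpu_mask_t handling) could start from GetProcessGroupAffinity, as below. The awkward part is deriving the per-group mask afterwards: the GetProcessAffinityMask documentation says it returns zero for both masks once a process spans groups, so fio would probably have to fall back to something like the group's full active-processor set.

#define _WIN32_WINNT 0x0601
#include <windows.h>
#include <stdio.h>

int main(void)
{
    USHORT groups[16];
    USHORT group_count = sizeof(groups) / sizeof(groups[0]);
    USHORT last;
    DWORD ncpus;

    /* Ask which processor groups the current process is associated with */
    if (!GetProcessGroupAffinity(GetCurrentProcess(), &group_count, groups)) {
        fprintf(stderr, "GetProcessGroupAffinity failed: %lu\n",
                GetLastError());
        return 1;
    }

    /* Treat the last entry reported as the "latest" group */
    last = groups[group_count - 1];
    ncpus = GetActiveProcessorCount(last);
    printf("process is in %u group(s); keeping group %u (%lu CPUs)\n",
           group_count, last, ncpus);
    return 0;
}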

jliang000 commented 3 years ago

@axboe @sitsofe thanks for the response. We can easily reproduce this with fio.exe --version; it prints the same error message. I did not see this issue on older Windows versions with the same number of CPU cores, so it seems to have been introduced by Windows Server 2022's behaviour around thread affinity with multiple processor groups (more than 64 cores). I know it is normal on Linux to set the affinity of different threads to different NUMA nodes for performance reasons. If we downgrade this error message to log_info instead of log_err, could it affect whether the performance tests behave as expected on other Windows systems?

NEO-AMiGA commented 3 months ago

What's the status of this one? Should it be fixed by now? I just ran into it today.

fio.exe -version
fio_getaffinity: pid 11772 is associated with 2 process groups
fio-3.37

axboe commented 3 months ago

Honestly, I have no idea. I don't have any Windows boxes at all, so I can't really test it myself. Happy to take patches if there's still an issue there...

NEO-AMiGA commented 3 months ago

Sadly, I don't think I can help with anything other than testing and providing output. Maybe it's more of a warning than something broken. The cpuclock-test passes. 🤔

.\fio.exe --cpuclock-test
fio_getaffinity: pid 21708 is associated with 2 process groups
cs: reliable_tsc: yes
time     3620  cycles[0]=2099979
time     3620  cycles[1]=2099889
time     3620  cycles[2]=2099950
time     3620  cycles[3]=2099907
time     3620  cycles[4]=2099915
time     3620  cycles[5]=2099918
time     3620  cycles[6]=2099903
time     3620  cycles[7]=2099935
time     3620  cycles[8]=2099940
time     3620  cycles[9]=2099935
time     3620  cycles[10]=2099920
time     3620  cycles[11]=2099912
time     3620  cycles[12]=2099957
time     3620  cycles[13]=2099904
time     3620  cycles[14]=2099925
time     3620  cycles[15]=2099915
time     3620  cycles[16]=2099921
time     3620  cycles[17]=2099946
time     3620  cycles[18]=2099940
time     3620  cycles[19]=2099904
time     3620  cycles[20]=2099923
time     3620  cycles[21]=2099904
time     3620  cycles[22]=2099964
time     3620  cycles[23]=2099923
time     3620  cycles[24]=2099928
time     3620  cycles[25]=2099928
time     3620  cycles[26]=2099904
time     3620  cycles[27]=2099928
time     3620  cycles[28]=2099960
time     3620  cycles[29]=2099915
time     3620  cycles[30]=2099940
time     3620  cycles[31]=2099917
time     3620  cycles[32]=2099951
time     3620  cycles[33]=2099871
time     3620  cycles[34]=2099917
time     3620  cycles[35]=2099896
time     3620  cycles[36]=2099931
time     3620  cycles[37]=2099875
time     3620  cycles[38]=2099901
time     3620  cycles[39]=2099935
time     3620  cycles[40]=2099857
time     3620  cycles[41]=2099934
time     3620  cycles[42]=2099942
time     3620  cycles[43]=2099942
time     3620  cycles[44]=2099932
time     3620  cycles[45]=2099935
time     3620  cycles[46]=2099935
time     3620  cycles[47]=2099934
time     3620  cycles[48]=2099931
time     3620  cycles[49]=2099921
time     3620  min=2099857, max=2099979, mean=2099923.780000, S=0.464693, N=50
time     3620  trimmed mean=2099924, N=39
time     3620  max_ticks=7559726400000, __builtin_clzll=21, max_mult=2440133
time     3620  tmp=2562046, sft=1
time     3620  tmp=1281023, sft=2
time     3620  tmp=640511, sft=3
time     3620  tmp=320255, sft=4
time     3620  tmp=160127, sft=5
time     3620  tmp=80063, sft=6
time     3620  tmp=40031, sft=7
time     3620  tmp=20015, sft=8
time     3620  tmp=10007, sft=9
time     3620  tmp=5003, sft=10
time     3620  tmp=2501, sft=11
time     3620  tmp=1250, sft=12
time     3620  tmp=625, sft=13
time     3620  tmp=312, sft=14
time     3620  tmp=156, sft=15
time     3620  tmp=78, sft=16
time     3620  tmp=39, sft=17
time     3620  tmp=19, sft=18
time     3620  tmp=9, sft=19
time     3620  tmp=4, sft=20
time     3620  tmp=2, sft=21
time     3620  tmp=1, sft=22
time     3620  clock_shift=22, clock_mult=1997359
time     3620  tmp=7559726400000, max_cycles_shift=0
time     3620  tmp=3779863200000, max_cycles_shift=1
time     3620  tmp=1889931600000, max_cycles_shift=2
time     3620  tmp=944965800000, max_cycles_shift=3
time     3620  tmp=472482900000, max_cycles_shift=4
time     3620  tmp=236241450000, max_cycles_shift=5
time     3620  tmp=118120725000, max_cycles_shift=6
time     3620  tmp=59060362500, max_cycles_shift=7
time     3620  tmp=29530181250, max_cycles_shift=8
time     3620  tmp=14765090625, max_cycles_shift=9
time     3620  tmp=7382545312, max_cycles_shift=10
time     3620  tmp=3691272656, max_cycles_shift=11
time     3620  tmp=1845636328, max_cycles_shift=12
time     3620  tmp=922818164, max_cycles_shift=13
time     3620  tmp=461409082, max_cycles_shift=14
time     3620  tmp=230704541, max_cycles_shift=15
time     3620  tmp=115352270, max_cycles_shift=16
time     3620  tmp=57676135, max_cycles_shift=17
time     3620  tmp=28838067, max_cycles_shift=18
time     3620  tmp=14419033, max_cycles_shift=19
time     3620  tmp=7209516, max_cycles_shift=20
time     3620  tmp=3604758, max_cycles_shift=21
time     3620  tmp=1802379, max_cycles_shift=22
time     3620  tmp=901189, max_cycles_shift=23
time     3620  tmp=450594, max_cycles_shift=24
time     3620  tmp=225297, max_cycles_shift=25
time     3620  tmp=112648, max_cycles_shift=26
time     3620  tmp=56324, max_cycles_shift=27
time     3620  tmp=28162, max_cycles_shift=28
time     3620  tmp=14081, max_cycles_shift=29
time     3620  tmp=7040, max_cycles_shift=30
time     3620  tmp=3520, max_cycles_shift=31
time     3620  tmp=1760, max_cycles_shift=32
time     3620  tmp=880, max_cycles_shift=33
time     3620  tmp=440, max_cycles_shift=34
time     3620  tmp=220, max_cycles_shift=35
time     3620  tmp=110, max_cycles_shift=36
time     3620  tmp=55, max_cycles_shift=37
time     3620  tmp=27, max_cycles_shift=38
time     3620  tmp=13, max_cycles_shift=39
time     3620  tmp=6, max_cycles_shift=40
time     3620  tmp=3, max_cycles_shift=41
time     3620  tmp=1, max_cycles_shift=42
time     3620  max_cycles_shift=42, 2^max_cycles_shift=4398046511104, nsecs_for_max_cycles=2094382710784, max_cycles_mask=000003ffffffffff
time     3620  cycles_start=8384017322968
cs: Testing 96 CPUs
cs: cpu  1: 7793052934 clocks seen, first 8384028978876
cs: cpu  0: 8381375160 clocks seen, first 8384028978957
cs: cpu 18: 8984479996 clocks seen, first 8384029359561
cs: cpu 19: 9020596626 clocks seen, first 8384029378648
cs: cpu 77: 9646066284 clocks seen, first 8384030085605
cs: cpu 66: 9668251204 clocks seen, first 8384030198530
cs: cpu 67: 9694695476 clocks seen, first 8384029954694
cs: cpu 76: 9728101432 clocks seen, first 8384030073283
cs: cpu 29: 10299812734 clocks seen, first 8384029516492
cs: cpu 32: 10326055618 clocks seen, first 8384029578122
cs: cpu 33: 10335635900 clocks seen, first 8384029621686
cs: cpu 22: 10359726484 clocks seen, first 8384029433518
cs: cpu 23: 10380705216 clocks seen, first 8384029450292
cs: cpu 28: 10481913426 clocks seen, first 8384029499120
cs: cpu 81: 10524493580 clocks seen, first 8384030138723
cs: cpu 80: 10525614626 clocks seen, first 8384030112980
cs: cpu 34: 10646688178 clocks seen, first 8384029635172
cs: cpu 35: 10677218242 clocks seen, first 8384029642492
cs: cpu 46: 10763621646 clocks seen, first 8384030022232
cs: cpu 47: 10770474264 clocks seen, first 8384029735889
cs: cpu 45: 10959943216 clocks seen, first 8384029731064
cs: cpu 44: 10972043564 clocks seen, first 8384029794401
cs: cpu  7: 10979419806 clocks seen, first 8384029202429
cs: cpu 40: 11005414082 clocks seen, first 8384029686533
cs: cpu 41: 11010557802 clocks seen, first 8384029700760
cs: cpu 26: 11084613938 clocks seen, first 8384029466124
cs: cpu 21: 11087466240 clocks seen, first 8384029414888
cs: cpu 20: 11093426500 clocks seen, first 8384029387502
cs: cpu 42: 11103976218 clocks seen, first 8384029701557
cs: cpu 27: 11104677208 clocks seen, first 8384029480694
cs: cpu 43: 11113052800 clocks seen, first 8384029716580
cs: cpu  3: 11115861862 clocks seen, first 8384029003378
cs: cpu  6: 11122689364 clocks seen, first 8384029194589
cs: cpu  4: 11133012164 clocks seen, first 8384029173304
cs: cpu  2: 11134621806 clocks seen, first 8384028987944
cs: cpu  5: 11141595900 clocks seen, first 8384029174899
cs: cpu 24: 11161574114 clocks seen, first 8384029450331
cs: cpu 25: 11171091512 clocks seen, first 8384029452383
cs: cpu 10: 11179057088 clocks seen, first 8384029236057
cs: cpu 11: 11188365742 clocks seen, first 8384029250019
cs: cpu 14: 11190228750 clocks seen, first 8384029289673
cs: cpu 15: 11201988304 clocks seen, first 8384029308416
cs: cpu 13: 11203369686 clocks seen, first 8384029274977
cs: cpu 30: 11204231364 clocks seen, first 8384029528445
cs: cpu 12: 11209657498 clocks seen, first 8384029259565
cs: cpu  9: 11211800854 clocks seen, first 8384029226481
cs: cpu 16: 11214344200 clocks seen, first 8384029315996
cs: cpu 31: 11215837608 clocks seen, first 8384029525056
cs: cpu 17: 11216355618 clocks seen, first 8384029347850
cs: cpu  8: 11224286088 clocks seen, first 8384029211533
cs: cpu 37: 11235545534 clocks seen, first 8384029667928
cs: cpu 38: 11239109800 clocks seen, first 8384029666292
cs: cpu 71: 11241024348 clocks seen, first 8384030033099
cs: cpu 39: 11242653096 clocks seen, first 8384029678418
cs: cpu 89: 11241741722 clocks seen, first 8384030559430
cs: cpu 70: 11243944020 clocks seen, first 8384030081097
cs: cpu 36: 11247668062 clocks seen, first 8384029679480
cs: cpu 53: 11379001532 clocks seen, first 8384029789110
cs: cpu 49: 11379192476 clocks seen, first 8384029753112
cs: cpu 58: 11390592040 clocks seen, first 8384029830979
cs: cpu 59: 11407314432 clocks seen, first 8384029836205
cs: cpu 48: 11413024310 clocks seen, first 8384029752940
cs: cpu 52: 11437681688 clocks seen, first 8384029785814
cs: cpu 82: 11439858296 clocks seen, first 8384030147363
cs: cpu 83: 11440933260 clocks seen, first 8384030170395
cs: cpu 75: 11444549908 clocks seen, first 8384030068453
cs: cpu 73: 11455823170 clocks seen, first 8384030827729
cs: cpu 69: 11462422484 clocks seen, first 8384030022749
cs: cpu 74: 11464634558 clocks seen, first 8384030052064
cs: cpu 72: 11465779740 clocks seen, first 8384030037459
cs: cpu 90: 11465595834 clocks seen, first 8384030702354
cs: cpu 68: 11466978588 clocks seen, first 8384031184389
cs: cpu 91: 11471549770 clocks seen, first 8384030243304
cs: cpu 50: 11484822300 clocks seen, first 8384029765670
cs: cpu 94: 11486513994 clocks seen, first 8384031020050
cs: cpu 51: 11492257268 clocks seen, first 8384029767927
cs: cpu 95: 11424607956 clocks seen, first 8384097494037
cs: cpu 78: 11499519846 clocks seen, first 8384030094384
cs: cpu 79: 11499760010 clocks seen, first 8384030106274
cs: cpu 65: 11515442480 clocks seen, first 8384029940972
cs: cpu 64: 11523827552 clocks seen, first 8384029932624
cs: cpu 54: 11533959492 clocks seen, first 8384029803696
cs: cpu 55: 11536087898 clocks seen, first 8384029803576
cs: cpu 88: 11535986224 clocks seen, first 8384030472827
cs: cpu 93: 11539831814 clocks seen, first 8384030955241
cs: cpu 92: 11541833858 clocks seen, first 8384030860673
cs: cpu 57: 11543314446 clocks seen, first 8384029816768
cs: cpu 62: 11545259284 clocks seen, first 8384029870095
cs: cpu 63: 11546045104 clocks seen, first 8384029900923
cs: cpu 56: 11548052268 clocks seen, first 8384029860218
cs: cpu 61: 11552595848 clocks seen, first 8384029847457
cs: cpu 60: 11554154056 clocks seen, first 8384029843943
cs: cpu 86: 11560057538 clocks seen, first 8384030197299
cs: cpu 87: 11560258686 clocks seen, first 8384030198911
cs: cpu 85: 11564585216 clocks seen, first 8384030182592
cs: cpu 84: 11565469068 clocks seen, first 8384030181962
cs: Pass!
PS C:\Program Files\fio>