firelab / windninja

A diagnostic wind model developed for use in wildland fire modeling.
https://weather.firelab.org/windninja/

Investigate file writing/reading speed performance on windows #147

Open jforthofer opened 8 years ago

jforthofer commented 8 years ago

The new CFD simulations are much slower on Windows than Linux. It seems especially slow at the beginning of the simulations when a lot of file writing is happening (converting DEM to STL, reading/writing/copying OpenFOAM dict files, etc.). I have a feeling that this may have to do with using GDAL's VSI functions rather than standard library functionality. I noticed there is a buffer size set in the DEM to STL writing, maybe adjusting this would help? Maybe the VSI stuff just has too much overhead? Why would it be slow on Windows but not Linux... not sure. It would be good to do some tests to see if this is the source of the problem or not. I think we have a standalone DEM to STL converter, so that might be a good isolated place to check out.

ksshannon commented 8 years ago

Can we please make sure this isn't related to full disk encryption? You guys may have figured this out, but I haven't seen anything. I can't reproduce on my VM, and I didn't seem to have a terrible slowdown when using another windows machine.

nwagenbrenner commented 8 years ago

There may be some slowdown associated with encryption, but it's not the whole problem. The slowdown happens on Windows desktops too, which are not encrypted.

ksshannon commented 8 years ago

I don't see it being code related, or the VMs would suffer too. #define VSILFILE FILE and #define VSIFWriteL fwrite may work as a test in the STL converter.
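A minimal sketch of that macro test, assuming the writer only touches the open/write/close calls and that GDAL's headers are left out of this standalone file so the real declarations do not clash. `WriteFacetRecords` and its zero-filled records are hypothetical stand-ins, not the actual stl_create code; 50 bytes is the size of one binary STL facet.

```cpp
#include <cstddef>
#include <cstdio>

// Test shim: route the VSI*L names used by the writer straight to stdio.
// Assumption: cpl_vsi.h is NOT included in this translation unit.
#define VSILFILE   FILE
#define VSIFOpenL  fopen
#define VSIFWriteL fwrite
#define VSIFCloseL fclose

// Hypothetical stand-in for the converter's write loop. Emits `count`
// zero-filled 50-byte records and returns how many were written
// (0 if the file cannot be opened).
std::size_t WriteFacetRecords(const char* path, std::size_t count)
{
    VSILFILE* fout = VSIFOpenL(path, "wb");
    if (fout == nullptr) return 0;

    const char record[50] = {0};
    std::size_t written = 0;
    for (std::size_t i = 0; i < count; ++i)
        written += VSIFWriteL(record, sizeof(record), 1, fout);

    VSIFCloseL(fout);
    return written;
}
```

Because the call sites keep their VSI names, removing the four macros (and restoring the GDAL include) switches the same code back to the VSI layer for an A/B timing.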

nwagenbrenner commented 8 years ago

So Kyle, are you saying a ninjafoam simulation runs in the same amount of time on your linux and your Windows vm?

ksshannon commented 8 years ago

Yes, I believe it did, I will check again. If I remember right, it did on your VM too. I'll look for the email.

ksshannon commented 8 years ago

From an email on 2016-02-25 from Natalie:

Yeah, I was running on a laptop.

I ran the coarse hi.tif case on my personal laptop with Ubuntu 14.04 and Windows 8.1:

Ubuntu: 48 min
Windows: 44 min

On my FS laptop: 2 hours

The big difference seems to be in the meshing. Meshing is 45% of the total simulation time on Windows, but only 2% of the time on linux. MDM runs SO SLOW on the FS machine.

I'm testing to see what we can reduce the solver iterations to right now. I still can't run in parallel on any windows machine.

Is this different now?

nwagenbrenner commented 8 years ago

I think that test must have been using 1 thread. Yeah, meshing speeds improved on Windows when we switched to DP. But you're right, looks like my personal Windows machine was comparable to linux speed. I don't remember my linux laptop being so slow...must be though if that's what I said. This is a weird problem that is difficult to test. Maybe it's related to the FS image. On the FS machines all i/o stuff seems slower. This includes stl creation, refineMesh, and decomposing/reconstructing the domain.

ksshannon commented 8 years ago

> I think that test must have been using 1 thread.

It was, it was elsewhere in the email.

> Yeah, meshing speeds improved on Windows when we switched to DP. But you're right, looks like my personal Windows machine was comparable to linux speed. I don't remember my linux laptop being so slow...must be though if that's what I said. This is a weird problem that is difficult to test. Maybe it's related to the FS image. On the FS machines all i/o stuff seems slower. This includes stl creation, refineMesh, and decomposing/reconstructing the domain.

Could anti-virus software be doing it too? Can we disable that for testing?

ksshannon commented 8 years ago

@jforthofer

> I have a feeling that this may have to do with using GDAL's VSI functions rather than standard library functionality.

You realize that VSI is being run on both platforms, right? The wrappers are slightly different on VSIFWriteL(), but unix has an extra if statement before fwrite() is called.

> I noticed there is a buffer size set in the DEM to STL writing, maybe adjusting this would help?

Can you point me to this? I can't find it.

> Maybe the VSI stuff just has too much overhead?

Again, the overhead is equal, if not more on unix.

> Why would it be slow on Windows but not Linux... not sure.

It appears all the OpenFOAM writing is taking extra time too, and they aren't using VSI (last time I checked). I don't think VSI is the right place to look, and I think it's a red herring for you.

> It would be good to do some tests to see if this is the source of the problem or not. I think we have a standalone DEM to STL converter, so that might be a good isolated place to check out.

Testing on different systems, and documenting performance should be done first. On this ticket ideally.
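For recording comparable numbers across machines, a small wall-clock helper is enough. This is only a sketch; `TimeSeconds` is a hypothetical name, and the callable would wrap whatever stage is being measured (for example a standalone STL conversion).

```cpp
#include <chrono>
#include <functional>

// Runs `work` once and returns the elapsed wall-clock time in seconds.
// steady_clock is used because it is monotonic, so the result is not
// affected by system clock adjustments during the run.
double TimeSeconds(const std::function<void()>& work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Reporting the result together with OS, CPU, disk type, encryption and antivirus state keeps the entries on this ticket comparable.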

nwagenbrenner commented 8 years ago

I'll see about disabling anti-virus for testing.

Also, just for completeness, the times I reported in Feb for hi.tif were slower than typical coarse cases, because that case was failing near the end of the simulation, coarsening and rerunning. I modified the meshing after that, so that type of failure should not happen anymore.

We should just rerun the previous run time comparisons with our current code.

ksshannon commented 8 years ago

Benchmarks for bigbutte_small.tif, coarse, 10mph, 225 deg (from email 2016-04-09).

Kyle
Ubuntu 14.04, Intel i5 Quad Core 2.2 GHz (SSD): 7m40s
Ubuntu 14.04, Intel Xeon Dual 8-core 2.4 GHz: 4m06s
Windows 7, Intel Core i7-2630 8-core, 2.0 GHz: 15m24s
Windows 7 Client on Ubuntu qemu Host, 2 cores: 22m48s

ksshannon commented 8 years ago

Natalie benchmarks from email above, specifics unknown:

Intel i5-4200u, 1.6 GHz (base), 2.29 GHz (max):

Ubuntu 14.04: 8.5 min
Windows 8.1: 24 min

ksshannon commented 8 years ago

> We should just rerun the previous run time comparisons with our current code.

Yes, I agree.

nwagenbrenner commented 8 years ago

To narrow it down a bit, I compared just stl_create for big_butte_small.tif (as requested by Jason).

Windows 7 FS laptop (from Audrey), non-encrypted, P8700 2.5 GHz: 13 s
Windows 7 FS laptop (mine), encrypted, i7-2620M 2.7 GHz: 8 s
Ubuntu 14.04 (my desktop), X5677 3.47 GHz: < 1 s

jforthofer commented 8 years ago

So this is good that the problem has been narrowed down to the kind of stuff done in stl_create. One thing to note is that I don't believe this is compiled during the cross-compile part, it is compiled on Windows natively using Visual Studio, right? This eliminates the possibility of problems in the cross compile (like linking to different C++ runtime libraries).

nwagenbrenner commented 8 years ago

stl_create is WindNinja code. It's not cross-compiled.

ksshannon commented 8 years ago

Is stl creation the only thing slowed down? I thought everything was running slower, the OpenFOAM stuff included.

nwagenbrenner commented 8 years ago

No, it's not just stl creation. It seems to be I/O things that are slower, which includes a lot of OpenFOAM stuff (e.g., domain decomp/reconstruction, refineMesh, and other things...).

ksshannon commented 8 years ago

> To narrow it down a bit, I compared just stl_create for big_butte_small.tif (as requested by Jason).
>
> Windows 7 FS laptop (from Audrey), non-encrypted, P8700 2.5 GHz: 13 s
> Windows 7 FS laptop (mine), encrypted, i7-2620M 2.7 GHz: 8 s
> Ubuntu 14.04 (my desktop), X5677 3.47 GHz: < 1 s

Can we do the same with antivirus turned off? Was stl_create compiled in release or debug mode?

> No, it's not just stl creation. It seems to be I/O things that are slower, which includes a lot of OpenFOAM stuff (e.g., domain decomp/reconstruction, refineMesh, and other things...).

So we have two completely separate processes, built with two completely separate runtimes, built with two completely different compilers, using two separate APIs suffering the same issue. OpenFOAM likely uses C++ streams for I/O. GDAL uses POSIX API on unix, Windows C API on windows. To me, this points to an external factor, not code. But maybe that is already agreed upon, I can't tell. I changed the ticket name because it didn't make any sense to me.

jforthofer commented 8 years ago

Kyle, the buffer I was talking about earlier is here: https://github.com/firelab/windninja/blob/master/src/ninja/stl_create.cpp#L236

I think Natalie is going to replace this code with C++ standard library equivalent code to see if things speed up. Another thing to test is to recode this with the VSI stuff included but put everything into the buffer instead of small buffer writes. Last, if none of this has any effect, we need to test with antivirus off. The only reason we're doing this stuff in this order is because Natalie thought it would be fastest (might be a bit of work for us to get an antivirus-off computer together).
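The "put everything into the buffer" idea can be sketched as a writer that collects small records in memory and hands them to the file layer in large chunks. `BatchedWriter` is a hypothetical helper, not WindNinja code, and it uses plain `fwrite` so it stands alone; the same pattern would apply in front of any write call.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Accumulates small writes and forwards them in chunks of roughly
// `capacity` bytes. The caller must keep `out` open until the writer
// has been flushed or destroyed.
class BatchedWriter {
public:
    BatchedWriter(std::FILE* out, std::size_t capacity)
        : out_(out), capacity_(capacity) { buffer_.reserve(capacity); }
    ~BatchedWriter() { Flush(); }

    void Write(const void* data, std::size_t size) {
        // Flush first if this record would overflow the chunk.
        if (buffer_.size() + size > capacity_) Flush();
        const char* bytes = static_cast<const char*>(data);
        buffer_.insert(buffer_.end(), bytes, bytes + size);
    }

    void Flush() {
        if (buffer_.empty()) return;
        std::fwrite(buffer_.data(), 1, buffer_.size(), out_);
        ++flushes_;
        buffer_.clear();
    }

    // Number of times data was actually handed to the file layer.
    std::size_t flushes() const { return flushes_; }

private:
    std::FILE* out_;
    std::size_t capacity_;
    std::vector<char> buffer_;
    std::size_t flushes_ = 0;
};
```

With a 1000-byte chunk, one hundred 50-byte records reach the file layer in 5 calls instead of 100, which is the effect a larger buffer in the STL writer would be aiming for.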

ksshannon commented 8 years ago

> Kyle, the buffer I was talking about earlier is here: https://github.com/firelab/windninja/blob/master/src/ninja/stl_create.cpp#L236

Yes, that would help by calling fsync() less, the underlying buffer is probably 1024, 2048 or 4096 bytes or somewhere along those lines, but you have to be very careful about how you write binary files. You can't just call:

    typedef struct s {
        short a;   /* the compiler may insert padding after this member */
        float b;
        double c;
    } s;
    // ...
    s m;
    fwrite(&m, sizeof( s ), 1, fout);  /* writes any padding bytes too */

without a properly packed struct, or modifying how the compiler handles structs. That is probably why it's currently done the way it's done. It does appear that our struct is properly aligned (homogeneous members), so it might work. Another option would be to dump the struct and just use one big array of floats.
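To make the padding concern concrete, here is a sketch under stated assumptions: on a typical x86-64 ABI the mixed struct above occupies 16 bytes although its fields total 14, so writing it whole emits 2 padding bytes. `Sample`, `SerializeFields` and `Facet` are hypothetical names; `Facet` only illustrates the homogeneous-member case and is not the actual WindNinja struct.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

struct Sample {
    short a;
    float b;
    double c;
};

// Total size of the three fields with no padding between them.
constexpr std::size_t kPackedSize =
    sizeof(short) + sizeof(float) + sizeof(double);

// Copies each member individually so compiler-inserted padding never
// reaches the output. Byte order is still the host's.
std::vector<unsigned char> SerializeFields(const Sample& s)
{
    std::vector<unsigned char> out(kPackedSize);
    unsigned char* p = out.data();
    std::memcpy(p, &s.a, sizeof(s.a)); p += sizeof(s.a);
    std::memcpy(p, &s.b, sizeof(s.b)); p += sizeof(s.b);
    std::memcpy(p, &s.c, sizeof(s.c));
    return out;
}

// Homogeneous members: twelve floats need no interior padding on
// mainstream compilers, so this layout can be written in one call.
struct Facet {
    float normal[3];
    float v1[3];
    float v2[3];
    float v3[3];
};
static_assert(sizeof(Facet) == 12 * sizeof(float),
              "expected no padding in an all-float struct");
```

The "one big array of floats" option mentioned above sidesteps the question entirely, since arrays never contain padding between elements.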

> I think Natalie is going to replace this code with C++ standard library equivalent code to see if things speed up. Another thing to test is to recode this with the VSI stuff included but put everything into the buffer instead of small buffer writes. Last, if none of this has any effect, we need to test with antivirus off. The only reason we're doing this stuff in this order is because Natalie thought it would be fastest (might be a bit of work for us to get an antivirus-off computer together).

If you #define some stuff, it should be a couple of lines in stl_create.cpp, at least to use C stdio functions.

I think you will get almost no performance boost out of it, because (I assume) OpenFOAM uses std::streams, but wtf do I know. You guys do what you want, I'm out on this one.

VSI calls for windows:

https://github.com/ksshannon/gdal/blob/trunk/gdal/port/cpl_vsil_win32.cpp#L308-L326

for unix:

https://github.com/ksshannon/gdal/blob/trunk/gdal/port/cpl_vsil_unix_stdio_64.cpp#L398-L440

nwagenbrenner commented 8 years ago

For another (non-stl_create) comparison, below are times for decomposePar -force on big_butte_small.tif (in WindNinja). Decomposing to 4 processors.

my Linux desktop (same as above): 11 s
my Windows laptop (same as above): 1 min 24 s

During decomposePar, polyMesh/*, U, k, p, and epsilon are copied over to new directories (one for each processor). OpenFOAM calls regIOobject::write().

Still not sure why we're seeing this slowdown in stl_create (which uses VSIFWriteL()) and also in the OpenFOAM code (which uses std::ofstream), but not in other types of file writing we do (e.g., writing kmzs). Or why the behavior would be different on Windows vs. Linux. But it's definitely related to I/O. There are Windows specific changes related to I/O in our patched version, e.g.:

https://github.com/firelab/OpenFOAM-2.2.x/blob/patched/src/OpenFOAM/db/IOstreams/Fstreams/OFstream.C#L75-78

I also see OFstream::debug is turned on for Windows but not Linux:

https://github.com/firelab/OpenFOAM-2.2.x/blob/master/v3-mingw-openfoam-2-2-x.patch#L24591-24592

Maybe the first thing to try is flipping off the debug switch in the Windows build to see if that has an effect on speed.

nwagenbrenner commented 8 years ago

Unsetting the OFstream debug switch did not affect the run time for decomposePar. It did prevent a bunch of garbage from being spewed out.

nwagenbrenner commented 8 years ago

I changed VSIFWriteL to fwrite in stl_create and re-ran the conversion for big_butte_small.tif on my Linux laptop and my Windows laptop. It now takes less than 1 second on both. So it appears something is going on in VSIFWriteL that is affecting performance on Windows. Still don't know what is causing the slowdown for OpenFOAM file writing on Windows.

ksshannon commented 8 years ago

I still don't think it's the VSI layer, it is probably the use of the native win32 API, which flushes at each call:

http://stackoverflow.com/questions/14290337/is-fwrite-faster-than-writefile-in-windows
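For contrast, stdio streams batch small writes in a user-space buffer before handing them to the OS, and that buffer can be sized explicitly with `setvbuf`. A minimal sketch, with `OpenBuffered` as a hypothetical helper name:

```cpp
#include <cstddef>
#include <cstdio>

// Opens `path` for binary writing as a fully buffered stdio stream with
// a buffer of `buffer_bytes` bytes. Returns nullptr on failure.
std::FILE* OpenBuffered(const char* path, std::size_t buffer_bytes)
{
    std::FILE* f = std::fopen(path, "wb");
    if (f == nullptr) return nullptr;

    // setvbuf must be called before any other operation on the stream.
    // Passing nullptr lets the library allocate the buffer itself.
    if (std::setvbuf(f, nullptr, _IOFBF, buffer_bytes) != 0) {
        std::fclose(f);
        return nullptr;
    }
    return f;
}
```

Many small `fwrite` calls on such a stream reach the OS only when the buffer fills or the stream is closed, which may explain why the `fwrite` swap removed the STL slowdown.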

nwagenbrenner commented 8 years ago

I think the slow OpenFOAM processes on Windows may be related to our CPLSpawn calls.

https://github.com/firelab/windninja/blob/master/src/ninja/ninjafoam.cpp#L1430

The output is redirected to log files and some of them get pretty big. CPLSpawn uses VSIFWriteL. I'm looking into it.

ksshannon commented 8 years ago

Are the times for decomposePar above while running in WindNinja, or standalone?

nwagenbrenner commented 8 years ago

The times previously reported were for decomposePar while running in WindNinja.

nwagenbrenner commented 8 years ago

Standalone, decomposePar runs in 8 seconds on my windows laptop.

nwagenbrenner commented 8 years ago

Same with surfaceTransformPoints and blockMesh. They run much faster standalone than inside of WindNinja on Windows.

ksshannon commented 8 years ago

Okay, I misunderstood. I thought those were run standalone.

ksshannon commented 8 years ago

I am not sure, but I don't think there is a way around the CPLSpawn issue without writing a process library. We could just cut/paste GDAL's and use it, but change the functions to take FILE*.

nwagenbrenner commented 8 years ago

Sorry for the confusion. Before I was just reporting times spit out at the end of the WindNinja simulation. Today was the first time I really compared standalone vs. via WindNinja. I just didn't really consider CPLSpawn being an issue until today.

ksshannon commented 8 years ago

Likely the same VSIFWriteL issue, unfortunately (for so many reasons). CPLSpawn is calling it, but I think it might be buffered, but apparently not much. As a test, maybe pass NULL to the processes for the stdout, and if it runs faster, that's it. I assume that's how you handle progress, reading the logs, so that will be messed up, but the speed should change, I think.

nwagenbrenner commented 8 years ago

Yeah, that's what I tried today, but CPLSpawn didn't like NULL for the stdout. I tried it just for the surfaceTransformPoints call on linux and it just kept failing. I'll look at it closer tomorrow.

ksshannon commented 8 years ago

You'll probably have to comment out all of your reads as well. I won't armchair from here...

nwagenbrenner commented 8 years ago

What do you mean by "reads"? You mean related to parsing the stdout for progress? Yeah, I can do that later... But do you see why passing NULL as the fout parameter would cause a problem in SurfaceTransformPoints?

https://github.com/firelab/windninja/blob/master/src/ninja/ninjafoam.cpp#L1424

I don't see why that should be an issue, but it is.

ksshannon commented 8 years ago

It shouldn't be. I just thought of a fix. I'm riding. I can implement by tomorrow night. Maybe tonight.

ksshannon commented 8 years ago

Natalie, it might take a little work to get a proper buffered writer installed in VSI. In the meantime, you can write logs to in-memory files using filenames such as /vsimem/somename or /vsimem//home/with/abs/path. They will be held in memory, and before closing, call CopyFile() on them to the correct file path. I'll look into providing a buffered writer, or another option.
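The same "hold it in memory, copy out once" pattern can be sketched with the standard library alone. Here `std::ostringstream` stands in for the /vsimem/ file and a single `std::ofstream` write stands in for the final copy; `MemoryLog` is a hypothetical class, not part of GDAL or WindNinja.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Collects log lines in memory and writes them to disk in one pass.
class MemoryLog {
public:
    void Append(const std::string& line) { buffer_ << line << '\n'; }

    // Everything appended so far, e.g. for progress parsing.
    std::string Text() const { return buffer_.str(); }

    // Writes the accumulated text to `path`; returns false on I/O failure.
    bool SaveTo(const std::string& path) const {
        std::ofstream out(path, std::ios::binary);
        if (!out) return false;
        out << buffer_.str();
        out.flush();
        return out.good();
    }

private:
    std::ostringstream buffer_;
};
```

The trade-off is that the log is lost if the process dies before `SaveTo` runs, so this suits a timing test better than production logging.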

nwagenbrenner commented 8 years ago

I'm starting to think this is not related to CPLSpawn after all. Previous times reported on this ticket may have been before we discovered the power settings issue on Windows (everything was running much slower then). At any rate, decomposePar and surfaceTransformPoints now seem to run at about the same speed on Windows, whether standalone or via WindNinja. For clarification, here are the execution times for decomposePar on big_butte_small.tif:

Windows standalone: 9.35s
Windows via WindNinja: 9.96s

Now I think there may be something specific to our cross-compiled OF slowing things down. See execution times for checkMesh on big_butte_small.tif for our cross-compiled OF, BlueCFD's OF for Windows, and Linux (all standalone from the command line):

our cx-compiled OF (my Windows laptop): 24.85s
BlueCFD OF (my Windows laptop): 10.75s
Linux: 4.35s

Again, sorry for the confusion. There have been several issues slowing down performance on Windows (STL writing issue related to WriteFile(), Windows power settings throttling processes, OF SP vs. DP build) and that has complicated things.

ksshannon commented 8 years ago

> our cx-compiled OF (my Windows laptop): 24.85s
> BlueCFD OF (my Windows laptop): 10.75s
> Linux: 4.35s

That is still pretty crappy win32 Blue vs linux.

> Again, sorry for the confusion. There have been several issues slowing down performance on Windows (STL writing issue related to WriteFile(), Windows power settings throttling processes, OF SP vs. DP build) and that has complicated things.

No worries, it's a big ticket. Should I not worry about implementing a buffered reader? I can try the vsimem method, low effort, and it should help if it is the problem.

nwagenbrenner commented 8 years ago

I don't think you need to worry about a buffered reader. And I did try the vsimem method you mentioned and it didn't seem to improve the speed at all. Thanks though.