vlandau closed this issue 4 years ago.
Yes, with 1M pixels, Circuitscape memory usage shouldn't be an issue. Have you tried memory allocation profiling? https://docs.julialang.org/en/v1/manual/profile/#Memory-allocation-analysis-1
I'm going to guess that 32-bit probably won't give you meaningful savings.
That's good to know. I'll look into profiling. I haven't tried it before. Thanks!
@ranjanan can you speak at all to the memory savings Circuitscape got with 32 bit (or was that mostly done for speed)?
Note that it is a bit rough and the actual allocation it reports may be off by a few lines. I still find it quite useful.
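For anyone following along, a minimal version of that allocation-profiling workflow might look like this (the function here is a made-up stand-in, not actual Omniscape code):

```julia
# Start Julia with: julia --track-allocation=user
using Profile

# Hypothetical stand-in for an Omniscape function we want to profile.
function build_surface(n)
    a = rand(n, n)   # allocates an n×n Float64 matrix
    return a .* 2    # the broadcast allocates a second matrix
end

build_surface(100)           # run once so compilation allocations happen now
Profile.clear_malloc_data()  # discard everything recorded so far
build_surface(100)           # only this call's allocations remain recorded
# On exit, Julia writes .mem files next to each source file with
# per-line allocation byte counts.
```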
Tacking this on here: noting that `readdlm` from `DelimitedFiles` uses a ton of memory (>5 GB) to read in a 1.2 GB .asc file, and it all remains allocated.
We should probably use something else. Can CSV.jl read these files? Or make GDAL the default reader?
`readdlm` does not receive much attention; it isn't meant for such large files and is more of a convenience function.
CSV.jl currently seems to error out while reading our ASC files: https://github.com/JuliaData/CSV.jl/issues/583. As for GDAL, I did an informal benchmark trying to read a 300 MB ASC file (https://github.com/Circuitscape/BigTests/blob/master/96m/cellmap.asc.gz):
```julia
julia> @time a = readdlm("../96m/cellmap.asc", skipstart=6);
 22.969700 seconds (1.92 M allocations: 4.173 GiB, 4.06% gc time)
```
and with ArchGDAL:
```julia
julia> @time b = ArchGDAL.read(ArchGDAL.read("../96m/cellmap.asc"), 1);
 14.599338 seconds (86.79 k allocations: 370.527 MiB, 0.13% gc time)
```
We certainly need to shift away from `readdlm`. Let's wait for the CSV.jl benchmark as well; it supports multithreaded reading, so let's see how much gain we get from that too.
Thanks! That is great advice. I might play around with ArchGDAL. Would I also add GDAL_jll.jl as a dependency so that GDAL gets installed (easily)?
EDIT: I think maybe I don't need GDAL_jll.jl, looks like GDAL.jl (for which ArchGDAL is a wrapper) builds from binaries for you.
```julia
julia> @time b = ArchGDAL.read(ArchGDAL.read("../96m/cellmap.asc"), 1);
```

@ranjanan why do you have to wrap `ArchGDAL.read` with `ArchGDAL.read` here?
EDIT: never mind, I think I've got it. It looks like calling `ArchGDAL.read` on a filename creates an ArchGDAL dataset, and `read` has another method that gets the actual values from that dataset.
Alright! Got a memory allocation profile for the functions called in Omniscape.
There appear to be some very obvious culprits.
The first one I identified off the bat was `clip()`, and I've got an idea for how to significantly reduce that function's memory demands.
Aren't those operations just creating large arrays?
Also, `clip` could easily avoid array operations by using a loop, but the garbage generated should get collected anyway. Curious to see what helps.
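For what it's worth, here is a loop-based sketch of that idea (the signature and circular-window semantics are hypothetical, not Omniscape's actual `clip`): write into a preallocated output instead of building intermediate mask arrays.

```julia
# Hypothetical in-place clip: keep cells within `radius` of (c_row, c_col),
# set everything else to the -9999 nodata value, reusing `out` each call.
function clip!(out::Matrix{Float64}, src::Matrix{Float64},
               c_row::Int, c_col::Int, radius::Int)
    @inbounds for j in axes(src, 2), i in axes(src, 1)
        if (i - c_row)^2 + (j - c_col)^2 <= radius^2
            out[i, j] = src[i, j]
        else
            out[i, j] = -9999.0
        end
    end
    return out
end

src = ones(5, 5)
out = similar(src)   # allocated once, reusable across moving-window steps
clip!(out, src, 3, 3, 1)
```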
What's going on with the `null_current_total` line?
`null_current_total` has to do with an artifact correction when using "blocks" as moving window centers instead of individual pixels. I basically translated what Brad had already implemented in Omniscape.py into Julia. I'll have to revisit his code to see if there's a better way I can implement it.
I'm wondering if GC is not working efficiently? For example, memory that was allocated during `readdlm` does not seem to be garbage collected (AFAICT). Switching to a function other than `readdlm` could solve that specific issue, but I was just using it as an example.
Also, maybe that `null_current_total` allocation is so large because the loop caused that specific line to be evaluated 81*81 times in the run I tested on? New to memory profiling, so not sure exactly how it works :slightly_smiling_face:
Yes, that would be because it is allocating the array each time in a loop. That's probably leading to poor performance too.
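A minimal illustration of that pattern (the function names here are invented):

```julia
# Allocates a fresh array on every iteration of the loop:
function totals_allocating(n, iters)
    total = 0.0
    for _ in 1:iters
        buf = zeros(n)   # new allocation each pass -> GC pressure
        buf .= 1.0
        total += sum(buf)
    end
    return total
end

# Hoists the allocation out of the loop and reuses one buffer:
function totals_preallocated(n, iters)
    total = 0.0
    buf = zeros(n)       # allocated exactly once
    for _ in 1:iters
        buf .= 1.0       # overwrite in place
        total += sum(buf)
    end
    return total
end
```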
In some cases `replace!` may work better as a way to replace values with -9999 etc. Sometimes it may be slower but save memory. Worth trying.
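For example, `Base.replace!` rewrites values in place, while the non-mutating `replace` allocates a full copy (the values here are just illustrative):

```julia
# In place: rewrite nodata markers without allocating a new array
vals = [0.5, Inf, 1.5]
replace!(vals, Inf => -9999.0)

# The non-mutating version allocates a whole new array of the same size
vals2 = replace([0.5, Inf, 1.5], Inf => -9999.0)
```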
Inside the loops, depending on how small the inner loop is, it could be better to manually hoist `arguments["radius"]` outside the loop (assign it to a variable before the loop) so you don't have to repeatedly pay the lookup cost inside the loop.
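A quick sketch of that hoisting suggestion (this `arguments` dict is a stand-in for Omniscape's):

```julia
arguments = Dict("radius" => 10)

# Lookup inside the loop: pays the Dict hashing cost on every iteration,
# and the result is not concretely typed inside the loop body.
function sum_radii_inner(arguments, n)
    s = 0
    for _ in 1:n
        s += arguments["radius"]
    end
    return s
end

# Hoisted: one lookup, bound to a local before the loop starts.
function sum_radii_hoisted(arguments, n)
    radius = arguments["radius"]
    s = 0
    for _ in 1:n
        s += radius
    end
    return s
end
```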
In general, reducing memory usage may also improve performance.
Thanks so much for the suggestions and insights! I'll try some of those things out.
I am sure you have seen this, but many of those tips are related to memory usage.
Note to self: I think this deepcopy is entirely unnecessary.
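Where no copy is needed at all, a `view` can stand in for a `deepcopy` without allocating (a generic sketch, not the actual Omniscape line):

```julia
big = rand(1_000, 1_000)          # ~8 MB of Float64s
window = view(big, 1:100, 1:100)  # lightweight wrapper, shares big's memory
# deepcopy(big), or slicing with big[1:100, 1:100], would each allocate
# a fresh array; the view allocates essentially nothing.
```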
This massive memory using line is only run once before the main Omniscape program starts up in parallel, so there is some leeway.
Linking #36, which was just closed by #38
Omniscape is now running about 3-4 times faster in serial for a large problem I'm working with since applying some of these fixes, and I'm able to run it on 3 times more parallel processes! :tada:
Close?
Almost! Just checking a few more things.
Actually, I might as well close. I'm thinking a 32-bit option will save GBs of memory, since Omniscape will in some cases be working with up to 6 separate arrays (for large problems with climate connectivity enabled), each several GB in size. I'll create another issue for that, though!
Yes makes sense to do that now.
I'm getting out-of-memory errors using a 1.2 GB resistance surface on one worker, so I think a closer look needs to be taken at memory consumption. This is on a 32 GB RAM machine, so this is no good.
One idea off the bat is to calculate inputs to 32-bit precision, but I think there are likely many more gains to be had by tweaking the code itself.
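The back-of-envelope savings from 32-bit precision are easy to check, since halving element size halves array memory:

```julia
n = 1_000
a64 = zeros(Float64, n, n)
a32 = zeros(Float32, n, n)
sizeof(a64)  # 8_000_000 bytes
sizeof(a32)  # 4_000_000 bytes
# For a 1.2 GB Float64 surface, Float32 saves ~600 MB per array;
# with up to 6 such arrays in play, that's several GB total.
```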
In theory, the Circuitscape problems themselves should be quite small, <<1M pixels, so most of the consumption is probably happening on the Omniscape side of things.
@ranjanan and @ViralBShah I know you're both busy, but just cc'ing you here in case you have some tips or direction on how to go about addressing this.