Experiment with GDAL for IO

ViralBShah commented 10 years ago

GDAL used to be faster than numpy's IO routines at one point. We need to benchmark GDAL again to see whether it is still worthwhile to use, and remove the code that calls GDAL if it is not.

ViralBShah commented 10 years ago

@tanmaykm Do you think using mmap to read the files would save memory or improve performance?

ViralBShah commented 10 years ago

Or quite likely, since the files are text files, we cannot mmap them.

tanmaykm commented 10 years ago

Yes, I doubt if mmap would help much with the current file formats as these are text files and not direct memory representations.

It is possible to map arrays (including numpy arrays) to memory though. If these formats are interoperable with other software that may use outputs from circuitscape, we can probably use them.
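
A minimal sketch of what mapping a numpy array to memory could look like, assuming the grid has already been stored in numpy's binary `.npy` format. The file name and the `open_grid_readonly` helper are hypothetical and are not part of Circuitscape.

```python
import os
import tempfile

import numpy as np


def open_grid_readonly(path):
    """Memory-map a .npy file so data is paged in from disk on demand."""
    return np.load(path, mmap_mode="r")


# Write a small raster, then read one row without loading the whole file.
tmp_dir = tempfile.mkdtemp()
grid_path = os.path.join(tmp_dir, "resistance.npy")  # hypothetical file name
np.save(grid_path, np.arange(12, dtype=np.float64).reshape(3, 4))

grid = open_grid_readonly(grid_path)
second_row = np.array(grid[1])  # copies only this slice into memory
```

This only works for binary files whose bytes match the in-memory layout, which is why it does not apply directly to the text grid formats discussed above.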

bmcrae commented 10 years ago

Numpy arrays aren't really interoperable with other software, unfortunately. I have code in Circuitscape to read/write them because I have an ArcGIS tool (Linkage Mapper) that passes data to/from Circuitscape. But I essentially have to use my own protocol, passing header information in a separate text file. I use Numpy arrays because they are much faster to read/write than text files (although GDAL could make up for this). GDAL might add considerably to the package size, if I remember correctly.
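
To illustrate the kind of protocol described here, the sketch below stores the raster as a binary `.npy` file and the georeferencing header in a separate text file. The file suffixes, header keys, and function names are assumptions for illustration; the actual exchange format used between Linkage Mapper and Circuitscape is not shown in this thread.

```python
import os
import tempfile

import numpy as np


def save_with_header(base_path, data, header):
    """Write the array as binary and the header as 'key value' text lines."""
    np.save(base_path + ".npy", data)
    with open(base_path + ".hdr.txt", "w") as f:  # hypothetical sidecar name
        for key, value in header.items():
            f.write(f"{key} {value}\n")


def load_with_header(base_path):
    """Read back the array and parse every header value as a float."""
    data = np.load(base_path + ".npy")
    header = {}
    with open(base_path + ".hdr.txt") as f:
        for line in f:
            key, value = line.split()
            header[key] = float(value)
    return data, header


base = os.path.join(tempfile.mkdtemp(), "habitat")  # hypothetical base name
original = np.array([[1.0, 2.0], [3.0, -9999.0]])
save_with_header(base, original, {"cellsize": 30.0, "nodata": -9999})
restored, restored_header = load_with_header(base)
```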

Brad McRae, Ph.D. The Nature Conservancy North America Region Tel: 541-223-1170 email: bmcrae@tnc.org http://www.nature.org/ourscience/brad-mcrae.xml

tanmaykm commented 10 years ago

Did an IO test with numpy binary save, savetxt, and GDAL. Numpy binary save is the fastest, and GDAL seems to be the slowest of the lot. Numpy arrays seem to be the best way to exchange data, and the binary format is a big bonus wherever the other side can read it, as with Linkage Mapper for ArcGIS.

numpy savetxt time=3.02454590797
gdal time=5.78220915794
numpy save time=0.459983110428

I was using GDAL-1.10.1. I guess the results indicate we should remove the GDAL option altogether.

Here's the code I used for the test: https://gist.github.com/tanmaykm/7418142
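
For readers without access to the gist, a comparison of this kind might be sketched as follows. This is not the code from the gist: the array size, file names, and the `time_write` helper are assumptions, and the GDAL case is left out so the example runs with numpy alone.

```python
import os
import tempfile
import time

import numpy as np


def time_write(write_fn, path, data):
    """Return the wall-clock seconds taken by write_fn(path, data)."""
    start = time.perf_counter()
    write_fn(path, data)
    return time.perf_counter() - start


data = np.random.rand(300, 300)  # arbitrary test size
tmp_dir = tempfile.mkdtemp()
binary_path = os.path.join(tmp_dir, "grid.npy")
text_path = os.path.join(tmp_dir, "grid.txt")

timings = {
    "numpy save": time_write(np.save, binary_path, data),
    "numpy savetxt": time_write(np.savetxt, text_path, data),
}
for name, seconds in timings.items():
    print(f"{name} time={seconds:.4f}")
```

Absolute numbers depend heavily on the disk, the array size, and the library versions, so they should only be compared within a single run.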

bmcrae commented 10 years ago

Great info, thanks!

bmcrae commented 10 years ago

To me it looks like IO is a relatively minor speed issue. Right now, reading a 96m-cell grid takes 130 sec. That compares with 815 sec for construct_g_graph, and a total of 1347 sec for construct_component_map.
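
One way to collect per-stage timings like these is a small context manager that accumulates wall-clock time by stage name. This is a sketch only: the stage names and the stand-in workloads are hypothetical and do not call into Circuitscape.

```python
import time
from contextlib import contextmanager

stage_seconds = {}


@contextmanager
def timed(stage):
    """Add the wall-clock time spent inside the block to stage_seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        stage_seconds[stage] = stage_seconds.get(stage, 0.0) + elapsed


with timed("read_grid"):
    cells = list(range(1000))  # stand-in for reading a raster
with timed("build_graph"):
    total = sum(c * c for c in cells)  # stand-in for graph construction

overall = sum(stage_seconds.values())
# Fraction of the measured time spent in each stage.
shares = {s: (t / overall if overall else 0.0) for s, t in stage_seconds.items()}
```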

tanmaykm commented 10 years ago

True. IO is not an issue in comparison to the other stuff, and there seem to be ways to tackle IO when required.

ViralBShah commented 10 years ago

Brad, can you provide a benchmark case that will exercise loss of file IO on a decent problem size? This could be in the BigTests repo.

ViralBShah commented 10 years ago

I mean lots of file io.

ViralBShah commented 10 years ago

Also writing is likely to be much slower than reading. The test case should be about writing lots of large current maps.

bmcrae commented 10 years ago

Sure, I can make such a test case. I just did some tests with a 6m problem and IO is pretty minimal. Reading maps takes 6-8 secs, writing them takes 3. This is out of ~400 seconds total. The long apparent time for current mapping is all taken up in calculating the map values themselves based on voltages. So there's probably a lot more to be gained from making that process (generating current map values) more efficient than making the writeaaigrid process more efficient.

So, speed hurdles seem to be: AMG hierarchy, solver, and creating (not writing) current maps.
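
As a rough illustration of computing current values from voltages, the sketch below derives branch currents with Ohm's law and sums them per node in a vectorized way. It is not Circuitscape's implementation: the edge-list representation, the function name, and the halving convention (which gives the through-current only for nodes that are neither sources nor sinks) are all assumptions.

```python
import numpy as np


def node_currents(rows, cols, conductances, voltages):
    """Half the summed |branch current| at each node.

    Each undirected edge k connects rows[k] and cols[k] and is listed once.
    """
    branch = conductances * (voltages[rows] - voltages[cols])  # Ohm's law
    magnitude = np.abs(branch)
    n = len(voltages)
    # bincount with weights sums the magnitudes per node without a Python loop.
    touching = (np.bincount(rows, weights=magnitude, minlength=n)
                + np.bincount(cols, weights=magnitude, minlength=n))
    return touching / 2.0


# Three nodes in a chain (0 - 1 - 2) with unit conductances.
rows = np.array([0, 1])
cols = np.array([1, 2])
conductances = np.array([1.0, 1.0])
voltages = np.array([2.0, 1.0, 0.0])
currents = node_currents(rows, cols, conductances, voltages)
```

In this chain, one unit of current flows through the middle node. The end nodes come out at half that value because the injected or withdrawn current is not included in this sketch.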

bmcrae commented 10 years ago

I should note that I'm using an SSD. That could explain fast read/write times on my end.

bmcrae commented 10 years ago

Just added 18 more points to the 6m large test case in BigTests repo. This will create lots of i/o (190 pairs total).
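
The pair count grows quadratically with the number of points: n points give n(n-1)/2 pairs. The check below assumes the test case now holds 20 points in total, which is inferred from the 190 pairs quoted above rather than stated in the thread.

```python
from itertools import combinations

n_points = 20  # assumption: inferred from the 190 pairs quoted above
pairs = list(combinations(range(n_points), 2))  # every unordered pair once
pair_count = n_points * (n_points - 1) // 2
```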

ViralBShah commented 10 years ago

Great. This test case is good to have. If we are spending lots of time in amg hierarchy and solver, then we are doing the right thing. :-)

I will look into speeding it up, but I am not terribly hopeful.

Let's get rid of GDAL altogether. We should use numpy save for internal usage and API calls, and savetxt otherwise.
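
A sketch of how such a split could look: binary `np.save` when the target is a `.npy` file, and a text grid with an ASCII-grid style header via `np.savetxt` otherwise. The `write_map` name, its parameters, and the extension-based dispatch are assumptions, not the existing Circuitscape writer.

```python
import os
import tempfile

import numpy as np


def write_map(path, data, cellsize=1.0, xll=0.0, yll=0.0, nodata=-9999):
    """Write binary for .npy paths, otherwise a text grid with a header."""
    if path.endswith(".npy"):
        np.save(path, data)
        return
    nrows, ncols = data.shape
    header = "\n".join([
        f"ncols {ncols}",
        f"nrows {nrows}",
        f"xllcorner {xll}",
        f"yllcorner {yll}",
        f"cellsize {cellsize}",
        f"NODATA_value {nodata}",
    ])
    # comments="" keeps savetxt from prefixing the header lines with "# ".
    np.savetxt(path, data, fmt="%.6g", header=header, comments="")


out_dir = tempfile.mkdtemp()
current_map = np.array([[1.5, 2.0, 3.0], [4.0, 5.0, 6.0]])
npy_path = os.path.join(out_dir, "curmap.npy")  # hypothetical file names
asc_path = os.path.join(out_dir, "curmap.asc")
write_map(npy_path, current_map)
write_map(asc_path, current_map)
```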

tanmaykm commented 10 years ago

Oops, I had closed this accidentally with my commit. Reopening.

ViralBShah commented 10 years ago

@bmcrae Could you also add a few more points to the 1m BigTests, so that it will test more pairs?

bmcrae commented 10 years ago

Just added a 20-point test case. Let me know if you would like a different number and I will add that too.
