Closed ViralBShah closed 10 years ago
@tanmaykm Do you think that if we use `mmap` to read the files, there will be savings in memory or higher performance? Or, quite likely, since the files are text files, we cannot mmap them.
Yes, I doubt that `mmap` would help much with the current file formats, as these are text files and not direct memory representations.
It is possible to map arrays (including numpy arrays) to memory though. If these formats are interoperable with other software that may use outputs from circuitscape, we can probably use them.
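As a rough illustration of mapping a numpy array to memory, here is a minimal sketch using `np.load` with `mmap_mode`. The file name and the tiny grid are hypothetical and chosen only for the example; nothing here is taken from the Circuitscape code.

```python
import os
import tempfile

import numpy as np

# Hypothetical file name and grid size, chosen only for illustration.
path = os.path.join(tempfile.mkdtemp(), "grid.npy")
grid = np.arange(12, dtype=np.float64).reshape(3, 4)
np.save(path, grid)  # binary .npy: a small header followed by the raw array bytes

# mmap_mode="r" maps the file read-only instead of loading it all into RAM;
# data is read from disk only when the corresponding elements are accessed.
mapped = np.load(path, mmap_mode="r")
row_sum = float(mapped[1].sum())  # touches only the second row: 4 + 5 + 6 + 7
```

This only works because the `.npy` payload is a direct memory representation, which is exactly what the text grid formats lack.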
Numpy arrays aren't really interoperable with other software, unfortunately. I have code in Circuitscape to read/write them because I have an ArcGIS tool (Linkage Mapper) that passes data to/from Circuitscape. But I essentially have to use my own protocol, passing header information in a separate text file. I use Numpy arrays because they are much faster to read/write than text files (although GDAL could make up for this). GDAL might add considerably to the package size, if I remember correctly.
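To make the "header in a separate text file" idea concrete, here is a hedged sketch of one way such a protocol could look. The function names, the `.hdr` extension, and the one-pair-per-line layout are all assumptions made for illustration; they are not the protocol Linkage Mapper or Circuitscape actually uses. The header keys are borrowed from the ESRI ASCII grid format.

```python
import os
import tempfile

import numpy as np


def write_grid(base, data, header):
    """Write `data` to base.npy and `header` to base.hdr (hypothetical layout)."""
    np.save(base + ".npy", data)
    with open(base + ".hdr", "w") as f:
        for key, value in header.items():
            # One "key value" pair per line; repr() of a float round-trips exactly.
            f.write("%s %r\n" % (key, float(value)))


def read_grid(base):
    """Read back the array and header written by write_grid."""
    header = {}
    with open(base + ".hdr") as f:
        for line in f:
            key, value = line.split()
            header[key] = float(value)
    return np.load(base + ".npy"), header


# Usage with a made-up 2x3 grid and made-up georeferencing values.
base = os.path.join(tempfile.mkdtemp(), "resistance")
header = {"xllcorner": 0.0, "yllcorner": 0.0, "cellsize": 30.0, "NODATA_value": -9999.0}
data = np.ones((2, 3))
write_grid(base, data, header)
data2, header2 = read_grid(base)
```

The array itself stays in the fast binary format, and only the handful of georeferencing values go through text.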
Brad McRae, Ph.D. The Nature Conservancy North America Region Tel: 541-223-1170 email: bmcrae@tnc.org http://www.nature.org/ourscience/brad-mcrae.xml
Did an IO test with numpy binary `save`, `savetxt`, and `gdal`. Numpy binary `save` is the fastest. GDAL seems to be the slowest of the lot. Numpy arrays seem to be the best way to exchange data, with the binary format a big bonus where we can have something like the linkage mapper for ArcGIS on the other side.
numpy savetxt time=3.02454590797
gdal time=5.78220915794
numpy save time=0.459983110428
I was using `GDAL-1.10.1`. I guess the results indicate we should remove the `GDAL` option altogether.
Here's the code I used for the test: https://gist.github.com/tanmaykm/7418142
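For readers without access to the gist, a comparison of this kind could be sketched roughly as follows. This is not the gist's code: the grid size and the `time_write` helper are assumptions, and GDAL is left out so that the example depends only on numpy.

```python
import os
import tempfile
import time

import numpy as np


def time_write(write, path, data):
    """Return the wall-clock seconds taken by write(path, data)."""
    start = time.perf_counter()
    write(path, data)
    return time.perf_counter() - start


# Hypothetical grid size; the size used in the original test is not stated here.
data = np.random.rand(200, 200)
tmp = tempfile.mkdtemp()
npy_path = os.path.join(tmp, "grid.npy")
txt_path = os.path.join(tmp, "grid.txt")

t_save = time_write(np.save, npy_path, data)        # binary .npy
t_savetxt = time_write(np.savetxt, txt_path, data)  # whitespace-delimited text

print("numpy save time=%g" % t_save)
print("numpy savetxt time=%g" % t_savetxt)
```

Absolute numbers will vary with grid size, disk, and library versions, so only the relative ordering on a given machine is meaningful.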
Great info, thanks!
To me it looks like IO is a relatively minor speed issue. Right now, reading a 96m cell grid takes 130 sec. That compares with 815 sec to construct_g_graph, and a total of 1347 sec for construct_component map.
True. IO is not an issue in comparison to the other stuff, and there seem to be ways to tackle IO when required.
Brad, can you provide a benchmark case that will exercise loss of file IO on a decent problem size? This could be in the BigTests repo.
I mean lots of file io.
Also writing is likely to be much slower than reading. The test case should be about writing lots of large current maps.
Sure, I can make such a test case. I just did some tests with a 6m problem and IO is pretty minimal. Reading maps takes 6-8 secs, writing them takes 3. This is out of ~400 seconds total. The long apparent time for current mapping is all taken up in calculating the map values themselves based on voltages. So there's probably a lot more to be gained from making that process (generating current map values) more efficient than making the writeaaigrid process more efficient.
So, speed hurdles seem to be: amg hierarchy, solver, and creating (not writing) current maps.
I should note that I'm using an SSD. That could explain fast read/write times on my end.
Just added 18 more points to the 6m large test case in BigTests repo. This will create lots of I/O (190 pairs total).
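The figure of 190 pairs is consistent with 20 points in total, since n points give n(n-1)/2 unordered pairs. A quick check, assuming the test case had 2 points before the 18 were added (that starting count is inferred from the pair count, not stated in the thread):

```python
from itertools import combinations

n_points = 20  # assumed total: the 18 added points plus 2 already present
pairs = list(combinations(range(n_points), 2))  # every unordered pair of points
n_pairs = len(pairs)  # n * (n - 1) / 2 = 20 * 19 / 2
```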
Great. This test case is good to have. If we are spending lots of time in amg hierarchy and solver, then we are doing the right thing. :-)
I will look into speeding it up, but I am not terribly hopeful.
Let's get rid of GDAL altogether. We should use numpy `save` for internal usage and API calls, and `savetxt` otherwise.
Oops, I had closed this accidentally with my commit. Reopening.
@bmcrae Could you also add a few more points to the 1m BigTests, so that it will test more pairs?
Just added a 20-point test case. Let me know if you would like a different number and I will add that too.
GDAL used to be faster than numpy's IO routines at one point. We need to experiment with GDAL again to see if it is worthwhile to use, and remove the code for calling GDAL otherwise.