Just tried to see if getting rid of temporary files can make PyGMT faster. The results are quite promising. For `grd2xyz`, PR #2729 is ~35% faster (~130 ms in PR #2729 vs ~200 ms in the main branch):
For the main branch:

```
In [1]: import pygmt

In [2]: %timeit -n 10 -r 5 xyz = pygmt.grd2xyz("@earth_relief_01d_g", output_type="pandas")
200 ms ± 3.61 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)

In [3]: %timeit -n 10 -r 5 xyz = pygmt.grd2xyz("@earth_relief_01d_g", output_type="pandas")
213 ms ± 17.9 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)

In [4]: %timeit -n 10 -r 5 xyz = pygmt.grd2xyz("@earth_relief_01d_g", output_type="pandas")
217 ms ± 6.09 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)
```
For PR #2729:

```
In [2]: import pygmt

In [3]: %timeit -n 10 -r 5 xyz = pygmt.grd2xyz("@earth_relief_01d_g", output_type="pandas")
146 ms ± 15.6 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)

In [4]: %timeit -n 10 -r 5 xyz = pygmt.grd2xyz("@earth_relief_01d_g", output_type="pandas")
132 ms ± 1.81 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)

In [5]: %timeit -n 10 -r 5 xyz = pygmt.grd2xyz("@earth_relief_01d_g", output_type="pandas")
131 ms ± 1.42 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)
```
Nice, good to see those benchmarks! I've been wondering if we should use pytest-benchmark to get an idea of speedup over time. There's this pytest-codspeed / CodSpeedHQ/action tooling that would allow us to track these benchmarks over time, but the downsides are:

- We'll need to manually mark the tests to benchmark with `@pytest.mark.benchmark`.
- The tests might run slower, because the benchmark runs a single test multiple times to get the mean/standard deviation values.

Maybe we can selectively mark some of the tests we want to track from the list above at https://github.com/GenericMappingTools/pygmt/issues/2730#issue-193332360? I can try to set up the CI for this in the coming weekend.
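For example, marking a test for CodSpeed tracking would look something like this (a hypothetical test, not an existing one; pytest-codspeed picks up the marker when running under `pytest --codspeed` in CI, and the test runs normally otherwise):

```python
import pytest
import pygmt


# pytest-codspeed measures tests carrying this marker in CI;
# in a plain local run the test executes once as usual.
@pytest.mark.benchmark
def test_grd2xyz():
    xyz = pygmt.grd2xyz("@earth_relief_01d_g", output_type="pandas")
    assert not xyz.empty
```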
> I've been wondering if we should use pytest-benchmark to get an idea of speedup over time. There's this pytest-codspeed / CodSpeedHQ/action tooling that would allow us to track these benchmarks over time, but the downsides are:
>
> - We'll need to manually mark the tests to benchmark with `@pytest.mark.benchmark`.
> - The tests might run slower, because the benchmark runs a single test multiple times to get the mean/standard deviation values.
Sounds interesting. Manually marking tests sounds OK to me. For your second point, since performance is not a high-priority issue, maybe we should only run the benchmarks on the main branch. Technically, it's possible to override the `@pytest.mark.benchmark` marker so that it does nothing outside the main-branch CI.
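One way to keep the marker inert locally, as a minimal sketch (a hypothetical `conftest.py`; the real CI wiring would differ):

```python
# conftest.py
def pytest_configure(config):
    # Register the marker so local runs without the pytest-codspeed plugin
    # don't emit unknown-marker warnings; the marker itself then does nothing
    # unless the plugin is active (e.g. in main-branch CI).
    config.addinivalue_line(
        "markers", "benchmark: tests tracked by CodSpeed in main-branch CI"
    )
```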
Here are some more complete benchmarks:
```python
import numpy as np
import pandas as pd

from pygmt.clib import Session
from pygmt.helpers import GMTTempFile

# Create a bunch of test files with different numbers of rows
for nrows in [1, 10, 100, 1000, 10000, 100000, 1000000]:
    data = np.random.random((nrows, 3))
    np.savetxt(fname=f"test-{nrows}.txt", X=data)


def tmpfile(fname):
    """
    The old way: write to a temporary file and then read it using pd.read_csv.
    """
    with Session() as lib:
        with GMTTempFile(suffix=".txt") as tmpfile:
            lib.call_module("read", f"{fname} {tmpfile.name} -Td")
            df = pd.read_csv(tmpfile.name, sep=" ", comment=">")


def vfile(fname):
    """
    The new way: write to a virtual file and then convert it to pandas.
    """
    with Session() as lib:
        with lib.virtualfile_out(kind="dataset") as vouttbl:
            lib.call_module("read", f"{fname} {vouttbl} -Td")
            df = lib.virtualfile_to_dataset(output_type="pandas", vfname=vouttbl)
```
Calling the above functions using IPython's magic command `%timeit`:
```python
for nrows in [1, 10, 100, 1000, 10000, 100000, 1000000]:
    print(f"nrows={nrows}")
    fname = f"test-{nrows}.txt"
    %timeit tmpfile(fname)
    %timeit vfile(fname)
    print()
```
Here are the outputs (on macOS):
```
nrows=1
18.9 ms ± 1.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
16.5 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

nrows=10
18.4 ms ± 3.17 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
15.3 ms ± 532 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

nrows=100
17.7 ms ± 735 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
15.6 ms ± 561 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

nrows=1000
22.3 ms ± 937 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
16.3 ms ± 603 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

nrows=10000
41.1 ms ± 3.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
23 ms ± 305 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

nrows=100000
304 ms ± 16.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
116 ms ± 3.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

nrows=1000000
2.94 s ± 43.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
862 ms ± 61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
So, we can conclude that the `vfile` method is faster than the `tmpfile` method, especially for large data files.
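For quick reference, the approximate speedup ratios computed from the timings above (tmpfile time divided by vfile time):

| nrows | tmpfile | vfile | speedup |
|------:|--------:|------:|--------:|
| 1 | 18.9 ms | 16.5 ms | ~1.1× |
| 10 | 18.4 ms | 15.3 ms | ~1.2× |
| 100 | 17.7 ms | 15.6 ms | ~1.1× |
| 1,000 | 22.3 ms | 16.3 ms | ~1.4× |
| 10,000 | 41.1 ms | 23 ms | ~1.8× |
| 100,000 | 304 ms | 116 ms | ~2.6× |
| 1,000,000 | 2.94 s | 862 ms | ~3.4× |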
- [ ] `x2sys_cross`
I had a look at refactoring `x2sys_cross` to use virtualfiles instead of temporary files, but it's a little tricky because:

Output: The `virtualfile_to_dataset` method from #3083 was able to produce a `pandas.DataFrame` output, but the column names were missing. The logic for handling `x2sys_cross`'s output is actually complicated. Important things to handle are:

1. The `datetime64` dtype.
2. `x2sys_cross` may output multi-segment parts (see https://docs.generic-mapping-tools.org/6.5/supplements/x2sys/x2sys_cross.html#remarks) when multiple tracks are passed in and `-Qe` (external COEs) is selected. It's unclear how this is handled in GMT virtualfiles (note that we actually just merge all the multi-segments into one table when `pandas.DataFrame` output is selected; output to a file will preserve the segments though).
3. The last two column names can be either `z_X`/`z_M` or `z_1`/`z_2`, depending on whether the trackvalues/`-Z` argument is set.

It should be possible to handle 1 and 3 somehow, but I'm not so sure about 2, since it will involve checking how GMT outputs virtualfiles in `x2sys_cross`. We'll need to do some careful checking to ensure the refactoring doesn't modify the output and make it incorrect.
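Point 3, at least, looks mechanical to handle. A minimal sketch, using a hypothetical helper (not PyGMT API), and assuming that trackvalues/`-Z` is what selects the per-track values:

```python
def last_two_columns(trackvalues: bool) -> list[str]:
    # Assumption: trackvalues=True (-Z) reports the values of each track
    # (z_1/z_2); otherwise the crossover and mean values (z_X/z_M).
    return ["z_1", "z_2"] if trackvalues else ["z_X", "z_M"]
```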
> The last two column names can be either `z_X`/`z_M` or `z_1`/`z_2`, depending on whether the trackvalues/`-Z` argument is set.
The column names are available as the table header in `GMT_DATASET`, so we can parse the table header and get the column names automatically. See https://github.com/GenericMappingTools/pygmt/pull/3117/files for a proof of concept. We just need to borrow some ideas from the `pd.read_csv` function and carefully design the API.
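As an illustration only (not the #3117 implementation, and the header line below is a made-up example), the parsing idea could look like:

```python
def parse_column_names(header: str) -> list[str]:
    # Strip the leading '#' comment marker, then split the remaining text
    # into column names (assuming tab-separated names, as in the example).
    return header.lstrip("#").strip().split("\t")


# Hypothetical header line resembling x2sys_cross output:
print(parse_column_names("# x\ty\tt_1\tt_2\tdist_1\tdist_2\tz_X\tz_M"))
# ['x', 'y', 't_1', 't_2', 'dist_1', 'dist_2', 'z_X', 'z_M']
```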
I'm closing the issue since we have refactored most wrappers to use virtual files for output. The only exceptions are `info`/`grdinfo`/`x2sys_cross`/`grdcut`, which will be tracked in their own issues/PRs.
One of the project goals is to interface with the GMT C API directly using ctypes, without system calls. This lets us work on data in memory, which is more efficient than working on files on disk.
However, PyGMT currently still writes tables/grids/CPTs into temporary files. With PRs #2729 and #2398 (not merged yet), we can write data into memory and then work on the in-memory data directly. Thus, it's possible to get rid of temporary files completely.
This issue is the central place to track all the functions/methods that need to be refactored.
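For context, a minimal sketch of what interfacing with the GMT C API via ctypes means (the library filename and the `mode` flag are assumptions; PyGMT's `clib` layer handles session creation properly):

```python
import ctypes

# Assumption: libgmt is findable under this name (it varies by platform,
# e.g. libgmt.dylib on macOS or gmt.dll on Windows).
libgmt = ctypes.CDLL("libgmt.so")

# GMT_Create_Session(tag, pad, mode, print_func) is part of the GMT C API.
libgmt.GMT_Create_Session.restype = ctypes.c_void_p
libgmt.GMT_Create_Session.argtypes = [
    ctypes.c_char_p, ctypes.c_uint, ctypes.c_uint, ctypes.c_void_p
]
session = libgmt.GMT_Create_Session(b"sketch", 2, 0, None)  # mode=0 is an assumption
```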
Modules writing tables:

- `blockm*` #3103
- `filter1d` #3085
- `grdinfo` [Will be tracked in #593]
- `grdtrack` #3106
- `triangulate` #3107
- `grdvolume` #3102
- `select` #3108
- `which` #3148
- `grdhisteq` #3109
- `x2sys_cross` [Will be tracked in #3160]
- `info` [Will be tracked in #3159]
- `project` #3110
- `grd2xyz` #3097

Modules writing grids:

Most modules are refactored in #2398, except `grdcut`:

- `dimfilter`
- `grdclip`
- `sphinterpolate`
- `grdsample`
- `grdfilter`
- `grdproject`
- `grdgradient`
- `grdfill`
- `triangulate`
- `sphdistance`
- `xyz2grd`
- `grdcut` [#3115]
- `grdhisteq`
- `nearneighbor`
- `sph2grd`
- `grdlandmask`
- `binstats`
- `surface`