GenericMappingTools / pygmt

A Python interface for the Generic Mapping Tools.
https://www.pygmt.org
BSD 3-Clause "New" or "Revised" License

Get rid of temporary files from pygmt functions and plotting methods #2730

Closed · seisman closed this issue 6 months ago

seisman commented 1 year ago

One of the project goals is to interface with the GMT C API directly using ctypes, without system calls. This lets us work on data in memory, which is more efficient than working on files on disk.

However, PyGMT currently still writes tables/grids/CPTs into temporary files. With PRs #2729 and #2398 (not merged yet), we can write data into memory and then work on the in-memory data directly. Thus, it's possible to get rid of temporary files completely.
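As a rough illustration of the data-in-memory flow on the input side (a minimal sketch, assuming the Session.virtualfile_in API available in recent PyGMT versions; the array is dummy data):

import numpy as np
from pygmt.clib import Session

# Pass an in-memory NumPy array to a GMT module through a virtual file,
# without writing it to a file on disk first.
data = np.random.random((5, 3))
with Session() as lib:
    with lib.virtualfile_in(check_kind="vector", data=data) as vintbl:
        lib.call_module("info", vintbl)  # prints the data's min/max ranges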

This issue report is the central place to track all the functions/methods that need to be refactored.

Modules writing tables

Modules writing grids

Most modules were refactored in #2398, except grdcut.

seisman commented 10 months ago

Just tried to see if getting rid of temporary files can make PyGMT faster. The results are quite promising. For grd2xyz, PR #2729 is ~35% faster (~130 ms in PR #2729 vs ~200 ms in the main branch):

For the main branch:

In [1]: import pygmt

In [2]: %timeit -n 10 -r 5 xyz = pygmt.grd2xyz("@earth_relief_01d_g", output_type="pandas")
200 ms ± 3.61 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)

In [3]: %timeit -n 10 -r 5 xyz = pygmt.grd2xyz("@earth_relief_01d_g", output_type="pandas")
213 ms ± 17.9 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)

In [4]: %timeit -n 10 -r 5 xyz = pygmt.grd2xyz("@earth_relief_01d_g", output_type="pandas")
217 ms ± 6.09 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)

For PR #2729:

In [2]: import pygmt

In [3]: %timeit -n 10 -r 5 xyz = pygmt.grd2xyz("@earth_relief_01d_g", output_type="pandas")
146 ms ± 15.6 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)

In [4]: %timeit -n 10 -r 5 xyz = pygmt.grd2xyz("@earth_relief_01d_g", output_type="pandas")
132 ms ± 1.81 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)

In [5]: %timeit -n 10 -r 5 xyz = pygmt.grd2xyz("@earth_relief_01d_g", output_type="pandas")
131 ms ± 1.42 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)

weiji14 commented 10 months ago

Nice, good to see those benchmarks! I've been wondering if we should use pytest-benchmark to get an idea of speedups over time. There's the pytest-codspeed plugin (with the CodSpeedHQ/action CI action) that would allow us to track these benchmarks over time, but the downsides are:

  1. We'll need to manually mark the tests to benchmark with @pytest.mark.benchmark
  2. ~The tests might run slower, because benchmarking runs a single test multiple times to get the mean/standard deviation values.~ Edit: Actually no, CodSpeed only runs each benchmark once; see https://codspeed.io/blog/pinpoint-performance-regressions-with-ci-integrated-differential-profiling.

Maybe we can selectively mark some of the tests we want to track from the list above at https://github.com/GenericMappingTools/pygmt/issues/2730#issue-193332360? I can try to set up the CI for this in the coming weekend.
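For reference, a minimal sketch of what a marked test could look like (assuming pytest-codspeed is installed; the test name and assertion are illustrative, not from PyGMT's actual test suite):

import pytest

import pygmt

@pytest.mark.benchmark
def test_benchmark_grd2xyz():
    # CodSpeed instruments this test only when pytest is invoked with
    # --codspeed (e.g., in a CI job); otherwise it runs once as a normal test.
    xyz = pygmt.grd2xyz("@earth_relief_01d_g", output_type="pandas")
    assert not xyz.empty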

seisman commented 10 months ago

> I've been wondering if we should use pytest-benchmark to get an idea of speedups over time. There's the pytest-codspeed plugin (with the CodSpeedHQ/action CI action) that would allow us to track these benchmarks over time, but the downsides are:
>
>   1. We'll need to manually mark the tests to benchmark with @pytest.mark.benchmark
>   2. The tests might run slower, because benchmarking runs a single test multiple times to get the mean/standard deviation values.

Sounds interesting. Manually marking tests sounds OK to me. For your point 2, since performance is not a high-priority concern, maybe we should only run the benchmarks on the main branch. Technically, it's possible to override the @pytest.mark.benchmark marker so that it does nothing when not running in CI on the main branch.
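One possible way to do that (a minimal sketch, not PyGMT's actual setup, assuming pytest-codspeed is only enabled via the --codspeed flag in a main-branch CI job) is to register the marker in conftest.py so that it stays inert in local runs:

# conftest.py
def pytest_configure(config):
    # Register the "benchmark" marker so that local runs without pytest-codspeed
    # don't emit "unknown marker" warnings. With pytest-codspeed installed, the
    # marker is only instrumented when pytest is invoked with --codspeed.
    config.addinivalue_line(
        "markers", "benchmark: mark a test for CodSpeed benchmarking in CI"
    )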

seisman commented 8 months ago

Here are some more comprehensive benchmarks:

import numpy as np
import pandas as pd
from pygmt.clib import Session
from pygmt.helpers import GMTTempFile

# Create a bunch of test files with different numbers of rows
for nrows in [1, 10, 100, 1000, 10000, 100000, 1000000]:
    data = np.random.random((nrows, 3))
    np.savetxt(fname=f"test-{nrows}.txt", X=data)

def tmpfile(fname):
    """
    The old way: write to a temporary file and then read it back using
    pd.read_csv.
    """
    with Session() as lib:
        # Name the context variable "tmpf" to avoid shadowing the function name.
        with GMTTempFile(suffix=".txt") as tmpf:
            lib.call_module("read", f"{fname} {tmpf.name} -Td")
            # header=None: GMT's plain-text table output has no header row.
            df = pd.read_csv(tmpf.name, sep=" ", comment=">", header=None)

def vfile(fname):
    """
    The new way: write to a virtual file and then convert it to pandas.
    """
    with Session() as lib:
        with lib.virtualfile_out(kind="dataset") as vouttbl:
            lib.call_module("read", f"{fname} {vouttbl} -Td")
            df = lib.virtualfile_to_dataset(output_type="pandas", vfname=vouttbl)

Calling the above functions using IPython's magic command %timeit:

for nrows in [1, 10, 100, 1000, 10000, 100000, 1000000]:
    print(f"nrows={nrows}")
    fname = f"test-{nrows}.txt"
    %timeit tmpfile(fname)
    %timeit vfile(fname)
    print()

Here are the outputs (on macOS):

nrows=1
18.9 ms ± 1.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
16.5 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

nrows=10
18.4 ms ± 3.17 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
15.3 ms ± 532 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

nrows=100
17.7 ms ± 735 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
15.6 ms ± 561 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

nrows=1000
22.3 ms ± 937 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
16.3 ms ± 603 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

nrows=10000
41.1 ms ± 3.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
23 ms ± 305 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

nrows=100000
304 ms ± 16.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
116 ms ± 3.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

nrows=1000000
2.94 s ± 43.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
862 ms ± 61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So, we can conclude that the vfile method is consistently faster than the tmpfile method, especially for large data files: at nrows=1000000 it is roughly 3.4x faster (862 ms vs 2.94 s).

weiji14 commented 8 months ago

> - [ ] x2sys_cross

I had a look at refactoring x2sys_cross to use virtual files instead of temporary files, but it's a little tricky. The important things to handle are:

  1. Datetime columns need to be parsed correctly as the datetime64 dtype.
  2. x2sys_cross may output multi-segment parts (see https://docs.generic-mapping-tools.org/6.5/supplements/x2sys/x2sys_cross.html#remarks) when multiple tracks are passed in and -Qe (external COEs) is selected. It's unclear how this is handled in GMT virtual files (note that we currently merge all the multi-segments into one table when pandas.DataFrame output is selected, although output to a file preserves the segments).
  3. The last two column names can be either z_X/z_M or z_1/z_2, depending on whether the trackvalues (-Z) argument is set.

It should be possible to handle points 1 and 3 somehow, but I'm not so sure about point 2, since it will involve checking how GMT writes x2sys_cross output to virtual files. We'll need to do some careful checking to ensure the refactoring doesn't alter the output and make it incorrect.
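For point 1, a minimal sketch of one possible approach (assuming the datetime columns arrive from the virtual file as strings; the column name "t_1" is illustrative, not x2sys_cross's actual output):

import pandas as pd

# Convert a string column to the datetime64 dtype after building the DataFrame.
df = pd.DataFrame({"t_1": ["2023-01-01T00:00:00", "2023-01-01T00:00:10"]})
df["t_1"] = pd.to_datetime(df["t_1"])
print(df["t_1"].dtype)  # datetime64[ns]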

seisman commented 7 months ago

> 3. Last two column names can be either z_X/z_M or z_1/z_2, depending on whether the trackvalues (-Z) argument is set.

The column names are available as the table header in GMT_DATASET, so we can parse the table header and get the column names automatically. See https://github.com/GenericMappingTools/pygmt/pull/3117/files for a proof of concept. We just need to borrow some ideas from the pd.read_csv method and carefully design the API.
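As a rough illustration of the idea (a hypothetical sketch, not the code in PR #3117; the header text and tab delimiter are assumptions):

# Parse column names out of a GMT table header line such as
# "# x\ty\tdist\tz_1\tz_2" so they can be passed to pandas.
def column_names_from_header(header: str) -> list[str]:
    return header.lstrip("#").strip().split("\t")

print(column_names_from_header("# x\ty\tdist\tz_X\tz_M"))
# ['x', 'y', 'dist', 'z_X', 'z_M']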

seisman commented 6 months ago

I'm closing this issue since we have refactored most wrappers to use virtual files for output. The only exceptions are info/grdinfo/x2sys_cross/grdcut, which will be tracked in their own issues/PRs.