Ripser / ripser

Ripser: efficient computation of Vietoris–Rips persistence barcodes
http://ripser.org
MIT License
311 stars 75 forks source link

Overflow error on 3D datasets #38

Closed pierre-guillou closed 2 years ago

pierre-guillou commented 3 years ago

Hello and thank you for your work!

I am trying to use Ripser to compute persistence diagrams of rather large 3D datasets, coming from the Open-SciVis-Datasets website.

However, Ripser throws a std::overflow_error when I try to compute persistence pairs up to the dimension 2. If I understand correctly the C++ code, this exception is related to the computation of binomial coefficients which depend on the size of the input distance matrix and the maximum dimension requested.

This issue might be related to https://github.com/Ripser/ripser/issues/25 and https://github.com/Ripser/ripser/issues/32 since they feature similar problems. I already opened an issue in the scikit-tda Python wrapper (I wanted to use Ripser via this Python API) but since this is more related to the C++ code, this issue seems better to belong here.

Below, you will find a Python script I used to trigger this overflow error. It takes a raw file from Open-Scivis-Datasets and iterates over the edges of the cubical complex to generate a sparse matrix in sparse triplet format that is fed to the Ripser executable. Although the diagram is computed as expected with the smallest datasets (nucleon, marschner_lobb, silicium, up to 120k vertices), the error occurs with fuel, neghip (more than 250k vertices) and every larger dataset.

It there a way to circumvent this computation of these binomial coefficients to handle large datasets (up to 1M vertices)?

Thanks in advance for your help, Best regards, Pierre Guillou

import argparse
import subprocess
import time

import numpy as np

def load_raw(input_raw):
    extension = input_raw.split(".")[-1]
    if extension != "raw":
        print("Need a .raw file")
        raise TypeError

    # detect extent and data type from file name
    extent, dtype = input_raw.split(".")[0].split("_")[-2:]
    extent = [int(dim) for dim in extent.split("x")]

    dtype_np = {
        "uint8": np.uint8,
        "int16": np.int16,
        "uint16": np.uint16,
        "float32": np.float32,
        "float64": np.float64,
    }
    dtype = dtype_np[dtype]

    with open(input_raw) as src:
        data = np.fromfile(src, dtype=dtype)
        return data.reshape(extent)

def compute_persistence(sparse_triplets, output_diagram, ripser_executable):
    proc = subprocess.Popen(
        [ripser_executable, "--format", "sparse", "--dim", "2"],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    out, _ = proc.communicate(sparse_triplets)

    with open(output_diagram, "w") as dst:
        dst.write(out)

def build_sparse_triplets(data):
    triplets = list()
    # loop over every edge of the cubical complex to get the sparse matrix triplets
    with np.nditer(data, flags=["multi_index"]) as it:
        for v in it:
            i = np.ravel_multi_index(it.multi_index, data.shape)
            for d in range(data.ndim):
                for s in [-1, 1]:
                    coords = list(it.multi_index)
                    s = coords[d] + s
                    if s < 0 or s >= data.shape[d]:
                        continue
                    coords[d] = s
                    j = np.ravel_multi_index(coords, data.shape)
                    if i < j:  # upper triangle distance matrix
                        triplets.append([i, j, max(data.flat[j], v)])

    return "\n".join([f"{i} {j} {v}" for i, j, v in triplets])

def main(input_raw, output_diagram, ripser_executable):
    data = load_raw(input_raw)
    start = time.time()
    sparse_triplets = build_sparse_triplets(data)
    print(f"Build sparse triplets in {time.time() - start:.2f}s")
    compute_persistence(sparse_triplets, output_diagram, ripser_executable)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Wrapper around Ripser C++")
    parser.add_argument("input_raw", help="Input raw file from OpenSciVis datasets")
    parser.add_argument(
        "-o",
        "--output_diagram",
        help="Output diagram in Ripser format",
        default="out.ripser",
    )
    parser.add_argument(
        "-e",
        "--ripser_executable",
        help="Path to the Ripser executable",
        default="ripser/ripser",
    )
    args = parser.parse_args()

    main(args.input_raw, args.output_diagram, args.ripser_executable)
MassEast commented 2 years ago

Can you also use Ripser on high dimensional data, like 10D?

ulupo commented 2 years ago

Homology dimension k computations for a dataset of size N require that it be possible in principle to enumerate all (k+1)-simplices of the full simplex with N vertices. In your case (k=2) this means enumerating all 3-simplices (which are 4-tuples of vertices), and this requires that the binomial coefficient (N choose 4) not exceed 2^63 - 1 (the maximum index for signed 64-bit integers). With N=250000, N choose 4 is approximately 1.6e20, while 2^63 is about 9.2e18.

[Edit: I may be off by 1 in some of the above, but the main point remains.]

After modifying the backend to work with 128-bit integers, you should be able to run Ripser on your dataset. Both @ubauer and @MonkeyBreaker have tried this at some point AFAIK. It can be done!

ulupo commented 2 years ago

@MassEast the dimensionality of the data is not an obstacle, Ripser simply looks at the matrix of pairwise distances between all pairs of points in the dataset. This takes a little longer to compute in higher dimensions, but only marginally so, especially compared to the time it takes to compute persistence.

julien-tierny commented 6 months ago

After modifying the backend to work with 128-bit integers, you should be able to run Ripser on your dataset. Both @ubauer and @MonkeyBreaker have tried this at some point AFAIK. It can be done!

Thanks @ulupo for the information. @ubauer @MonkeyBreaker would it be possible that you post the diff to get ripser running with 128-bit integers (it doesn't seem like a one-line edit). Thanks!

ubauer commented 6 months ago

There is a branch at https://github.com/Ripser/ripser/tree/128bit implementing this. It should be mostly up to date, and it's actually pretty much a one line edit. Since 128 bit ints seem to be non standardized, this might not work out of the box for some compilers.

julien-tierny commented 6 months ago

Thanks!