georust / rstar

R*-tree spatial index for the Rust ecosystem
https://docs.rs/rstar
Apache License 2.0
384 stars 67 forks source link

`attempt to multiply with overflow` when using `i32` #138

Open hamirmahal opened 8 months ago

hamirmahal commented 8 months ago

https://github.com/hamirmahal/rstar-i32/ has some code that reproduces the issue.

hamirmahal commented 8 months ago

https://github.com/hamirmahal/rstar-i32/actions/runs/6836250994/job/18590953607#step:3:43 also shows the output on a machine that isn't mine, although you have to be signed in to see it via this link.

hamirmahal commented 8 months ago

I'm not sure if the error is in my code, or in rstar. Am I using rstar incorrectly? src/main.rs outputs

$  cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/rstar-boilerplate`
thread 'main' panicked at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/ops/arith.rs:346:1:
attempt to multiply with overflow
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Rust code that resulted in the above output
```rust use rstar::Point; #[derive(Clone, Copy, Debug, PartialEq)] struct Data { coordinates: [i32; 18], } impl Point for Data { type Scalar = i32; const DIMENSIONS: usize = 18; fn nth(&self, index: usize) -> Self::Scalar { self.coordinates[index] } fn nth_mut(&mut self, index: usize) -> &mut Self::Scalar { &mut self.coordinates[index] } fn generate(mut generator: impl FnMut(usize) -> Self::Scalar) -> Self { Data { coordinates: [ generator(0), generator(1), generator(2), generator(3), generator(4), generator(5), generator(6), generator(7), generator(8), generator(9), generator(10), generator(11), generator(12), generator(13), generator(14), generator(15), generator(16), generator(17), ], } } } fn main() { let mut tree = rstar::RTree::new(); tree.insert(Data { coordinates: [ 30, 30, 62, 30, 31, 30, 20, 12, 30, 80, 40, 132, 28, 78, 140, 139, 195, 67, ], }); tree.insert(Data { coordinates: [ 40, 9, 9, 144, 136, 149, 148, 140, 198, 36, 50, 167, 179, 234, 38, 2, 103, 38, ], }); tree.insert(Data { coordinates: [ 38, 38, 38, 6, 2, 3, 155, 61, 61, 195, 155, 15, 70, 134, 158, 126, 94, 63, ], }); tree.insert(Data { coordinates: [ 154, 154, 138, 179, 121, 75, 143, 31, 7, 67, 11, 3, 113, 113, 65, 65, 73, 65, ], }); tree.insert(Data { coordinates: [ 221, 215, 209, 115, 210, 198, 224, 236, 111, 6, 7, 7, 85, 92, 203, 197, 36, 44, ], }); tree.insert(Data { coordinates: [ 176, 240, 176, 176, 112, 241, 240, 240, 192, 44, 44, 12, 206, 204, 142, 4, 44, 2, ], }); tree.insert(Data { coordinates: [ 212, 218, 202, 70, 70, 23, 23, 23, 1, 230, 250, 55, 30, 23, 60, 60, 92, 16, ], }); } ```
urschrei commented 8 months ago

rstar uses a fold to calculate the area of the envelope (an AABB) of the geometry (In your case, the 18-dimensional point you're trying to insert on line 78). Fold uses an accumulator and a multiplication function:

diag.fold(one, |acc, cur| {
    max_inline(cur, zero) * acc
})

During the insertion operation on line 78 of your code, you can see the problem:

Acc: 359131136 Cur: 68.

Multiplying acc with cur overflows i32. That's a panic in debug builds.

hamirmahal commented 8 months ago

Ah, because the product, 24,420,917,248, is larger than the max i32 value, 2,147,483,647.

If each coordinate is a u8, the max possible value when multiplying them all together is 255^numCoordinates, which is currently 18, but the number of coordinates could easily double, or triple, for my use case.

So the maximum possible product currently is 255^18 = 2.078371831996e43, which I don’t even think the maximum unsigned integer size, u128, can support, since I think that’s approximately 3.402823669209e38.

Is calculating the area of the bounding box required to use an R-tree? I’m wondering if it’s possible to write an implementation that avoids doing this calculation altogether to avoid running the risk of overflow.

Alternatively, would checking for overflow before the multiplication, and setting the product to u128::MAX if there would be an overflow, result in any weird or unexpected behavior? I’m also considering that approach.

During the insertion operation on line 78 of your code, you can see the problem:

Acc: 359131136 Cur: 68.

Multiplying acc with cur overflows i32. That's a panic in debug builds.

Hmm... where does the current value of 68 come from? I don't see any elements in https://github.com/hamirmahal/rstar-i32/commit/aef70723a73ca76c71387ef93d825b7afcf5a0ea#diff-42cb6807ad74b3e201c5a7ca98b911c5fa08380e942be6e4ac5807f8377f87fc with the value 68.

Also, how were you able to see the current and total values in fold?

I tried adding

        diag.fold(one, |acc, cur| {
            std::println!("acc: {}, cur: {}", acc, cur);
            return max_inline(cur, zero) * acc;
        })

to ~/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rstar-0.11.0/src/aabb.rs, but that gave me this error.

ailed to resolve: use of undeclared crate or module `std`
use of undeclared crate or module `std`rustc

Thanks for the detailed response.

michaelkirk commented 8 months ago

Just to clarify: You're using 18-dimensional coordinates, is that right?

michaelkirk commented 8 months ago

Ok, I see your example: https://github.com/hamirmahal/rstar-i32/commit/aef70723a73ca76c71387ef93d825b7afcf5a0ea#diff-42cb6807ad74b3e201c5a7ca98b911c5fa08380e942be6e4ac5807f8377f87fc

So you are using 18-D coordinates.

Just to ask a silly question: Did you really intend to have a single coordinate with 18 dimensions, or were you trying to have something like 9 2-D (XY) coordinates?

hamirmahal commented 8 months ago

Did you really intend to have a single coordinate with 18 dimensions, or were you trying to have something like 9 2-D (XY) coordinates?

This is correct.

hamirmahal commented 8 months ago

I'm actually trying to find the element in my dataset with the closest Hamming distance to some input.

The things I'm comparing currently have 18 bytes, hence the 18 coordinates that can each be as large as 255, and it looks like we may need to increase the number of bytes we store to 32, or possibly even more.

Do you think R-trees are a suitable data structure for this purpose?

michaelkirk commented 8 months ago

Do you think R-trees are a suitable data structure for this purpose?

Well, I can say with certainty that this particular R-Tree implementation is apparently not currently suitable for this purpose. 🤣

But more seriously... I've never worked with hamming distance. The definition I searched up was:

While comparing two binary strings of equal length, Hamming distance is the number of bit positions in which the two bits are different.

So, hamming distance is about bitwise distance - not euclidean distance.

In other words:

hamming_distance(0b_1000, 0b_0000) == 1  // Near!
hamming_distance(0b_1000, 0b_0111) == 4  // Far!

But with typical numeric distance we get a very different answer:

numeric_distance(0b_1000, 0b_0000) == 8  // Far!
numeric_distance(0b_1000, 0b_0111) == 1  // Near!

So intuitively, I don't understand how you'd use an rtree, which deals with euclidean numeric distance, to efficiently compute hamming distance.

urschrei commented 8 months ago

Hmm. Hamming distance satisfies the triangle inequality, so I think it's possible to use R[*]-trees, but I have no idea how we're going to make it work for rstar…

hamirmahal commented 8 months ago

Do you think R-trees are a suitable data structure for this purpose?

Well, I can say with certainty that this particular R-Tree implementation is apparently not currently suitable for this purpose. 🤣

Ha!

But more seriously... I've never worked with hamming distance. The definition I searched up was:

While comparing two binary strings of equal length, Hamming distance is the number of bit positions in which the two bits are different.

So, hamming distance is about bitwise distance - not euclidean distance.

In other words:

hamming_distance(0b_1000, 0b_0000) == 1  // Near!
hamming_distance(0b_1000, 0b_0111) == 4  // Far!

But with typical numeric distance we get a very different answer:

numeric_distance(0b_1000, 0b_0000) == 8  // Far!
numeric_distance(0b_1000, 0b_0111) == 1  // Near!

That all sounds correct.

So intuitively, I don't understand how you'd use an rtree, which deals with euclidean numeric distance, to efficiently compute hamming distance.

I ended up doing

impl PointDistance for Data {
    fn distance_2(&self, point: &[f32; 18]) -> f32 {
        let hamming_distance: f32 = self
            .hash
            .as_slice()
            .iter()
            .zip(point)
            .map(|(l, r)| {
                // cast l and r to u8, because we know that the values are in the range [0, 255].
                let l = *l as u8;
                let r = *r as u8;
                (l ^ r).count_ones() as f32
            })
            .sum();

        // We must return the squared distance!
        hamming_distance * hamming_distance
    }
}

but it seems to be quite computationally expensive, and not very efficient.

I used f32s instead of i32s, even though I'm only dealing with u8s, because rstar doesn't support u8s out of the box, and because I didn't get attempt to multiply with overflow when using f32s.

urschrei commented 8 months ago

What's the hash for?

hamirmahal commented 8 months ago

Oh, so I'm ideally trying to find the closest Hamming distance between &self's hash, which is just 18 bytes, and every other hash in my dataset.

urschrei commented 8 months ago

Hm. If you're using the default Hash impl, you're using a very high-quality (and expensive) hashing algorithm, which I would imagine is complete overkill in this case.

hamirmahal commented 8 months ago

Ah, I see… I’ll look into some different approaches.

urschrei commented 8 months ago

https://nnethercote.github.io/2021/12/08/a-brutally-effective-hash-function-in-rust.html FxHash will certainly be faster (https://crates.io/crates/fxhash)

rmanoka commented 8 months ago

Do we support bnum ? It supports a u256, which should suffice for above calc?

adamreichold commented 8 months ago

I think before we go ahead and make this crate more difficult to maintain, it should be established whether R* trees are actually suitable to querying Hamming distances.

While it could be innovative cross-pollination between different areas of mathematics, I am somewhat doubtful that the geometric intuitions underpinning the construction of R* trees transfer over to code spaces. Generally, there exist a lot of metric spaces which are decidedly weird compared to the two- or three-dimensional Euclidean space our intuition is still based on even when alternative metrics like Manhattan are used.

So, I think the first step would be to just modify the source of rstar in a local fork to try to hack around the limits of the current design, e.g. use saturating arithmetic if that is helpful or use an high (e.g. bnum) or arbitrary (e.g. num-bigint) precision number library. Then this hacky local fork can be used to figure out if R* trees can be useful for the given problem and based on these experiences we can decide whether and how to extend this library to support this use case.

(But if the changes are too extensive, the answer might still be no as as maintainers, saying no is often what we need to do to keep complexity in check. We have also have represent the interests of our current user base which I guess is mostly GIS, games and then some simulation stuff in order of popularity.)

Personally, my first guess when reading the original problem would be to look into Levenshtein automata which are used for typo resistant search, i.e. to find a matching term with the fewest possible edits away from the query term.

urschrei commented 8 months ago

Oh, I don't think anyone was actually proposing a change. Certainly when I said "make it work" I didn't mean it in the sense of modification.

rmanoka commented 8 months ago

@hamirmahal I think you can get it working by just converting the [u8; 18] slice into a [f64; 18] slice before storing it in the hash variable. f64 or f32, with float should support the range you need. There'll be some loss in precision in that calculation, but I think it's only used for re-balancing of the r-tree that is constructed (and so must still work, with possibly poorer performance).

hamirmahal commented 8 months ago

@rmanoka I'll give that a look. Thanks for the suggestion.