Open hamirmahal opened 1 year ago
https://github.com/hamirmahal/rstar-i32/actions/runs/6836250994/job/18590953607#step:3:43 also shows the output on a machine that isn't mine, although you have to be signed in to see it via this link.
I'm not sure if the error is in my code, or in rstar
. Am I using rstar
incorrectly? src/main.rs
outputs
$ cargo run
Finished dev [unoptimized + debuginfo] target(s) in 0.01s
Running `target/debug/rstar-boilerplate`
thread 'main' panicked at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/ops/arith.rs:346:1:
attempt to multiply with overflow
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
rstar uses a fold to calculate the area of the envelope (an AABB
) of the geometry (In your case, the 18-dimensional point you're trying to insert on line 78). Fold uses an accumulator and a multiplication function:
diag.fold(one, |acc, cur| {
max_inline(cur, zero) * acc
})
During the insertion operation on line 78 of your code, you can see the problem:
Acc: 359131136 Cur: 68
.
Multiplying acc
with cur
overflows i32. That's a panic in debug builds.
Ah, because the product, 24,420,917,248, is larger than the max i32
value, 2,147,483,647
.
If each coordinate is a u8
, the max possible value when multiplying them all together is 255^numCoordinates, which is currently 18, but the number of coordinates could easily double, or triple, for my use case.
So the maximum possible product currently is 255^18 = 2.078371831996e43, which I don’t even think the maximum unsigned integer size, u128
, can support, since I think that’s approximately 3.402823669209e38.
Is calculating the area of the bounding box required to use an R-tree? I’m wondering if it’s possible to write an implementation that avoids doing this calculation altogether to avoid running the risk of overflow.
Alternatively, would checking for overflow before the multiplication, and setting the product to u128::MAX
if there would be an overflow, result in any weird or unexpected behavior? I’m also considering that approach.
During the insertion operation on line 78 of your code, you can see the problem:
Acc: 359131136 Cur: 68
.Multiplying
acc
withcur
overflows i32. That's a panic in debug builds.
Hmm... where does the current value of 68
come from? I don't see any elements in https://github.com/hamirmahal/rstar-i32/commit/aef70723a73ca76c71387ef93d825b7afcf5a0ea#diff-42cb6807ad74b3e201c5a7ca98b911c5fa08380e942be6e4ac5807f8377f87fc with the value 68
.
Also, how were you able to see the current and total values in fold
?
I tried adding
diag.fold(one, |acc, cur| {
std::println!("acc: {}, cur: {}", acc, cur);
return max_inline(cur, zero) * acc;
})
to ~/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rstar-0.11.0/src/aabb.rs
, but that gave me this error.
ailed to resolve: use of undeclared crate or module `std`
use of undeclared crate or module `std`rustc
Thanks for the detailed response.
Just to clarify: You're using 18-dimensional coordinates, is that right?
Ok, I see your example: https://github.com/hamirmahal/rstar-i32/commit/aef70723a73ca76c71387ef93d825b7afcf5a0ea#diff-42cb6807ad74b3e201c5a7ca98b911c5fa08380e942be6e4ac5807f8377f87fc
So you are using 18-D coordinates.
Just to ask a silly question: Did you really intend to have a single coordinate with 18 dimensions, or were you trying to have something like 9 2-D (XY) coordinates?
Did you really intend to have a single coordinate with 18 dimensions, or were you trying to have something like 9 2-D (XY) coordinates?
This is correct.
I'm actually trying to find the element in my dataset with the closest Hamming distance to some input.
The things I'm comparing currently have 18 bytes, hence the 18 coordinates that can each be as large as 255
, and it looks like we may need to increase the number of bytes we store to 32, or possibly even more.
Do you think R-trees are a suitable data structure for this purpose?
Do you think R-trees are a suitable data structure for this purpose?
Well, I can say with certainty that this particular R-Tree implementation is apparently not currently suitable for this purpose. 🤣
But more seriously... I've never worked with hamming distance. The definition I searched up was:
While comparing two binary strings of equal length, Hamming distance is the number of bit positions in which the two bits are different.
So, hamming distance is about bitwise distance - not euclidean distance.
In other words:
hamming_distance(0b_1000, 0b_0000) == 1 // Near!
hamming_distance(0b_1000, 0b_0111) == 4 // Far!
But with typical numeric distance we get a very different answer:
numeric_distance(0b_1000, 0b_0000) == 8 // Far!
numeric_distance(0b_1000, 0b_0111) == 1 // Near!
So intuitively, I don't understand how you'd use an rtree, which deals with euclidean numeric distance, to efficiently compute hamming distance.
Hmm. Hamming distance satisfies the triangle inequality, so I think it's possible to use R[*]-trees, but I have no idea how we're going to make it work for rstar…
Do you think R-trees are a suitable data structure for this purpose?
Well, I can say with certainty that this particular R-Tree implementation is apparently not currently suitable for this purpose. 🤣
Ha!
But more seriously... I've never worked with hamming distance. The definition I searched up was:
While comparing two binary strings of equal length, Hamming distance is the number of bit positions in which the two bits are different.
So, hamming distance is about bitwise distance - not euclidean distance.
In other words:
hamming_distance(0b_1000, 0b_0000) == 1 // Near! hamming_distance(0b_1000, 0b_0111) == 4 // Far!
But with typical numeric distance we get a very different answer:
numeric_distance(0b_1000, 0b_0000) == 8 // Far! numeric_distance(0b_1000, 0b_0111) == 1 // Near!
That all sounds correct.
So intuitively, I don't understand how you'd use an rtree, which deals with euclidean numeric distance, to efficiently compute hamming distance.
I ended up doing
impl PointDistance for Data {
fn distance_2(&self, point: &[f32; 18]) -> f32 {
let hamming_distance: f32 = self
.hash
.as_slice()
.iter()
.zip(point)
.map(|(l, r)| {
// cast l and r to u8, because we know that the values are in the range [0, 255].
let l = *l as u8;
let r = *r as u8;
(l ^ r).count_ones() as f32
})
.sum();
// We must return the squared distance!
hamming_distance * hamming_distance
}
}
but it seems to be quite computationally expensive, and not very efficient.
I used f32
s instead of i32
s, even though I'm only dealing with u8
s, because rstar
doesn't support u8
s out of the box, and because I didn't get attempt to multiply with overflow
when using f32
s.
What's the hash
for?
Oh, so I'm ideally trying to find the closest Hamming distance between &self
's hash
, which is just 18 bytes, and every other hash
in my dataset.
Hm. If you're using the default Hash
impl, you're using a very high-quality (and expensive) hashing algorithm, which I would imagine is complete overkill in this case.
Ah, I see… I’ll look into some different approaches.
https://nnethercote.github.io/2021/12/08/a-brutally-effective-hash-function-in-rust.html FxHash will certainly be faster (https://crates.io/crates/fxhash)
Do we support bnum ? It supports a u256, which should suffice for above calc?
I think before we go ahead and make this crate more difficult to maintain, it should be established whether R* trees are actually suitable to querying Hamming distances.
While it could be innovative cross-pollination between different areas of mathematics, I am somewhat doubtful that the geometric intuitions underpinning the construction of R* trees transfer over to code spaces. Generally, there exist a lot of metric spaces which are decidedly weird compared to the two- or three-dimensional Euclidean space our intuition is still based on even when alternative metrics like Manhattan are used.
So, I think the first step would be to just modify the source of rstar
in a local fork to try to hack around the limits of the current design, e.g. use saturating arithmetic if that is helpful or use an high (e.g. bnum) or arbitrary (e.g. num-bigint) precision number library. Then this hacky local fork can be used to figure out if R* trees can be useful for the given problem and based on these experiences we can decide whether and how to extend this library to support this use case.
(But if the changes are too extensive, the answer might still be no as as maintainers, saying no is often what we need to do to keep complexity in check. We have also have represent the interests of our current user base which I guess is mostly GIS, games and then some simulation stuff in order of popularity.)
Personally, my first guess when reading the original problem would be to look into Levenshtein automata which are used for typo resistant search, i.e. to find a matching term with the fewest possible edits away from the query term.
Oh, I don't think anyone was actually proposing a change. Certainly when I said "make it work" I didn't mean it in the sense of modification.
@hamirmahal I think you can get it working by just converting the [u8; 18] slice into a [f64; 18] slice before storing it in the hash variable. f64 or f32, with float should support the range you need. There'll be some loss in precision in that calculation, but I think it's only used for re-balancing of the r-tree that is constructed (and so must still work, with possibly poorer performance).
@rmanoka I'll give that a look. Thanks for the suggestion.
https://github.com/hamirmahal/rstar-i32/ has some code that reproduces the issue.