KonradHoeffner / hdt

Library for the Header Dictionary Triples (HDT) compression file format for RDF data.
https://crates.io/crates/hdt
MIT License

Code review sophia #28

Closed pchampin closed 1 year ago

pchampin commented 1 year ago

This is an attempt at improving the Sophia implementation in HdtGraph (and also the implementation of Hdt).

The general idea is to use Arc instead of MownStr, because an Arc can be cloned without actually copying the underlying text and does not require a persistent "owner" of the text.

The performance gain is not obvious, because most of the time is spent in FourSectDict::id_to_string, i.e. the reconstruction of the string from the compressed dictionary. But it still seems that with Arc we spend less time allocating memory than with MownStr.
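
For context, a minimal standalone sketch (not code from this PR) of the difference: cloning an Arc<str> only increments a reference count, whereas handing out an owned copy of the text requires allocating and copying the bytes again.

    use std::sync::Arc;

    fn main() {
        // Stand-in for a term string reconstructed by FourSectDict::id_to_string.
        let s: String = "http://example.org/resource/Vincent_Vega".to_string();

        // String -> Arc<str> copies the bytes once into a shared allocation.
        let shared: Arc<str> = s.into();

        // Every clone afterwards only bumps the reference count; the text is never
        // copied again, and no separate "owner" has to outlive the clones.
        let t1 = shared.clone();
        let t2 = shared.clone();
        assert_eq!(Arc::strong_count(&shared), 3);
        assert_eq!(&*t1, &*t2);
    }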

This PR also contains a number of minor improvements in the use of the Sophia API.

KonradHoeffner commented 1 year ago

Benchmarks

before

S??/1.1 (vincent, ?, ?) triple IDs
                        time:   [1.5088 µs 1.5094 µs 1.5101 µs]
S??/1.2 (vincent, ?, ?) str triples
                        time:   [3.2658 µs 3.2679 µs 3.2701 µs]
S??/1.3 (vincent, ?, ?) Sophia triples
                        time:   [4.5061 µs 4.5081 µs 4.5100 µs]

?P? 1429618 triples/2.1 (?, type, ?) triple IDs
                        time:   [347.29 ms 347.52 ms 347.80 ms]
?P? 1429618 triples/2.2 (?, type, ?) str triples
                        time:   [619.79 ms 620.15 ms 620.50 ms]
?P? 1429618 triples/2.3 (?, type, ?) Sophia triples
                        time:   [651.66 ms 651.96 ms 652.25 ms]

??O 1414214 triples/3.1 (?, ?, person) triple IDs
                        time:   [139.74 ms 139.79 ms 139.84 ms]
??O 1414214 triples/3.2 (?, ?, person) str triples
                        time:   [469.73 ms 469.80 ms 469.87 ms]
??O 1414214 triples/3.3 (?, ?, person) Sophia triples
                        time:   [495.47 ms 495.55 ms 495.62 ms]

?PO 1414214 triples/4.1 (?, type, person) triple IDs
                        time:   [117.97 ms 118.02 ms 118.08 ms]
?PO 1414214 triples/4.2 (?, type, person) str subjects
                        time:   [312.98 ms 313.03 ms 313.08 ms]
?PO 1414214 triples/4.3 (?, type, person) str triples
                        time:   [323.11 ms 323.24 ms 323.38 ms]
?PO 1414214 triples/4.4 (?, type, person) Sophia triples
                        time:   [351.52 ms 351.66 ms 351.81 ms]

after

S??/1.1 (vincent, ?, ?) triple IDs
                        time:   [1.4797 µs 1.4815 µs 1.4835 µs]
                        change: [-1.7325% -1.5302% -1.3180%] (p = 0.00 < 0.05)
                        Performance has improved.
S??/1.2 (vincent, ?, ?) str triples
                        time:   [3.5434 µs 3.5484 µs 3.5539 µs]
                        change: [+8.9300% +9.1300% +9.3659%] (p = 0.00 < 0.05)
                        Performance has regressed.
S??/1.3 (vincent, ?, ?) Sophia triples
                        time:   [4.1382 µs 4.1424 µs 4.1472 µs]
                        change: [-8.1291% -7.9905% -7.8638%] (p = 0.00 < 0.05)
                        Performance has improved.

?P? 1429618 triples/2.1 (?, type, ?) triple IDs
                        time:   [347.97 ms 348.47 ms 349.09 ms]
                        change: [+0.1106% +0.2719% +0.4549%] (p = 0.01 < 0.05)
                        Change within noise threshold.
?P? 1429618 triples/2.2 (?, type, ?) str triples
                        time:   [649.61 ms 650.02 ms 650.51 ms]
                        change: [+4.7266% +4.8163% +4.9268%] (p = 0.00 < 0.05)
                        Performance has regressed.
?P? 1429618 triples/2.3 (?, type, ?) Sophia triples
                        time:   [724.92 ms 726.32 ms 728.73 ms]
                        change: [+11.177% +11.406% +11.761%] (p = 0.00 < 0.05)
                        Performance has regressed.

??O 1414214 triples/3.1 (?, ?, person) triple IDs
                        time:   [140.49 ms 140.60 ms 140.71 ms]
                        change: [+0.4949% +0.5784% +0.6626%] (p = 0.00 < 0.05)
                        Change within noise threshold.
??O 1414214 triples/3.2 (?, ?, person) str triples
                        time:   [508.25 ms 509.08 ms 509.98 ms]
                        change: [+8.1956% +8.3621% +8.5548%] (p = 0.00 < 0.05)
                        Performance has regressed.
??O 1414214 triples/3.3 (?, ?, person) Sophia triples
                        time:   [569.15 ms 569.64 ms 570.18 ms]
                        change: [+14.865% +14.951% +15.058%] (p = 0.00 < 0.05)
                        Performance has regressed.

?PO 1414214 triples/4.1 (?, type, person) triple IDs
                        time:   [118.33 ms 118.59 ms 118.97 ms]
                        change: [+0.2940% +0.4841% +0.6997%] (p = 0.00 < 0.05)
                        Change within noise threshold.
?PO 1414214 triples/4.2 (?, type, person) str subjects
                        time:   [315.39 ms 315.82 ms 316.28 ms]
                        change: [+0.7473% +0.8894% +1.0319%] (p = 0.00 < 0.05)
                        Change within noise threshold.
?PO 1414214 triples/4.3 (?, type, person) str triples
                        time:   [344.67 ms 344.81 ms 344.95 ms]
                        change: [+6.6081% +6.6732% +6.7300%] (p = 0.00 < 0.05)
                        Performance has regressed.
?PO 1414214 triples/4.4 (?, type, person) Sophia triples
                        time:   [366.34 ms 367.00 ms 367.65 ms]
                        change: [+4.1570% +4.3632% +4.5522%] (p = 0.00 < 0.05)
                        Performance has regressed.
KonradHoeffner commented 1 year ago

Benchmarking ??? pattern (all triples) on i9-10900k

Before

??? (all)/0.1 all triple IDs
                        time:   [1.6171 s 1.6180 s 1.6189 s]
??? (all)/0.2 all str triples
                        time:   [6.3724 s 6.3750 s 6.3785 s]
??? (all)/0.3 all Sophia triples
                        time:   [7.8375 s 7.8492 s 7.8624 s]

After

??? (all)/0.1 all triple IDs
                        time:   [1.6182 s 1.6336 s 1.6480 s]
                        change: [+0.1297% +0.9638% +1.8168%] (p = 0.07 > 0.05)
                        No change in performance detected.
??? (all)/0.2 all str triples
                        time:   [7.1611 s 7.2483 s 7.3353 s]
                        change: [+12.424% +13.699% +14.932%] (p = 0.00 < 0.05)
                        Performance has regressed.
??? (all)/0.3 all Sophia triples
                        time:   [8.5335 s 8.5736 s 8.6102 s]
                        change: [+8.7127% +9.2289% +9.6978%] (p = 0.00 < 0.05)
                        Performance has regressed.

Comparison

It seems to be around 10% slower even when the triple cache is used, which surprises me. The triple ID measurements didn't change, which is expected because the code for them hasn't changed either.

KonradHoeffner commented 1 year ago

Maybe the idx variable needs to be updated in the else part of the get_x_string function?

KonradHoeffner commented 1 year ago
    fn get_x_string(&mut self, i: usize, pos: usize, kind: &'static IdKind) -> Result<Arc<str>, DictErr> {
        debug_assert!(i != 0);
        if self.idx[pos] == i {
            // cache hit: reuse the Arc decoded for this ID on the previous call
            Ok(self.arc[pos].as_ref().unwrap().clone())
        } else {
            // cache miss: decode the string from the dictionary and remember it
            let ret: Arc<str> = self.hdt.dict.id_to_string(i, kind)?.into();
            self.arc[pos] = Some(ret.clone());
+           self.idx[pos] = i;
            Ok(ret)
        }
    }

??? (all)/0.1 all triple IDs
                        time:   [1.6308 s 1.6329 s 1.6366 s]
                        change: [+0.7702% +0.9257% +1.1700%] (p = 0.00 < 0.05)
                        Change within noise threshold.
??? (all)/0.2 all str triples
                        time:   [5.8885 s 5.8956 s 5.9026 s]
                        change: [-7.6339% -7.5201% -7.3964%] (p = 0.00 < 0.05)
                        Performance has improved.
??? (all)/0.3 all Sophia triples
                        time:   [6.9292 s 6.9454 s 6.9653 s]
                        change: [-11.755% -11.514% -11.252%] (p = 0.00 < 0.05)
                        Performance has improved.

Now it's better, but I wonder whether it is possible to avoid losing ~10% performance in the first place; then the triple cache would bring an even larger speedup.

pchampin commented 1 year ago

Maybe the idx variable needs to be updated in the else part of the get_x_string function?

Duh! So silly of me... :facepalm: Thanks for fixing this.

KonradHoeffner commented 1 year ago

Cases

After benchmarking and thinking about this there seem to be the following cases:

  1. triple pattern with three constants, SPO: this is just an ASK query, so performance shouldn't differ much regardless of the implementation.
  2. triple patterns with two constants and one variable, SP?, S?O, ?PO: Arc won't help because the variable never has the same value twice, and the constants we just want to return from the parameters, for which MownStr is perfect. This case is my biggest worry, because these queries can take a long time to execute and I don't like losing 10% performance because of Arc over MownStr. I don't understand why the overhead of Arc is that high; I wonder if there is some way to mitigate it.
  3. triple patterns with one constant and two adjacent variables, S?? and ??O: there is one constant which we want to pass back, one variable which is easy to deduplicate (P in S?? and S in ??O), and one variable which is hard to deduplicate (a sketch of the deduplication idea follows this list).
  4. triple pattern with one constant and two disjoint variables, ?P?: there is one constant that we want to pass back and two variables that are hard to deduplicate.
  5. triple pattern with three variables, ???: the best case for Arc, but probably not used that often; it is equivalent to the triples() method.
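
To make the deduplication in case 3 concrete, here is a self-contained sketch (not the actual HdtGraph code) of the one-entry-per-position cache idea also used by get_x_string above: because the triple IDs arrive sorted, the same ID repeats in runs, so remembering the last decoded (id, Arc<str>) pair avoids hitting the compressed dictionary again. The decode closure is a hypothetical stand-in for FourSectDict::id_to_string.

    use std::sync::Arc;

    /// One-entry cache for a single triple position: remembers the last (id, string) pair.
    struct PosCache {
        last_id: usize,
        last_str: Option<Arc<str>>,
    }

    impl PosCache {
        fn new() -> Self {
            PosCache { last_id: 0, last_str: None }
        }

        /// Returns the cached string if `id` repeats, otherwise decodes and caches it.
        fn get(&mut self, id: usize, decode: impl Fn(usize) -> String) -> Arc<str> {
            if self.last_id == id {
                if let Some(s) = &self.last_str {
                    return s.clone(); // cheap: only bumps the Arc reference count
                }
            }
            let s: Arc<str> = decode(id).into();
            self.last_id = id;
            self.last_str = Some(s.clone());
            s
        }
    }

    fn main() {
        // Sorted triple IDs from an S?? query: the subject ID is constant and the
        // predicate ID repeats in runs, so both benefit from the cache; the object
        // ID rarely repeats and would mostly miss.
        let ids = [(1, 3, 7), (1, 3, 9), (1, 4, 2)];
        let decode = |id: usize| format!("http://example.org/term/{id}"); // stand-in decoder
        let mut subject_cache = PosCache::new();
        for (s, _p, _o) in ids {
            let subject = subject_cache.get(s, &decode);
            println!("{subject}");
        }
    }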

Ideas for Solutions (numbering unrelated to the cases above)

  1. Use Arc as is: just accept the tradeoff of worse performance in some cases for benefits in others.
  2. Keep MownStr and keep the good performance on two-variable queries, but lose the potential benefits in the other cases (especially the two-variable cases would need to be deduplicated and benchmarked).
  3. Return normal references instead of Arc, together with some kind of string pool. However, this is hard to implement because Rust's borrow checker does not allow returning references whose lifetime is tied to something in the same struct. I tried passing the pool as an initially empty Vec parameter, but it is not allowed to modify a Vec while there are references into it. This also gets complicated with iterators. (A minimal sketch of this borrow-checker conflict follows the list.)
  4. Create a new MownStr variant that, when cloned, does not clone the Box (which is slow) but instead creates a reference to the Box.
  5. Try Rc instead of Arc; but that may harm applications that use multithreading.
  6. Don't try to optimize it and instead tell users to use specific methods for each case when performance is an issue. This is feasible for Hdt but not for the Sophia HdtGraph.

I favor solution 4, but I'm not sure whether that is possible in Rust; my second choice is solution 3, but that may be difficult to do as well.
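
To illustrate the borrow-checker conflict behind solution 3, here is a minimal sketch with a hypothetical string pool (not code from this PR): handing out a reference into the pool keeps the pool borrowed, so it cannot be mutated again while that reference is alive.

    // Hypothetical string pool illustrating solution 3's problem.
    struct Pool {
        strings: Vec<String>,
    }

    impl Pool {
        /// The returned &str borrows from `self`, so the pool stays borrowed
        /// for as long as the caller holds on to the reference.
        fn intern(&mut self, s: &str) -> &str {
            self.strings.push(s.to_string());
            self.strings.last().unwrap()
        }
    }

    fn main() {
        let mut pool = Pool { strings: Vec::new() };
        let a = pool.intern("subject");
        // Uncommenting the next line is rejected with E0499 ("cannot borrow `pool`
        // as mutable more than once at a time") because `a` still borrows the pool,
        // which is exactly the "modify a Vec while there are references to it" issue:
        // let b = pool.intern("predicate");
        println!("{a}");
    }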

KonradHoeffner commented 1 year ago

I just forked your MownStr repository because I wanted to change clone() so that it always returns a borrowed MownStr, but then I found the borrowed method. Won't that solve our use case without needing Arc?

KonradHoeffner commented 1 year ago

Unfortunately, borrowed returns MownStr instead of MownStr<'a>, so it can't be returned, but I successfully added the lifetime to a local copy of MownStr. Is this feasible to add, or will it break something?

KonradHoeffner commented 1 year ago

I benchmarked it on my laptop without Sophia (due to MownStr incompatibilities with the locally patched version), and strangely there wasn't much difference between using borrowed() and clone():

 cargo bench  -- --load-baseline borrow  --baseline main                                                                                                                                                                   mownstr-borrow
    Finished bench [optimized] target(s) in 0.04s
     Running benches/bench.rs (target/release/deps/bench-561ba078d93cf93c)
??? (all)/0.1 all triple IDs
                        time:   [2.0441 s 2.0450 s 2.0458 s]
                        change: [-1.4375% -0.9551% -0.5860%] (p = 0.00 < 0.05)
                        Change within noise threshold.
??? (all)/0.2 all str triples
                        time:   [8.0467 s 8.0638 s 8.0855 s]
                        change: [-1.1257% -0.6344% -0.1437%] (p = 0.03 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) high severe

S??/1.1 (vincent, ?, ?) triple IDs
                        time:   [2.1847 µs 2.1909 µs 2.2014 µs]
                        change: [-0.2736% -0.0057% +0.2519%] (p = 0.97 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe
S??/1.2 (vincent, ?, ?) str triples
                        time:   [5.1951 µs 5.2018 µs 5.2099 µs]
                        change: [+0.3254% +0.6764% +1.0194%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  6 (6.00%) high mild
  5 (5.00%) high severe

?P? 1429618 triples/2.1 (?, type, ?) triple IDs
                        time:   [513.92 ms 514.61 ms 515.54 ms]
                        change: [-0.4190% -0.2138% -0.0151%] (p = 0.06 > 0.05)
                        No change in performance detected.
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) high mild
  1 (10.00%) high severe
?P? 1429618 triples/2.2 (?, type, ?) str triples
                        time:   [968.11 ms 969.58 ms 971.04 ms]
                        change: [+0.2843% +0.4650% +0.6533%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) low mild
  1 (10.00%) high mild

??O 1414214 triples/3.1 (?, ?, person) triple IDs
                        time:   [209.90 ms 210.17 ms 210.62 ms]
                        change: [+0.0380% +0.1802% +0.3899%] (p = 0.03 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe
??O 1414214 triples/3.2 (?, ?, person) str triples
                        time:   [746.20 ms 754.78 ms 765.05 ms]
                        change: [+1.2616% +2.5159% +3.7646%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 23 outliers among 100 measurements (23.00%)
  3 (3.00%) high mild
  20 (20.00%) high severe

?PO 1414214 triples/4.1 (?, type, person) triple IDs
                        time:   [174.76 ms 174.93 ms 175.38 ms]
                        change: [+1.0783% +1.6029% +2.2160%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) high severe
?PO 1414214 triples/4.2 (?, type, person) str subjects
                        time:   [468.19 ms 469.26 ms 470.53 ms]
                        change: [-0.7916% -0.5703% -0.2976%] (p = 0.00 < 0.05)
                        Change within noise threshold.
?PO 1414214 triples/4.3 (?, type, person) str triples
                        time:   [471.40 ms 471.85 ms 472.34 ms]
                        change: [-2.6150% -2.4799% -2.3429%] (p = 0.00 < 0.05)
                        Performance has improved.
pchampin commented 1 year ago

Is this feasible to add or will this break something?

It will break something. Consider the following code (assuming that borrowed returns MownStr<'a>):

    let m1: MownStr<'static>;
    {
        let m2: MownStr<'static> = "hello world".to_string().into();
        m1 = m2.borrowed();
        println!("{}", m1);
    }
    println!("{}", m1);

It would compile, but the second println! would print garbage, because the text owned by m2 and borrowed by m1 is deallocated at the closing curly bracket...

KonradHoeffner commented 1 year ago

This is difficult to benchmark with Criterion.rs due to the variance in runtimes and the long overall duration caused by the multiple runs. In addition to Criterion.rs, I therefore benchmarked some cases with Iai, which uses Valgrind to estimate the number of CPU cycles.
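
For reference, here is a minimal sketch of what such an Iai benchmark could look like; it is not the repository's actual bench file, and the file path, setup, and function names are hypothetical. Iai runs each function once under Valgrind, so a separate load-only benchmark can be measured and its cycle count subtracted to isolate the query cost, which is what the "− load" columns below show.

    // Hypothetical Iai benchmark sketch; the real bench file, path, and names differ.
    use std::fs::File;
    use std::io::BufReader;

    use hdt::Hdt;

    const PATH: &str = "tests/resources/persondata_en_10k.hdt"; // hypothetical path

    // Measured separately so its cycle count can be subtracted from the other benchmarks.
    fn load_only() -> Hdt {
        Hdt::new(BufReader::new(File::open(PATH).expect("HDT file"))).expect("valid HDT")
    }

    // Loading plus iterating all triples (the ??? pattern via the triples() method).
    fn load_and_iterate_all() -> usize {
        load_only().triples().count()
    }

    // Iai executes each function once under Valgrind/Cachegrind and reports estimated cycles.
    iai::main!(load_only, load_and_iterate_all);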

persondata 10k triples

(Columns: estimated CPU cycles, cycles with the load-only cost subtracted, and the ratio of the latter to the mownstr baseline.)

| query | variant | cycles | cycles − load | (cycles − load) / mownstr |
|---|---|---|---|---|
| po sophia | mownstr | 32025873 | 6453360 | 1 |
| po sophia | rc | 31805323 | 6232810 | 0.9658240049 |
| po sophia | arc | 31831044 | 6258531 | 0.9698096805 |
| po | mownstr | 30946097 | 5373584 | 1 |
| po | rc | 31146148 | 5573635 | 1.037228598 |
| po | arc | 31195529 | 5623016 | 1.046418182 |
| o sophia | mownstr | 35733746 | 10161233 | 1 |
| o sophia | rc | 36615963 | 11043450 | 1.086821845 |
| o sophia | arc | 36664104 | 11091591 | 1.091559558 |
| o | mownstr | 34930677 | 9358164 | 1 |
| o | rc | 35477675 | 9905162 | 1.058451423 |
| o | arc | 35528558 | 9956045 | 1.063888707 |
| all sophia | mownstr | 168333124 | 142760611 | 1 |
| all sophia | rc | 144665536 | 119093023 | 0.8342148592 |
| all sophia | arc | 145241514 | 119669001 | 0.8382494314 |
| all | mownstr | 143615972 | 118043459 | 1 |
| all | rc | 126231005 | 100658492 | 0.8527240124 |
| all | arc | 126843252 | 101270739 | 0.8579106361 |

rc / arc (on cycles − load): po sophia 0.9958902496, po 0.9912180581, o sophia 0.9956596849, o 0.9948892356, all sophia 0.9951869073, all 0.9939543544

persondata 100k triples

| query | variant | cycles | cycles − load | (cycles − load) / mownstr |
|---|---|---|---|---|
| po sophia | mownstr | 302179289 | 64166525 | 1 |
| po sophia | rc | 299909076 | 61896312 | 0.964619979 |
| po sophia | arc | 300229249 | 62216485 | 0.9696096991 |

rc / arc (on cycles − load): 0.994853888

persondata 1M triples

| query | variant | cycles | cycles − load | (cycles − load) / mownstr |
|---|---|---|---|---|
| po sophia | mownstr | 3017508492 | 697058710 | 1 |
| po sophia | rc | 2993299362 | 672849580 | 0.9652695969 |
| po sophia | arc | 2996034366 | 675584584 | 0.9691932319 |

rc / arc (on cycles − load): 0.9959516483

Conclusions

KonradHoeffner commented 1 year ago

Enabling the triple cache for all patterns with at least two variables gave a massive boost to the tested ??O query (down to 69% of the mownstr baseline):

| query | variant | cycles | cycles − load | (cycles − load) / mownstr |
|---|---|---|---|---|
| o | mownstr | 34930677 | 9358164 | 1 |
| o | rc | 35477675 | 9905162 | 1.058451423 |
| o | arc | 35528558 | 9956045 | 1.063888707 |
| o | cached arc | 31998853 | 6426340 | 0.686709487 |

rc / arc (on cycles − load): 0.9948892356