LucidDB / luciddb

DEFUNCT: See README
https://github.com/LucidDB/luciddb
Apache License 2.0

[LDB-187] performance improvements needed before re-enabling index-only scans #135

Open dynamobi-build opened 12 years ago

dynamobi-build commented 12 years ago

[reporter="jvs", created="Mon, 3 Nov 2008 10:47:42 -0500 (GMT-05:00)"] Originally from LucidEra JIRA.

dynamobi-build commented 12 years ago

[author="jvs", created="Mon, 3 Nov 2008 10:50:49 -0500 (GMT-05:00)"]


Zelaine Fong - 17/Apr/07 01:46 PM
I've taken a break from looking at the btree corruption problem and have started looking at the performance of key-only scans. Based on profiling an analyze table on the kseq column in the bench1m table (the only one that's normally indexed), it looks like the extra overhead from using key-only scans comes from the restarts on the minus stream. In the case of kseq, since it's a unique key, you end up doing a restart on the deletion index scan stream once per row in the table.


I can optimize the special case where the deletion index is empty, because in that case there's no need to do any restarts since you don't even need to read the stream. With that optimization,


analyze table bench1m compute statistics for column("kseq");


runs in about 5 seconds, whereas previously it took about 15 seconds. Without the index, the statement also takes about 15 seconds.


The question now is: are there any additional special cases worth optimizing? E.g., the case where the deletion index bitmap contains a single segment.
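
For illustration, here is a minimal C++ sketch of the empty-deletion-index special case described above: when the deletion index has no entries, the minus step can pass a key's RIDs through without ever probing (or restarting) the subtrahend stream. The types and names are hypothetical stand-ins, not Fennel's actual exec stream interfaces.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative sketch only -- these types stand in for the real deletion
// index and minus stream; the actual Fennel interfaces differ.
using Rid = uint64_t;

struct DeletionIndex {
    std::vector<Rid> deletedRids;   // assumed sorted ascending
    bool empty() const { return deletedRids.empty(); }
    bool contains(Rid rid) const {  // stands in for a btree restart + probe
        return std::binary_search(deletedRids.begin(), deletedRids.end(), rid);
    }
};

// "Minus" step for one index key: remove deleted RIDs from the key's RID list.
std::vector<Rid> subtractDeleted(const std::vector<Rid> &keyRids,
                                 const DeletionIndex &delIdx)
{
    if (delIdx.empty()) {
        // The special case from the comment above: nothing is deleted,
        // so skip the per-key subtrahend probes (restarts) entirely.
        return keyRids;
    }
    std::vector<Rid> result;
    for (Rid rid : keyRids) {
        if (!delIdx.contains(rid)) {   // each probe models a restart
            result.push_back(rid);
        }
    }
    return result;
}
```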


John Sichi - 17/Apr/07 07:52 PM
For non-null keys in unique indexes, we store at most one deleted entry, but I can't think of any obvious way to take advantage of that.


To generalize the single-segment idea, the way to go is probably to transform the deletion index into some better in-memory form when it can fit into a limited amount of memory (degrading to the current per-key restart in case the memory turns out to be insufficient).


To make the degradation more graceful: back when the index-only scans were being designed, I think John P proposed using a Bloom filter. That could be a win for something like a unique index with only a small percentage of deletions. Build the Bloom filter on the first pass over the deletion index (using whatever fixed amount of memory has been assigned for a bitmap), and for each subsequent key, defer the restart until the Bloom filter test turns up a potential deleted RID (at which point the restart kicks in). Probably easier said than done.
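
A rough sketch of that Bloom-filter idea, assuming a fixed-size bit array and two hash functions; the class and function names here are illustrative, not part of LucidDB.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Hedged sketch: build the filter once from the deletion index, then test
// each key's RID against it and only issue the expensive deletion-index
// restart when the filter reports a possible hit.
using Rid = uint64_t;

class DeletionBloomFilter {
public:
    explicit DeletionBloomFilter(size_t nBits) : bits(nBits, false) {}  // assumes nBits > 0

    void insert(Rid rid) {
        bits[h1(rid) % bits.size()] = true;
        bits[h2(rid) % bits.size()] = true;
    }

    // False positives are possible, false negatives are not.
    bool mayContain(Rid rid) const {
        return bits[h1(rid) % bits.size()] && bits[h2(rid) % bits.size()];
    }

private:
    static size_t h1(Rid rid) { return std::hash<Rid>{}(rid); }
    static size_t h2(Rid rid) { return std::hash<Rid>{}(rid ^ 0x9e3779b97f4a7c15ULL); }
    std::vector<bool> bits;
};

// First pass: populate the filter while scanning the deletion index once.
DeletionBloomFilter buildFilter(const std::vector<Rid> &deletedRids, size_t nBits)
{
    DeletionBloomFilter filter(nBits);
    for (Rid rid : deletedRids) {
        filter.insert(rid);
    }
    return filter;
}

// Later, per key: only fall back to a deletion-index restart when the
// filter says the RID might be deleted.
bool needsRestart(const DeletionBloomFilter &filter, Rid rid)
{
    return filter.mayContain(rid);
}
```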



Zelaine Fong - 18/Apr/07 03:29 PM
Regarding performance, I timed analyze table against the tables in TPC-H (100M), and even with the fix to avoid restarts on the deletion index scan stream in the case where the deletion index is empty, analyze table is slower with index-only scans than without.


Profiling shows that the culprit is repeated calls to LbmEntry::getCompressedRowCount in the construction of the bitmaps in the output tuple from the minus stream by the SegmentWriter class. SegmentWriter tries to construct large bitmaps by using LbmEntry::mergeEntry to combine bitmaps into larger, compressed bitmaps. The problem is that mergeEntry recomputes the rowCount of the current bitmap each time it's called, which entails walking through the entire bitmap entry each time. Some of the indexed columns in the TPC-H schema have low cardinalities, so those columns will generate large bitmaps.


LbmEntry should instead keep the current rowCount in a state variable and then adjust that rowCount based on the new inputTuple being merged in.
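
A simplified sketch of that change, using a stand-in class rather than the real LbmEntry (whose compressed segment layout is more involved): the row count is cached in a member and adjusted on each merge instead of being recomputed by walking the whole entry.

```cpp
#include <cstdint>
#include <vector>

// Illustrative stand-in for LbmEntry; the real class stores compressed
// bitmap segments rather than plain RID vectors.
using Rid = uint64_t;

class BitmapEntrySketch {
public:
    // Old behaviour: an O(entry size) walk on every call, analogous to
    // recomputing the count via getCompressedRowCount in each merge.
    uint64_t recountRows() const {
        uint64_t n = 0;
        for (const auto &segment : segments) {
            n += segment.size();
        }
        return n;
    }

    // Proposed behaviour: adjust the cached count incrementally as each
    // input tuple is merged in.
    void mergeSegment(const std::vector<Rid> &newSegment) {
        segments.push_back(newSegment);
        cachedRowCount += newSegment.size();   // no full re-walk
    }

    uint64_t rowCount() const { return cachedRowCount; }

private:
    std::vector<std::vector<Rid>> segments;   // stand-in for compressed segments
    uint64_t cachedRowCount = 0;
};
```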



Zelaine Fong - 26/Apr/07 05:08 PM
Here's a summary of my performance related findings.


Analyze on the 100M TPC-H LINEITEM table runs in the following times:

- without key-only scans -- 25-26 secs
- with key-only scans -- 30-31 secs

Modifying the minus stream to avoid restarts in the case of an empty deletion index has no impact. It still takes about 30-31 secs. That's likely due to the fact that only 2 of the indexed columns in LINEITEM have a large number of unique values (150K and 20K). The other 8 indexed columns have between 3 and ~2500 unique key values in a table with 600K+ rows.


Profiling indicates one hotspot is computing the rowcount in LbmEntry. Fixing that to store the rowcount in a member variable to avoid unnecessary recounts cuts the time down to 24-25 seconds. However, the rowcount time still shows up in the profile. So, I concluded that the slowdown was due to the minus stream having to tear apart and rebuild large output bitmaps. Changing the minus stream to avoid the teardown and rebuild improves the time to 19-20 seconds. If I go to the further extreme of completely taking out the minus stream, the time goes down to 17-18 seconds.


So, if we are going to improve key-only scans both for the case where you have a large number of unique keys (which currently results in a large number of restarts) and for the case where you have a small number of unique keys (and therefore large bitmaps), we probably need to do the following:


1) Add a Bloom filter to avoid doing restarts of the deletion index when we know we don't have matching rids
2) Combine the counting of rids with the subtracting of deleted rids into a single exec stream to avoid rebuilding the output bitmaps (see the sketch below). This would also avoid the cost of passing around the large bitmaps from the minus stream to the stream that does the counting. Currently, the subtraction is being done in the minus stream and then the actual counting is done downstream in LbmSortedAggExecStream.
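
As a rough illustration of item (2), here is a sketch that folds the subtraction of deleted RIDs and the per-key counting into a single pass, so only (key, count) pairs are emitted and no large output bitmaps are rebuilt or passed downstream. The types are simplified placeholders, not the actual Fennel exec stream classes.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

using Rid = uint64_t;

struct KeyBitmap {
    int64_t keyValue;            // simplified single-column key
    std::vector<Rid> rids;       // RIDs carrying this key value
};

struct KeyCount {
    int64_t keyValue;
    uint64_t rowCount;
};

// One stream does both jobs: subtract deletions *and* count, emitting a
// (key, count) pair per key instead of a rebuilt bitmap.
std::vector<KeyCount> countMinusDeleted(const std::vector<KeyBitmap> &input,
                                        const std::vector<Rid> &deletedRids /* sorted */)
{
    std::vector<KeyCount> out;
    out.reserve(input.size());
    for (const KeyBitmap &kb : input) {
        uint64_t n = 0;
        for (Rid rid : kb.rids) {
            if (!std::binary_search(deletedRids.begin(), deletedRids.end(), rid)) {
                ++n;             // count survivors directly; no bitmap rebuilt
            }
        }
        out.push_back({kb.keyValue, n});
    }
    return out;
}
```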


I've also attached an email exchange between JVS and me with some additional info.


John Sichi - 27/Apr/07 12:28 AM
To estimate the best gain possible, can you tweak it to remove the sorts on top too? The index-only scan will allow us to do that, whereas the rowscan will not. 1gig TPC-H would be better, but maybe you can simulate the scaleup effect by reducing cache size to squeeze down the memory available for hashing and sorting.


In addition to that, can you provide dev time estimates for

- executor changes for (1) (bloom filter)
- optimizer changes to decide whether to use index-only scan if only (1) is available
- executor changes for (2) (combine count with subtraction)

I guess if we have both (1) and (2) no optimizer work is needed (unless we needed to make it cost-based to avoid the medium-cardinality case).


Zelaine Fong - 27/Apr/07 02:40 PM
I've gathered some numbers using TPC-H 1G for the LINEITEM table (6+ million rows). I've summarized the numbers and included time estimates for the different improvements in the attached file named tpch1g.txt.


dynamobi-build commented 12 years ago

[author="zfong", created="Wed, 17 Dec 2008 14:31:06 -0500 (GMT-05:00)"] Index-only scans, when enabled, are always used, provided they match the rule pattern in question. Maybe we should only be using it when the column in question is low cardinality. That way, you minimize the minus stream restart overhead and you only use it in cases where you know you'll be reading a small number of index pages.


However, to do this, we would then have to decide what the cutoff point on cardinality should be.

dynamobi-build commented 12 years ago

[author="zfong", created="Wed, 17 Dec 2008 16:31:14 -0500 (GMT-05:00)"] Hmmm ... scratch my previous comment. Running select count-distinct queries on the various columns in a 1 million row bench table, index-only scans only win for the columns with cardinalities of 2, 4, 5, and 10. And even then, the difference is less than a second. So, without addressing the other overhead associated with index-only scans, its usefulness is still limited.


These are the detailed numbers. The first number in each line is without index-only scans; the second is with them. All numbers are in seconds.


kseq - 5.128 vs 13.963
k2 - .934 vs .219
k4 - .922 vs .238
k5 - .915 vs .275
k10 - .924 vs .59
k25 - .936 vs 1.472
k100 - 1.19 vs 3.06
k1k - .924 vs 3.228
k10k - .966 vs 2.591
k40k - 1.229 vs 1.571
k100k - 1.385 vs 2.161
k250k - 1.578 vs 3.978
k500k - 1.733 vs 6.322

dynamobi-build commented 12 years ago

[author="zfong", created="Fri, 16 Jan 2009 19:21:58 -0500 (GMT-05:00)"] Made changes in the lcs branch in Eigenchange 12235 to add bloom filtering into the minus exec stream to reduce subtrahend restarts.


I ran some timings against a 1 million row bench table with indexes on all columns and deleted rows, comparing row scans with the current index-only scan and the new index-only scan (with various bloom filter bitmap sizes). The details are in the attached spreadsheet (bloomTimings.xls), but here are the highlights for the high cardinality column cases (40K and above):






Index-only scans are still disabled by default in LucidDB. The next step is potentially to enable them selectively, only for high cardinality columns with few or no deleted rows.
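
A possible shape for that selective-enable check, sketched with hypothetical statistics fields and thresholds (the 40K figure echoes the cardinality range mentioned above; the deleted-row fraction is an arbitrary placeholder, and none of these names correspond to LucidDB's actual optimizer interfaces):

```cpp
#include <cstdint>

// Hedged sketch: enable an index-only scan only when the column's
// cardinality is high and the deletion index is small.
struct ColumnStats {
    uint64_t distinctValues;   // estimated column cardinality
    uint64_t deletedRows;      // entries in the deletion index
    uint64_t totalRows;
};

bool shouldUseIndexOnlyScan(const ColumnStats &stats,
                            uint64_t minCardinality = 40000,     // e.g. the 40K+ cases above
                            double maxDeletedFraction = 0.01)    // placeholder threshold
{
    if (stats.totalRows == 0) {
        return false;
    }
    double deletedFraction =
        static_cast<double>(stats.deletedRows) / static_cast<double>(stats.totalRows);
    return stats.distinctValues >= minCardinality
        && deletedFraction <= maxDeletedFraction;
}
```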