NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[FEA] Spill performance with compression enabled #9596

Open abellina opened 11 months ago

abellina commented 11 months ago

I tested the worst case for spill by changing our plugin so that it spills every buffer (https://github.com/abellina/spark-rapids/commit/84e18f156a3478139164f5630ede4327f9655732), together with this patch: https://github.com/NVIDIA/spark-rapids/pull/9454. I then made sure that every buffer was spilled from device to disk, forcing compression and decompression to happen. Note that we don't run with unspill on by default, so every time a buffer is read we read it compressed from disk and have to decompress it.

I ran NDS @ 3TB in our performance cluster in this mode. Our performance cluster has really great bandwidth to disk and lots of host memory. Here, the compression-enabled run was about 2x slower than the run without compression. I would categorize this as the worst-case scenario, both because of the spill-everything behavior simulated here and because anything we do adds overhead in this environment, given the fast IO.

  1. We may observe different results with slower IO or restricted host memory (a cloud VM would be good to try), where performance with compression may be on par with, or even beat, the uncompressed case. We should find examples where compression benefits performance and use this data to give the auto tuner thresholds it could use, likely relative to disk bandwidth, overall capacity, and the amount of spill detected in application history logs.

  2. We should also further study the results in our performance cluster and see if we can speed up spill. For example, right now spill is done completely serially, one buffer at a time, but we know that LZ4's single-threaded bandwidth is a limiting factor (cf. multi-threaded shuffle). We could employ similar mechanisms at spill time to speed up compression (and also encryption) when we have several blocks to spill. A related issue is https://github.com/NVIDIA/spark-rapids/issues/7666: if you look at the stack traces, that is what is really holding everything up (all tasks end up waiting for the catalog lock because one task is spilling). So there are steps we can take, and we can re-run the spill-all case to see whether we improve our worst case.
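
To illustrate the idea in item 2, here is a minimal Python sketch of compressing several spill buffers concurrently instead of one at a time. It is not the plugin's code: `zlib` stands in for LZ4 (which is not in the Python standard library), and the function names and thread count are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_one(buf: bytes) -> bytes:
    # zlib level 1 is used here only as a stand-in for a fast codec such as LZ4.
    return zlib.compress(buf, 1)


def compress_serial(buffers):
    # Serial spill path: one buffer at a time.
    return [compress_one(b) for b in buffers]


def compress_parallel(buffers, threads=4):
    # Hypothetical parallel spill path: a thread pool compresses several
    # buffers at once. pool.map returns results in input order.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(compress_one, buffers))


if __name__ == "__main__":
    # Eight synthetic 1 MiB buffers standing in for spillable blocks.
    buffers = [bytes([i]) * (1 << 20) for i in range(8)]
    parallel = compress_parallel(buffers)
    # Both paths must produce identical output, and it must round-trip.
    assert parallel == compress_serial(buffers)
    assert [zlib.decompress(c) for c in parallel] == buffers
    print("compressed", len(buffers), "buffers")
```

Whether this helps in practice depends on how many blocks are available to spill at once and on whatever lock is held during the spill, which is the concern raised in issue 7666.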

Name = benchmark
Means = 3802000.0, 7187000.0
Time diff = -3385000.0
Speedup = 0.5290107137887853
T-Test (test statistic, p value, df) = -27.525783960660327, 0.02311795793190776, 1.0
T-Test Confidence Interval = -4947553.244415962, -1822446.7555840379
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
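
As a sanity check on how to read this output, the sketch below reproduces the reported speedup, p-value, and confidence interval from the other reported fields. It assumes the speedup is the first mean divided by the second (presumably no-compression over compression) and relies on the fact that a t distribution with df = 1 is a Cauchy distribution. The function names are hypothetical, and this is not the benchmark tool's own code.

```python
import math


def speedup(first_mean: float, second_mean: float) -> float:
    # Ratio of the two reported means; values below 1 indicate a regression.
    return first_mean / second_mean


def p_value_df1(t_stat: float) -> float:
    # Two-sided p-value for a t distribution with 1 degree of freedom (Cauchy):
    # p = 1 - (2/pi) * atan(|t|) = (2/pi) * atan(1/|t|)
    return (2.0 / math.pi) * math.atan(1.0 / abs(t_stat))


def confidence_interval_df1(diff: float, t_stat: float, conf: float = 0.95):
    # Standard error recovered from the reported difference and t statistic.
    std_err = abs(diff / t_stat)
    # Critical value of the df=1 t distribution via the Cauchy quantile function.
    t_crit = math.tan(math.pi * conf / 2.0)
    return diff - t_crit * std_err, diff + t_crit * std_err


if __name__ == "__main__":
    print(speedup(3802000.0, 7187000.0))
    print(p_value_df1(-27.525783960660327))
    print(confidence_interval_df1(-3385000.0, -27.525783960660327))
```

The reported df of 1.0 would be consistent with very few runs per configuration, which is why the confidence intervals are so wide relative to the differences.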

In terms of the amount spilled: with compression we spilled 12TB; without compression we spilled 25TB.
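
These two figures suggest a compression ratio of roughly 12/25 = 0.48, which can feed a rough break-even estimate for when compression should pay off. The sketch below is a deliberately simplified model with stated assumptions: spill is serial, compression and disk writes do not overlap, only the write side is counted, and the bandwidth values are hypothetical placeholders rather than measurements.

```python
def spill_time(size, disk_bw, comp_bw=None, ratio=1.0):
    # Time to (optionally) compress `size` bytes and then write the result.
    # Units only need to be consistent, e.g. GB and GB/s.
    compress_time = size / comp_bw if comp_bw else 0.0
    return compress_time + (size * ratio) / disk_bw


def break_even_disk_bw(comp_bw, ratio):
    # Compression wins when size/comp_bw < (1 - ratio) * size/disk_bw,
    # i.e. when disk_bw < comp_bw * (1 - ratio).
    return comp_bw * (1.0 - ratio)


if __name__ == "__main__":
    ratio = 12 / 25   # from the spill totals above
    comp_bw = 1.0     # hypothetical compressor throughput, GB/s
    threshold = break_even_disk_bw(comp_bw, ratio)
    for disk_bw in (0.25, threshold, 2.0):  # hypothetical disk bandwidths, GB/s
        plain = spill_time(100.0, disk_bw)
        packed = spill_time(100.0, disk_bw, comp_bw, ratio)
        print(disk_bw, plain, packed)
```

Under this model, compression only helps when disk bandwidth is below about half of the compressor's throughput, which fits the expectation that a cluster with fast disks is the worst case for it.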

Full results

```
Name = query1
Means = 6415.0, 12243.0
Time diff = -5828.0
Speedup = 0.5239728824634485
T-Test (test statistic, p value, df) = -46.733296789404704, 0.013620323808451023, 1.0
T-Test Confidence Interval = -7412.561036590835, -4243.438963409165
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query3
Means = 8656.0, 12730.0
Time diff = -4074.0
Speedup = 0.6799685781618224
T-Test (test statistic, p value, df) = -196.01041638987795, 0.0032478592763304165, 1.0
T-Test Confidence Interval = -4338.093506098472, -3809.9064939015275
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query4
Means = 153913.0, 266906.5
Time diff = -112993.5
Speedup = 0.5766551208007299
T-Test (test statistic, p value, df) = -39.287460187701285, 0.01620064873435169, 1.0
T-Test Confidence Interval = -149537.43890637613, -76449.56109362387
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query6
Means = 2076.0, 3271.5
Time diff = -1195.5
Speedup = 0.634571297569922
T-Test (test statistic, p value, df) = -14.530994669814685, 0.04374219577439338, 1.0
T-Test Confidence Interval = -2240.870128306454, -150.12987169354642
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query7
Means = 34275.0, 75803.0
Time diff = -41528.0
Speedup = 0.4521588855322349
T-Test (test statistic, p value, df) = -41.196223331454945, 0.015450318657754002, 1.0
T-Test Confidence Interval = -54336.53504577591, -28719.464954224088
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query9
Means = 53082.0, 136078.5
Time diff = -82996.5
Speedup = 0.39008366494339664
T-Test (test statistic, p value, df) = -12.810600619381573, 0.04959419426750873, 1.0
T-Test Confidence Interval = -165316.64663011135, -676.3533698886458
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query11
Means = 78715.0, 144497.0
Time diff = -65782.0
Speedup = 0.5447517941548959
T-Test (test statistic, p value, df) = -19.47654123478562, 0.032657812876891255, 1.0
T-Test Confidence Interval = -108697.19474100177, -22866.805258998225
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query12
Means = 964.0, 1517.0
Time diff = -553.0
Speedup = 0.6354647330257086
T-Test (test statistic, p value, df) = -31.9274698861863, 0.019933045789918242, 1.0
T-Test Confidence Interval = -773.077921748727, -332.9220782512729
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query14_part1
Means = 115181.0, 221352.0
Time diff = -106171.0
Speedup = 0.5203521992121146
T-Test (test statistic, p value, df) = -18.30879791819945, 0.034736734446920055, 1.0
T-Test Confidence Interval = -179853.08820147382, -32488.911798526184
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query14_part2
Means = 109677.0, 206264.0
Time diff = -96587.0
Speedup = 0.5317311794593337
T-Test (test statistic, p value, df) = -21.01942346408533, 0.030264394239590434, 1.0
T-Test Confidence Interval = -154973.6726399373, -38200.327360062714
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query16
Means = 52068.0, 148311.0
Time diff = -96243.0
Speedup = 0.3510730829136072
T-Test (test statistic, p value, df) = -22.173153215330068, 0.02869184503407946, 1.0
T-Test Confidence Interval = -151394.527190231, -41091.472809769
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query17
Means = 7584.0, 13227.0
Time diff = -5643.0
Speedup = 0.5733726468586982
T-Test (test statistic, p value, df) = -814.4968922592645, 0.0007816106587309643, 1.0
T-Test Confidence Interval = -5731.031168699491, -5554.968831300509
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query22
Means = 8517.0, 16893.0
Time diff = -8376.0
Speedup = 0.5041733262297994
T-Test (test statistic, p value, df) = -19.038920687922463, 0.033407109622622284, 1.0
T-Test Confidence Interval = -13965.979212417667, -2786.0207875823335
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query23_part1
Means = 209113.0, 519179.5
Time diff = -310066.5
Speedup = 0.4027759185406974
T-Test (test statistic, p value, df) = -206.35962794430557, 0.0030849774036093864, 1.0
T-Test Confidence Interval = -329158.25971170206, -290974.74028829794
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query24_part1
Means = 113715.0, 196010.5
Time diff = -82295.5
Speedup = 0.5801474920986376
T-Test (test statistic, p value, df) = -20.608687520318735, 0.030866635017350957, 1.0
T-Test Confidence Interval = -133034.46485916903, -31556.53514083098
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query24_part2
Means = 117606.0, 198203.5
Time diff = -80597.5
Speedup = 0.5933598548966088
T-Test (test statistic, p value, df) = -24.44601435303959, 0.02602735258808861, 1.0
T-Test Confidence Interval = -122489.33240487019, -38705.6675951298
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query25
Means = 6139.0, 9359.5
Time diff = -3220.5
Speedup = 0.6559111063625194
T-Test (test statistic, p value, df) = -19.07032350692502, 0.03335219931563161, 1.0
T-Test Confidence Interval = -5366.259737050089, -1074.740262949911
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query26
Means = 16413.0, 40341.0
Time diff = -23928.0
Speedup = 0.4068565479289061
T-Test (test statistic, p value, df) = -23.179257116055982, 0.02744804303055482, 1.0
T-Test Confidence Interval = -37044.64413622413, -10811.355863775869
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query29
Means = 13873.0, 23951.0
Time diff = -10078.0
Speedup = 0.5792242495094151
T-Test (test statistic, p value, df) = -138.53657173554876, 0.004595239422516207, 1.0
T-Test Confidence Interval = -11002.327271344653, -9153.672728655347
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query31
Means = 14874.0, 21506.0
Time diff = -6632.0
Speedup = 0.6916209429926532
T-Test (test statistic, p value, df) = -46.132373316452984, 0.0137976878869646, 1.0
T-Test Confidence Interval = -8458.646750514436, -4805.353249485565
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query33
Means = 3552.0, 5333.0
Time diff = -1781.0
Speedup = 0.6660416276017251
T-Test (test statistic, p value, df) = -20.984914886259663, 0.030314087294278914, 1.0
T-Test Confidence Interval = -2859.3818165687626, -702.6181834312374
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query35
Means = 9942.0, 13643.5
Time diff = -3701.5
Speedup = 0.7286986477076997
T-Test (test statistic, p value, df) = -83.80635378060391, 0.007595958211485061, 1.0
T-Test Confidence Interval = -4262.698700459254, -3140.301299540746
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query36
Means = 20697.0, 37183.0
Time diff = -16486.0
Speedup = 0.5566253395368851
T-Test (test statistic, p value, df) = -85.74951835910063, 0.0074238424536921714, 1.0
T-Test Confidence Interval = -18928.86493141087, -14043.13506858913
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query38
Means = 27160.0, 43873.0
Time diff = -16713.0
Speedup = 0.6190595582704624
T-Test (test statistic, p value, df) = -16.162906279675404, 0.039337561907621145, 1.0
T-Test Confidence Interval = -29851.651928399006, -3574.3480716009963
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query40
Means = 4458.0, 7929.0
Time diff = -3471.0
Speedup = 0.5622398789254635
T-Test (test statistic, p value, df) = -250.4978480446489, 0.0025414046290069396, 1.0
T-Test Confidence Interval = -3647.0623373989815, -3294.9376626010185
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query42
Means = 2535.0, 3865.0
Time diff = -1330.0
Speedup = 0.6558861578266494
T-Test (test statistic, p value, df) = -127.97930967036704, 0.0049742948155542515, 1.0
T-Test Confidence Interval = -1462.0467530492363, -1197.9532469507637
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query43
Means = 15179.0, 25691.0
Time diff = -10512.0
Speedup = 0.5908294733564283
T-Test (test statistic, p value, df) = -13.079969891640832, 0.04857685119061344, 1.0
T-Test Confidence Interval = -20723.615569140937, -300.3844308590651
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query45
Means = 3242.0, 5554.0
Time diff = -2312.0
Speedup = 0.583723442563918
T-Test (test statistic, p value, df) = -34.226508265805506, 0.01859490638348658, 1.0
T-Test Confidence Interval = -3170.3038948200356, -1453.6961051799644
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query49
Means = 5589.0, 8519.0
Time diff = -2930.0
Speedup = 0.6560629181828853
T-Test (test statistic, p value, df) = -187.95958763617816, 0.003386971496688935, 1.0
T-Test Confidence Interval = -3128.0701295738545, -2731.9298704261455
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query50
Means = 40862.0, 54916.5
Time diff = -14054.5
Speedup = 0.7440750958273015
T-Test (test statistic, p value, df) = -25.719078790255452, 0.02474035940893158, 1.0
T-Test Confidence Interval = -20997.958431172337, -7111.041568827662
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query53
Means = 18280.0, 29691.5
Time diff = -11411.5
Speedup = 0.615664415741879
T-Test (test statistic, p value, df) = -16.207706265331893, 0.0392291037318845, 1.0
T-Test Confidence Interval = -20357.667519085753, -2465.332480914247
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query59
Means = 54282.0, 82104.0
Time diff = -27822.0
Speedup = 0.6611370944168372
T-Test (test statistic, p value, df) = -180.48358639768276, 0.00352726400587758, 1.0
T-Test Confidence Interval = -29780.69350356367, -25863.30649643633
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query61
Means = 5088.0, 8353.0
Time diff = -3265.0
Speedup = 0.6091224709685144
T-Test (test statistic, p value, df) = -117.81553930650801, 0.005403399998578176, 1.0
T-Test Confidence Interval = -3617.124674797963, -2912.875325202037
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query62
Means = 65220.0, 135827.5
Time diff = -70607.5
Speedup = 0.48016785996944655
T-Test (test statistic, p value, df) = -26.600495355175532, 0.023921363714386055, 1.0
T-Test Confidence Interval = -104334.44150799242, -36880.558492007585
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query64
Means = 115962.0, 262443.5
Time diff = -146481.5
Speedup = 0.4418551040509672
T-Test (test statistic, p value, df) = -25.121382283172483, 0.025328376725672595, 1.0
T-Test Confidence Interval = -220570.73235670896, -72392.26764329104
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query65
Means = 35580.0, 82682.0
Time diff = -47102.0
Speedup = 0.4303234077550132
T-Test (test statistic, p value, df) = -51.60218667812097, 0.012335525640321495, 1.0
T-Test Confidence Interval = -58700.10647615792, -35503.89352384208
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query66
Means = 41169.0, 67463.0
Time diff = -26294.0
Speedup = 0.6102456161155003
T-Test (test statistic, p value, df) = -15.459111993963361, 0.04112358088199367, 1.0
T-Test Confidence Interval = -47905.65191572499, -4682.348084275007
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query69
Means = 5726.0, 7168.0
Time diff = -1442.0
Speedup = 0.798828125
T-Test (test statistic, p value, df) = -277.5130293904801, 0.0022940076663786795, 1.0
T-Test Confidence Interval = -1508.023376524618, -1375.976623475382
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query71
Means = 7515.0, 10410.0
Time diff = -2895.0
Speedup = 0.7219020172910663
T-Test (test statistic, p value, df) = -417.85725732599167, 0.0015235311720830304, 1.0
T-Test Confidence Interval = -2983.031168699491, -2806.968831300509
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query72
Means = 35595.0, 64122.0
Time diff = -28527.0
Speedup = 0.5551136895293347
T-Test (test statistic, p value, df) = -43.803380662692696, 0.01453105217548824, 1.0
T-Test Confidence Interval = -36801.92985775214, -20252.070142247863
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query74
Means = 51937.0, 87044.5
Time diff = -35107.5
Speedup = 0.596671817288858
T-Test (test statistic, p value, df) = -16.580224601697168, 0.03834987389173396, 1.0
T-Test Confidence Interval = -62012.02593378188, -8202.974066218121
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query75
Means = 69209.0, 130502.5
Time diff = -61293.5
Speedup = 0.5303270052297848
T-Test (test statistic, p value, df) = -27.85345826412777, 0.022846227941127655, 1.0
T-Test Confidence Interval = -89254.39995817577, -33332.60004182423
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query78
Means = 121943.0, 238542.0
Time diff = -116599.0
Speedup = 0.5112013817273269
T-Test (test statistic, p value, df) = -23.645403595799497, 0.026907581835764048, 1.0
T-Test Confidence Interval = -179255.1843218626, -53942.815678137405
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query79
Means = 18694.0, 31362.5
Time diff = -12668.5
Speedup = 0.5960621761658031
T-Test (test statistic, p value, df) = -25.708829122069506, 0.02475021303093756, 1.0
T-Test Confidence Interval = -18929.716873751284, -6407.283126248716
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query80
Means = 16541.0, 33240.5
Time diff = -16699.5
Speedup = 0.4976158601705751
T-Test (test statistic, p value, df) = -31.559609886520967, 0.020165231636396767, 1.0
T-Test Confidence Interval = -23422.880509423612, -9976.119490576388
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query82
Means = 12153.0, 24937.5
Time diff = -12784.5
Speedup = 0.4873383458646617
T-Test (test statistic, p value, df) = -49.04408316581243, 0.012978763787733116, 1.0
T-Test Confidence Interval = -16096.672722318342, -9472.327277681658
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query86
Means = 7621.0, 11501.0
Time diff = -3880.0
Speedup = 0.6626380314755239
T-Test (test statistic, p value, df) = -21.133198532601398, 0.030101702837526164, 1.0
T-Test Confidence Interval = -6212.825970536507, -1547.1740294634933
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query87
Means = 27979.0, 46990.5
Time diff = -19011.5
Speedup = 0.5954182228322746
T-Test (test statistic, p value, df) = -21.128574865637287, 0.030108280319781196, 1.0
T-Test Confidence Interval = -30444.548034846368, -7578.45196515363
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query88
Means = 252115.0, 367616.5
Time diff = -115501.5
Speedup = 0.685809804510951
T-Test (test statistic, p value, df) = -26.818750097247364, 0.023726869273490603, 1.0
T-Test Confidence Interval = -170223.87524282097, -60779.12475717902
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query90
Means = 15768.0, 33103.5
Time diff = -17335.5
Speedup = 0.4763242557433504
T-Test (test statistic, p value, df) = -29.832058395042495, 0.021332134620458392, 1.0
T-Test Confidence Interval = -24719.11427466979, -9951.885725330209
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query91
Means = 1618.0, 2119.0
Time diff = -501.0
Speedup = 0.7635677206229353
T-Test (test statistic, p value, df) = -28.92524848640025, 0.022000375278840318, 1.0
T-Test Confidence Interval = -721.077921748727, -280.9220782512729
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query93
Means = 50638.0, 102276.5
Time diff = -51638.5
Speedup = 0.49510884709586267
T-Test (test statistic, p value, df) = -46.04401834061543, 0.013824156288149609, 1.0
T-Test Confidence Interval = -65888.54543323007, -37388.45456676993
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query95
Means = 209807.0, 354026.0
Time diff = -144219.0
Speedup = 0.5926316146271743
T-Test (test statistic, p value, df) = -24.685703668028058, 0.025774913774320417, 1.0
T-Test Confidence Interval = -218451.28300584562, -69986.71699415437
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query97
Means = 23448.0, 29446.0
Time diff = -5998.0
Speedup = 0.796305100862596
T-Test (test statistic, p value, df) = -15.59885997567286, 0.04075617088568512, 1.0
T-Test Confidence Interval = -10883.72986282174, -1112.2701371782596
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query98
Means = 3617.0, 4688.5
Time diff = -1071.5
Speedup = 0.7714620880878745
T-Test (test statistic, p value, df) = -16.94878940922422, 0.037517876548842255, 1.0
T-Test Confidence Interval = -1874.7844143828538, -268.2155856171463
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = query99
Means = 122054.0, 239049.5
Time diff = -116995.5
Speedup = 0.5105804446359437
T-Test (test statistic, p value, df) = -57.3164051073185, 0.011105985933025706, 1.0
T-Test Confidence Interval = -142931.6830780875, -91059.31692191251
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = benchmark
Means = 3802000.0, 7187000.0
Time diff = -3385000.0
Speedup = 0.5290107137887853
T-Test (test statistic, p value, df) = -27.525783960660327, 0.02311795793190776, 1.0
T-Test Confidence Interval = -4947553.244415962, -1822446.7555840379
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
```

Configuration used

No compression:

```
export SPARK_CONF=("--master" "spark://master-node:7077"
  "--conf" "spark.shuffle.spill.compress=false"
  "--conf" "spark.rapids.memory.host.spillStorageSize=1"
  "--conf" "spark.locality.wait=0"
  "--conf" "spark.plugins=com.nvidia.spark.SQLPlugin"
  "--conf" "spark.sql.adaptive.enabled=true"
  "--conf" "spark.sql.files.maxPartitionBytes=2gb"
  "--conf" "spark.driver.maxResultSize=2GB"
  "--conf" "spark.driver.memory=50G"
  "--conf" "spark.executor.cores=16"
  "--conf" "spark.executor.memory=16G"
  "--conf" "spark.executor.resource.gpu.amount=1"
  "--conf" "spark.task.resource.gpu.amount=0.0625"
  "--conf" "spark.rapids.memory.pinnedPool.size=8g"
  "--conf" "spark.rapids.sql.concurrentGpuTasks=4"
  "--conf" "spark.executor.extraJavaOptions=-Dai.rapids.cudf.nvtx.enabled=true"
  "--conf" "spark.driver.extraClassPath=$SPARK_RAPIDS_PLUGIN_JAR"
  "--conf" "spark.executor.extraClassPath=$SPARK_RAPIDS_PLUGIN_JAR"
  "--conf" "spark.shuffle.manager=com.nvidia.spark.rapids.spark321.RapidsShuffleManager"
  "--conf" "spark.rapids.shuffle.multiThreaded.writer.threads=32"
  "--conf" "spark.rapids.shuffle.multiThreaded.reader.threads=32"
  "--conf" "spark.rapids.shuffle.mode=MULTITHREADED")
```

Compression:

```
export SPARK_CONF=("--master" "spark://master-node:7077"
  "--conf" "spark.shuffle.spill.compress=true"
  "--conf" "spark.rapids.memory.host.spillStorageSize=1"
  "--conf" "spark.locality.wait=0"
  "--conf" "spark.plugins=com.nvidia.spark.SQLPlugin"
  "--conf" "spark.sql.adaptive.enabled=true"
  "--conf" "spark.sql.files.maxPartitionBytes=2gb"
  "--conf" "spark.driver.maxResultSize=2GB"
  "--conf" "spark.driver.memory=50G"
  "--conf" "spark.executor.cores=16"
  "--conf" "spark.executor.memory=16G"
  "--conf" "spark.executor.resource.gpu.amount=1"
  "--conf" "spark.task.resource.gpu.amount=0.0625"
  "--conf" "spark.rapids.memory.pinnedPool.size=8g"
  "--conf" "spark.rapids.sql.concurrentGpuTasks=4"
  "--conf" "spark.executor.extraJavaOptions=-Dai.rapids.cudf.nvtx.enabled=true"
  "--conf" "spark.driver.extraClassPath=$SPARK_RAPIDS_PLUGIN_JAR"
  "--conf" "spark.executor.extraClassPath=$SPARK_RAPIDS_PLUGIN_JAR"
  "--conf" "spark.shuffle.manager=com.nvidia.spark.rapids.spark321.RapidsShuffleManager"
  "--conf" "spark.rapids.shuffle.multiThreaded.writer.threads=32"
  "--conf" "spark.rapids.shuffle.multiThreaded.reader.threads=32"
  "--conf" "spark.rapids.shuffle.mode=MULTITHREADED")
```

### Tasks
- [ ] Run standard on Dataproc GPU cluster with slow disks
- [ ] Run with forced spill (reasonable use case) on Dataproc GPU cluster with slow disks

mattahrens commented 11 months ago

This work can help inform this autotuner enhancement: https://github.com/NVIDIA/spark-rapids-tools/issues/644.