Futurewei-io / blue-marlin

Blue Marlin is a critical web infrastructure for advertising based monetization. It is a cloud platform that adds intelligence to a plain Ad System.
Apache License 2.0

Prediction for all slot_ids may not be meaningful #297

Open jimmylao opened 2 years ago

jimmylao commented 2 years ago

The traffic distribution across slot_ids is extremely imbalanced. There are 91,833 slot_ids in total, but the majority of traffic is concentrated in a very small number of them. As shown in the table below, the top 39 slot_ids contribute 60%+ of total traffic.

si | cnt | cnt % | accumulated cnt %
-- | -- | -- | --
x95gdcf5ck | 95,165,544,824 | 9.20773% | 9.21%
o1hlm2e8l2 | 83,182,959,720 | 8.04836% | 17.26%
o4whh18epb | 64,073,372,952 | 6.19941% | 23.46%
b1p1rnmp3e | 43,507,996,387 | 4.20961% | 27.67%
m4m9lasyit | 31,181,452,281 | 3.01696% | 30.68%
g69br6d26q | 18,259,811,606 | 1.76673% | 32.45%
e24eujv8n2 | 16,458,924,882 | 1.59248% | 34.04%
r5heo0835x | 16,332,621,000 | 1.58026% | 35.62%
g6wthupg81 | 15,840,100,154 | 1.53261% | 37.15%
v9xc4b090z | 14,982,876,279 | 1.44967% | 38.60%
y8fe26merw | 14,518,143,386 | 1.40470% | 40.01%
p3jrj32xbg | 12,078,586,971 | 1.16866% | 41.18%
a6thryb3g3 | 11,746,995,224 | 1.13658% | 42.31%
d8scgz0ej7 | 10,035,368,094 | 0.97097% | 43.28%
k0cjgdagyu | 9,980,781,160 | 0.96569% | 44.25%
n80j6t0l2j | 9,604,581,119 | 0.92929% | 45.18%
m4k5b7oaav | 9,393,366,289 | 0.90885% | 46.09%
u0uwv3o0f2 | 8,792,138,522 | 0.85068% | 46.94%
v2q55qhxxa | 8,680,444,887 | 0.83988% | 47.78%
x7sn2kq4kn | 8,628,738,889 | 0.83487% | 48.61%
z0dzqwn4q1 | 8,302,335,136 | 0.80329% | 49.42%
a47eavw7ex | 8,186,660,584 | 0.79210% | 50.21%
k2cs75mwwc | 8,065,338,800 | 0.78036% | 50.99%
k46r8x1z8b | 7,987,742,268 | 0.77285% | 51.76%
x4aptgermv | 7,861,023,699 | 0.76059% | 52.52%
b9367fkimq | 7,195,958,657 | 0.69624% | 53.22%
o6etkl3n7d | 7,033,881,590 | 0.68056% | 53.90%
x97s0ecpob | 7,012,508,521 | 0.67849% | 54.58%
q1kpz6g3yf | 6,805,819,784 | 0.65850% | 55.24%
T7q4l0r6pu | 6,739,759,342 | 0.65210% | 55.89%
67bcd2720e5011e79bc8fa163e05184e | 6,421,740,549 | 0.62133% | 56.51%
66bcd2720e5011e79bc8fa163e05184e | 6,399,483,117 | 0.61918% | 57.13%
r8mstx2zb8 | 6,033,547,675 | 0.58378% | 57.71%
n9w2d6zrmg | 5,596,013,857 | 0.54144% | 58.25%
j9vx0jenar | 5,211,224,893 | 0.50421% | 58.76%
n4o5q4fwe3 | 5,082,208,173 | 0.49173% | 59.25%
x7i1r3mwer | 4,885,947,289 | 0.47274% | 59.72%
n3yiww0227 | 4,841,354,331 | 0.46842% | 60.19%
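For anyone who wants to reproduce the `cnt %` and `accumulated cnt %` columns from raw per-slot counts, here is a minimal sketch. The function name `traffic_share_table` and the toy counts are hypothetical and not taken from the blue-marlin code base or the real data set; it only assumes the counts are available as a mapping from slot_id to total count.

```python
from itertools import accumulate


def traffic_share_table(counts):
    """Return rows (slot_id, cnt, share, cumulative_share), sorted by cnt descending.

    counts: dict mapping slot_id -> total traffic count (assumed input shape).
    """
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("total traffic must be positive")
    # Rank slot_ids from highest to lowest traffic.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    # Running sum of counts in rank order.
    running = accumulate(cnt for _, cnt in ranked)
    return [
        (si, cnt, cnt / total, cum / total)
        for (si, cnt), cum in zip(ranked, running)
    ]


# Toy counts for illustration only, not real slot_ids.
toy_counts = {"slot_a": 600, "slot_b": 300, "slot_c": 60, "slot_d": 40}
for si, cnt, share, cum_share in traffic_share_table(toy_counts):
    print(f"{si} | {cnt:,} | {share:.2%} | {cum_share:.2%}")
```

On the full data set the same computation would more likely be done in Spark or SQL; the plain-Python version is only meant to make the column definitions explicit.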

The figure below shows the accumulated traffic % curve, which leads to the same conclusion: a very small number of slot_ids account for the majority of total traffic.

[image: accumulated traffic % curve]

Here's the quantitative relationship between "# of top slot_ids" and "% of total traffic".

[image: # of top slot_ids vs. % of total traffic]

For long-term improvement, it could be a good idea to design an adaptive filter that, based on the actual traffic, dynamically returns a list of stable slot_ids for prediction. For the sparse slot_ids (99.5%+ of all slot_ids), linear regression, or even simply the mean value of each slot_id, is good enough for prediction.
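One possible shape for such a filter is sketched below. This is an assumption about what "adaptive" could mean here, not the project's implementation: slot_ids are ranked by traffic and the smallest top-ranked set that reaches a target share of total traffic is treated as dense (candidates for the deep learning model), while every other slot_id falls back to a mean forecast. The names `split_slot_ids`, `mean_forecast`, the `coverage` parameter, and the toy history are all hypothetical.

```python
from statistics import mean


def split_slot_ids(history, coverage=0.6):
    """Split slot_ids into (dense, sparse) by cumulative traffic coverage.

    history: dict mapping slot_id -> list of per-period traffic counts.
    coverage: hypothetical threshold; dense slot_ids are the smallest
        top-ranked set whose traffic reaches this fraction of the total.
    """
    totals = {si: sum(series) for si, series in history.items()}
    grand_total = sum(totals.values())
    ranked = sorted(totals, key=totals.get, reverse=True)

    dense, running = [], 0
    for si in ranked:
        # Stop once the selected slot_ids already cover the target share.
        if grand_total == 0 or running / grand_total >= coverage:
            break
        dense.append(si)
        running += totals[si]

    dense_set = set(dense)
    sparse = [si for si in ranked if si not in dense_set]
    return dense, sparse


def mean_forecast(series):
    """Baseline for sparse slot_ids: predict the historical mean."""
    return mean(series) if series else 0.0


# Toy per-period counts for illustration only.
toy_history = {
    "slot_a": [500, 520, 480],
    "slot_b": [300, 310, 290],
    "slot_c": [3, 0, 6],
    "slot_d": [0, 1, 2],
}
dense, sparse = split_slot_ids(toy_history, coverage=0.9)
print("dense:", dense)
print("sparse forecasts:", {si: mean_forecast(toy_history[si]) for si in sparse})
```

Because the split is recomputed from whatever history is passed in, the list of dense slot_ids follows the data instead of being hard-coded. A stability criterion based on variance over time could replace or complement the coverage rule.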

Building a deep learning model for all slot_ids will lead the model to learn the noise distribution, and it will not be able to generate meaningful predictions.