Pinot as "BIG DATA" solution - particularly supporting unique metrics

Hi Guys, Hoping you can help me with the research that I started for my project on Pinot. We are talking about a system with retention of 12 month and around 100~ milliard rows of data per year.

The objective is to make analytic queries on web service; therefore there are a few items I need to fill, please see the list below: From user/concurrency point of view: • Is there a limitation for concurrent users running queries? • In case I need to add more users, what is the cost? (More hardware/servers/memory?)

From data point of view: • Is there a limitation on how many dimensions/lookup tables I can design? Is there any kind of performance implications? • What will be the cost of adding another dimension/lookup table? (More hardware/servers/memory or redesign of the solution?) • What about the history, as I mention before we are talking about one year retention, but if I need to add more retention time, example 3 years, what will be the implication?

From Programming point of view: • Are the windows functions supported? • If not, is there a workaround for it? • If partially, which one is supported or not supported. • Are Joins between tables supported? (Is there a limitation for table size?) • Are nested queries supported? • Are functions like Count, sum, avg., max, min, etc. supported? How is it working with uniqueness? (Count (distinct X), Sum (distinct X), etc.) • Are ranking functions, like in MSSQL, supported? (row_number, rank, dense_tank, ntile) • How Pinot is supporting the comparison between populations? Example, I need to get the population of users which didn’t purchase in the last 3 days but they click on my site at least 1 times. (when purchase and click are different events types in the system) • Flexibility from integration point of view (could it work with Java/python/etc.)

Some words about the company I’m working for, it is a personalized retargeting specialist providing web and app advertisers with display ads (banners), in real time, for visitors who have left their sites without completing a purchase. These users are served ads as they continue surfing the web or browsing other apps. Personalized retargeting is a form of online targeted advertising, in which online advertising is delivered to consumers based on their previous actions (such as pages browsed, products added to basket) on a company's website or app. I will appreciate any help you can provide, since the answer based on your experience to pointing me to documents/tutorials/white papers/etc.

If any additional details are needed please let me know.

Thanks! Guy

From user/concurrency point of view: • Is there a limitation for concurrent users running queries?

No, other than hardware limits. Some Pinot clusters at Linkedin process several thousands of queries per second.

• In case I need to add more users, what is the cost? (More hardware/servers/memory?)

Add more hardware, it scales near-linearly.

From data point of view: • Is there a limitation on how many dimensions/lookup tables I can design? Is there any kind of performance implications?

Tables are independent, there's no real performance implication to adding tables (as long as you're not planning on having thousands of tables, at least, zk may be a contention point with that many tables). Judging from your question though, it seems as if you're doing a star schema-style design. Pinot doesn't support joins, so you should flatten your dimensions into the table itself.

• What will be the cost of adding another dimension/lookup table? (More hardware/servers/memory or redesign of the solution?)

See above, no joins.

• What about the history, as I mention before we are talking about one year retention, but if I need to add more retention time, example 3 years, what will be the implication?

It'll increase query latency if those rows need to be scanned as part of query execution. If they don't need to be, then they're nearly free (they get rejected quickly at query processing time).

From Programming point of view: • Are the windows functions supported?

Not at the moment. Windowing functions require a second pass during query execution and can be implemented on the client side.

• If not, is there a workaround for it?

Compute them on the client side.

• If partially, which one is supported or not supported. • Are Joins between tables supported? (Is there a limitation for table size?)

No, and we don't foresee adding support for them this year.

• Are nested queries supported?

No, and we don't foresee adding support for them this year.

• Are functions like Count, sum, avg., max, min, etc. supported? How is it working with uniqueness? (Count (distinct X), Sum (distinct X), etc.)

Aggregate functions (count, sum, min, max, avg) are supported. You can do select distinctCount(X) for distinct count, select distinctCountHLL(X) for HyperLogLog approximation.

• Are ranking functions, like in MSSQL, supported? (row_number, rank, dense_tank, ntile)

No, and we don't foresee adding support for them this year.

• How Pinot is supporting the comparison between populations? Example, I need to get the population of users which didn’t purchase in the last 3 days but they click on my site at least 1 times. (when purchase and click are different events types in the system)

Put both events in the same table as part of your ETL process, roll up the rows on Hadoop, select sum(clickCount), sum(purchaseCount) where date >= 20160325 group by userId

• Flexibility from integration point of view (could it work with Java/python/etc.)

We have a Java API and a REST API for other languages.

Thanks Jean-François for your detailed answers! It's good info to start with...

On Thu, Jun 2, 2016 at 8:38 PM, Jean-François Im notifications@github.com wrote:

From user/concurrency point of view: • Is there a limitation for concurrent users running queries?

No, other than hardware limits. Some Pinot clusters at Linkedin process several thousands of queries per second.

• In case I need to add more users, what is the cost? (More hardware/servers/memory?)

Add more hardware, it scales near-linearly.

From data point of view: • Is there a limitation on how many dimensions/lookup tables I can design? Is there any kind of performance implications?

Tables are independent, there's no real performance implication to adding tables (as long as you're not planning on having thousands of tables, at least, zk may be a contention point with that many tables). Judging from your question though, it seems as if you're doing a star schema-style design. Pinot doesn't support joins, so you should flatten your dimensions into the table itself.

• What will be the cost of adding another dimension/lookup table? (More hardware/servers/memory or redesign of the solution?)

See above, no joins.

• What about the history, as I mention before we are talking about one year retention, but if I need to add more retention time, example 3 years, what will be the implication?

It'll increase query latency if those rows need to be scanned as part of query execution. If they don't need to be, then they're nearly free (they get rejected quickly at query processing time).

From Programming point of view: • Are the windows functions supported?

Not at the moment. Windowing functions require a second pass during query execution and can be implemented on the client side.

• If not, is there a workaround for it?

Compute them on the client side.

• If partially, which one is supported or not supported. • Are Joins between tables supported? (Is there a limitation for table size?)

No, and we don't foresee adding support for them this year.

• Are nested queries supported?

No, and we don't foresee adding support for them this year.

• Are functions like Count, sum, avg., max, min, etc. supported? How is it working with uniqueness? (Count (distinct X), Sum (distinct X), etc.)

Aggregate functions (count, sum, min, max, avg) are supported. You can do select distinctCount(X) for distinct count, select distinctCountHLL(X) for HyperLogLog approximation.

• Are ranking functions, like in MSSQL, supported? (row_number, rank, dense_tank, ntile)

No, and we don't foresee adding support for them this year.

• How Pinot is supporting the comparison between populations? Example, I need to get the population of users which didn’t purchase in the last 3 days but they click on my site at least 1 times. (when purchase and click are different events types in the system)

Put both events in the same table as part of your ETL process, roll up the rows on Hadoop, select sum(clickCount), sum(purchaseCount) where date >= 20160325

• Flexibility from integration point of view (could it work with Java/python/etc.)

We have a Java API and a REST API for other languages.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/linkedin/pinot/issues/126#issuecomment-223365537, or mute the thread https://github.com/notifications/unsubscribe/ASrsEz5zJWMU2xrcGXGaN9PpTqdtZIBRks5qHxUkgaJpZM4IneHn .

http://www.mythings.com?utm_source=WiseStamp&utm_medium=email&utm_term=&utm_content=&utm_campaign=signature https://www.facebook.com/mythingscom/?utm_source=WiseStamp&utm_medium=email&utm_term=&utm_content=&utm_campaign=signature https://twitter.com/mythingsmedia?utm_source=WiseStamp&utm_medium=email&utm_term=&utm_content=&utm_campaign=signature https://www.linkedin.com/company/mythings?utm_source=WiseStamp&utm_medium=email&utm_term=&utm_content=&utm_campaign=signature We're hiring! https://www.comeet.co/jobs/mythings/50.00A?utm_source=WiseStamp&utm_medium=email&utm_term=&utm_content=&utm_campaign=signature Guy Rasis DBA & BI Team Leader, myThings +972 545789867 guy.rasis

@jfim I have a fews question.

How about support datasketches for counting unique more effeicently? Ref: http://druid.io/docs/0.9.1.1/development/extensions-core/datasketches-aggregators.html
What about the TopN instead of group by? Ref: http://druid.io/docs/0.9.1.1/querying/topnquery.html
I have 10 billion rows in table Is that limit the rows when group by? Does TopN help here?

Thanks.

Pinot does not have support for sketches as of now. Its work in progress.

topN is supported, by default group by aggregation query in Pinot always does topN N=10 by default.

Yes, you use top N to limit the rows returned by group By.

On Tue, Aug 30, 2016 at 8:52 PM, Chanh Le notifications@github.com wrote:

@jfim https://github.com/jfim I have a fews question.

-

How about support datasketches for counting unique more effeicently? Ref: http://druid.io/docs/0.9.1.1/development/extensions-core/ datasketches-aggregators.html

What about the TopN instead of group by? Ref: http://druid.io/docs/0.9.1.1/querying/topnquery.html

I have 10 billion rows in table Is that limit the rows when group by? Does TopN help here?

Thanks.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/linkedin/pinot/issues/126#issuecomment-243651398, or mute the thread https://github.com/notifications/unsubscribe-auth/AAz7ZxkGVG8D4d5-GP-8zVl89OP9VuwAks5qlPqBgaJpZM4IneHn .

apache / pinot

Pinot as "BIG DATA" solution - particularly supporting unique metrics #126

How about support datasketches for counting unique more effeicently? Ref: http://druid.io/docs/0.9.1.1/development/extensions-core/ datasketches-aggregators.html

What about the TopN instead of group by? Ref: http://druid.io/docs/0.9.1.1/querying/topnquery.html