QuXiangjie / Study-Review

自己欠缺的还太多了,希望通过总结每天的学习内容,整理每天的思绪来丰富自己的知识库。我想成为一名优秀的金融数据分析师,并行发展技术与商业业务。博客内容为:数理统计、财务业务、Python(数据分析及可视化)、Excel(数据分析)、SQL、英文
0 stars 0 forks source link

SQL Window function #15

Open QuXiangjie opened 4 months ago

QuXiangjie commented 4 months ago

TechTFQ Youtube

over clause

select e.*, max(salary) over() as max_salary
from employee e;

This will add a column with max salary

select e.*, max(salary) over(partition by dept_name) as max_salary
from employee e;

This will add a column with max salary in each department we can also use min, avg

Row number

select e.*, row_number() over () as rn
from employee e;

This will add a column with a range of number

select e.*, row_number() over (partition by dept_name) as rn
from employee e;

This will add a column with a range of number in each department

You can also order it by id

select *, row_number() over(partition by dept_name order by employee_id) as rn
from employee
where rn<3

rank

select * from (
         select e.*, rank() over (partition by dept_name order by salary desc) as rank
         from employee e) as x
where x.rnk<4;

dense rank

The only difference between dense rank and rank is that rank will skip duplicate value, but dense value don't skip a value

For example, salary. rank. dense rank

  1. 1 1
  2. 1 1
  3. 3 2
select * from (
         select e.*, dense_rank() over (partition by dept_name order by salary desc) as rank
         from employee e) as x
where x.rnk<4;

In summary, dense_rank, rank, row_number differences are like this:

image

lag

It is the record of previous record

select e.*, lag(salary) over (partition by dept_name order by emp_id) as prev_emp_salary
from employee e;
image

lag(salary, 2, 0) means 2 record before this record, if null, then zero

lead

it is the record of the later record

select e.*, lead(salary) over (partition by dept_name order by emp_id) as prev_emp_salary
from employee e;
image image

First_value and last_value

select *,
first_value(product_name) over (partition by product_category order by price desc) as most_exp_prodct,
last_value(product_name) over (partition by product_category order by price desc) as most_exp_prodct,
from product
image

but we notice that the least value is not correct, this is because of the frame Clause. last_value(product_name) over (partition by product_category order by price desc range between unbounded preceding and current row ) as most_exp_prodct

This statement means that the range will be between all other rows of processing and this current row In the example above, when we are processing the first record, the unbounded preceding rows are the first record when we are processing the second record, the unbounded preceding rows are the second record. So to solve this problem, we need to modify the statements to:

range between unbounded preceding and unbounded following

image

when there are duplicates value, the least value will consider the last one.

range between 2 preceding and 2 following

This will consider 2 prior and 2 latter rows of the current row

window clause

we can use window to define the window of the functions

image

nth value

write query to display the second most expensive product under each category

select *,
first_value(product_name) over w as most_exp_product,
last_value(product_name) over w as least_exp_product,
nth_value(product_name, 2) over w as second_most_exp_product
from product
window w as (partition by product_category order by price desc 
                      range between unbounded preceding and unbounded following);

if we put a number which is over the range, it will return a null value

NTILE

Write a query to segregate all the expensive phones, mid range phones and the cheaper phones

select *,
ntile(3) over (order by price desc) as buckets
from product
where product_category='Phone'

此方法用于分组

We can also use case to have a conditional name

image

CUME_DIST

It is used for cumulative distribution SELECT ROUND((SELECT COUNT() FROM immediate_cte) / (SELECT COUNT() FROM first_cte) 100, 2) AS immediate_percentage; Please notice that the MySQL doesn't support the ::numeric100 function

select *,
cume_dist() over ( order by price desc ) as come_distribution
round(cume_dist() over ( order by price desc ) :: numeric*100,2)as come_distribution
from product

::numeric can be used for calculating select (cume_dist_percentatge || '%') as come_dist_percentage 可以在列的后面增加符号

percent_rank

It is used for ralative rank of the current row / Percentage Ranking formula is current row number-1 / total row number -1

select product_name, per_rank
from (
select *,
round(percent_rank() over (order by price) :: numeric *100,2) as per_rank
from product) x
where x.product_name ='...'

the difference between percent_rank and cume_dist The difference is that the percent_rank assumes the current two isn't in the partition.

image