byzer-org / byzer-lang

Byzer (former MLSQL): A low-code open-source programming language for data pipeline, analytics and AI.
https://www.byzer.org
Apache License 2.0

Two tables join, and the columns will be misaligned. #1679

Open AdmondGuo opened 2 years ago

AdmondGuo commented 2 years ago

Here is my code:

select 
    owner,
    owner_email,
    owner_mgr,
    owner_mgr_email,
    week_begin,
    actual_hour,
    working_days*8 as working_hour
from(
    select 
        owner,
        owner_email,
        owner_mgr,
        owner_mgr_email,
        week_begin,
        sum(ts_hour) as actual_hour
    from workload_union
    where owner_active=1
    group by owner, owner_email, owner_mgr, owner_mgr_email, week_begin
) ht left join week_calendar wc on ht.week_begin = wc.week
as hour_table;

select
    tpe as tpe,
    wl.description,
    item as item,
    ky_project_id as ky_project_id,
    ky_project_name as ky_project_name,
    wl.ky_customer_id as ky_customer_id,
    occur_date as occur_date,
    ...
    outlier as outlier
from 
workload_union wl left join hour_table ht on wl.week_begin = ht.week_begin and wl.owner_email = ht.owner_email
as workload3;

I expect description to appear in the second column, but it always appears at the end of the columns. This may be caused by Spark RDD sorting.

AdmondGuo commented 2 years ago

Look at this: https://stackoverflow.com/questions/52434075/scala-spark-order-changes-when-writing-a-dataframe-to-a-csv-file

You can fix this by using the TableRepartition ET. Here is the doc: https://docs.byzer.org/#/byzer-lang/zh-cn/extension/et/TableRepartition?id=%e8%a1%a8%e5%88%86%e5%8c%ba%e6%8f%92%e4%bb%b6-tablerepartition
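
A minimal sketch of how that could look for the workload3 table above, assuming the `run ... as TableRepartition` form described in the linked doc. The partitionNum value and the output table name workload3_fixed are illustrative assumptions, not taken from the original script:

```sql
-- Sketch only: repartition the joined result into a single partition.
-- partitionNum="1" and the name workload3_fixed are assumed for illustration.
run workload3 as TableRepartition.`` where partitionNum="1"
as workload3_fixed;

-- Read from the repartitioned table and name the columns explicitly,
-- in the order they are expected to appear.
select tpe, description, item
from workload3_fixed
as workload3_check;
```

Listing the columns explicitly in the final select makes it easier to see whether description lands in the second position after repartitioning.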