byzer-org / byzer-lang

Byzer (formerly MLSQL): a low-code, open-source programming language for data pipelines, analytics, and AI.
https://www.byzer.org
Apache License 2.0

"!delta show tables;" command shows incorrect table names #1417

Closed lwz9103 closed 2 years ago

lwz9103 commented 3 years ago

Follow the steps below to reproduce this problem:

  1. Save a delta table:
    
    set rawText='''
    {"id":1,"content":"MLSQL是一个好的语言","label":0.0},
    {"id":2,"content":"Spark是一个好的语言","label":1.0}
    {"id":3,"content":"MLSQL语言","label":0.0}
    {"id":4,"content":"MLSQL是一个好的语言","label":0.0}
    {"id":5,"content":"MLSQL是一个好的语言","label":1.0}
    {"id":6,"content":"MLSQL是一个好的语言","label":0.0}
    {"id":7,"content":"MLSQL是一个好的语言","label":0.0}
    {"id":8,"content":"MLSQL是一个好的语言","label":1.0}
    {"id":9,"content":"Spark好的语言","label":0.0}
    {"id":10,"content":"MLSQL是一个好的语言","label":0.0}
    ''';

    load jsonStr.`rawText` as orginal_text_corpus;

    save append orginal_text_corpus as delta.`/tmp/delta/table10`;


  2. Show delta tables:
![image](https://user-images.githubusercontent.com/23639010/113278272-4fa87080-9314-11eb-84b3-142e49b6e04e.png)

`lwz9103` is my user directory, and there are no delta tables in it. In database `default`, I saved a delta table at `/tmp/delta/table10`, not `tmp`.

allwefantasy commented 3 years ago

In MLSQL, there are two ways to use Delta Lake:

  1. directory mode
  2. table mode

When you set up `-streaming.datalake.path`, table mode is enabled, and you should use it like this:

save append orginal_text_corpus as delta.`db.table`;

Otherwise, you should use directory mode.

So, when `!delta show tables` is run, maybe we can check whether `-streaming.datalake.path` is set. If it is, the command will work as expected; if not, maybe we can give the user a helpful message.
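
A minimal sketch of such a guard, assuming the managed layout `<root>/<db>/<table>`; the object and helper names (`DeltaShowTablesGuard`, `listDeltaTables`) are illustrative, not the actual Byzer runtime API:

    object DeltaShowTablesGuard {
      // Return table names in table mode, or a hint when only directory mode is in use.
      def showTables(datalakePathOpt: Option[String]): Either[String, Seq[String]] =
        datalakePathOpt match {
          case Some(root) => Right(listDeltaTables(root))
          case None =>
            Left("!delta show tables requires table mode; set -streaming.datalake.path, " +
              "or load your tables by path in directory mode.")
        }

      // Hypothetical helper: in table mode, every <root>/<db>/<table> directory
      // is treated as a managed delta table named "db.table".
      private def listDeltaTables(root: String): Seq[String] = {
        val dbs = Option(new java.io.File(root).listFiles()).toSeq.flatten.filter(_.isDirectory)
        dbs.flatMap { db =>
          Option(db.listFiles()).toSeq.flatten.filter(_.isDirectory)
            .map(table => s"${db.getName}.${table.getName}")
        }
      }
    }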

lwz9103 commented 3 years ago

Yes, this parameter has been set, as follows: `/Users/wenzheng.liu/mlsql`.

Here's my startup program.

package streaming.core

object WilliamLocalSparkServiceApp {
  def main(args: Array[String]): Unit = {
    StreamingApp.main(Array(
      "-streaming.master", "local[*]",
      "-streaming.name", "god",
      "-streaming.rest", "true",
      "-streaming.thrift", "false",
      "-streaming.platform", "spark",
      "-spark.mlsql.enable.runtime.directQuery.auth", "true",
      //      "-streaming.ps.cluster.enable","false",
      "-streaming.enableHiveSupport","true",
//      "-spark.mlsql.datalake.overwrite.hive", "true",
//      "-spark.mlsql.auth.access_token", "mlsql",
      //"-spark.mlsql.enable.max.result.limit", "true",
      //"-spark.mlsql.restful.api.max.result.size", "7",
      //      "-spark.mlsql.enable.datasource.rewrite", "true",
      //      "-spark.mlsql.datasource.rewrite.implClass", "streaming.core.datasource.impl.TestRewrite",
      //"-streaming.job.file.path", "classpath:///test/init.json",
      "-streaming.spark.service", "true",
      "-streaming.job.cancel", "true",
      "-streaming.datalake.path", "/Users/wenzheng.liu/mlsql",

//      "-streaming.plugin.clzznames","tech.mlsql.plugins.ds.MLSQLExcelApp",

      // scheduler
      "-streaming.workAs.schedulerService", "false",
      "-streaming.workAs.schedulerService.consoleUrl", "http://127.0.0.1:9002",
      "-streaming.workAs.schedulerService.consoleToken", "mlsql",

      //      "-spark.sql.hive.thriftServer.singleSession", "true",
      "-streaming.rest.intercept.clzz", "streaming.rest.ExampleRestInterceptor",
      //      "-streaming.deploy.rest.api", "true",
      "-spark.driver.maxResultSize", "2g",
      "-spark.serializer", "org.apache.spark.serializer.KryoSerializer",
      //      "-spark.sql.codegen.wholeStage", "true",
      "-spark.ui.allowFramingFrom","*",
      "-spark.kryoserializer.buffer.max", "2000m",
      "-streaming.driver.port", "9003"
      //      "-spark.files.maxPartitionBytes", "10485760"

      //meta store
      //      "-streaming.metastore.db.type", "mysql",
      //      "-streaming.metastore.db.name", "app_runtime_full",
      //      "-streaming.metastore.db.config.path", "./__mlsql__/db.yml"

      //      "-spark.sql.shuffle.partitions", "1",
      //      "-spark.hadoop.mapreduce.job.run-local", "true"

      //"-streaming.sql.out.path","file:///tmp/test/pdate=20160809"

      //"-streaming.jobs","idf-compute"
      //"-streaming.driver.port", "9005"
      //"-streaming.zk.servers", "127.0.0.1",
      //"-streaming.zk.conf_root_dir", "/streamingpro/jack"
    ))
  }
}

For the second point, it should have reported an error by now.

*(screenshot)*

lwz9103 commented 3 years ago

So, I want to know how to use directory mode and list tables correctly.

lwz9103 commented 3 years ago

Dev Design

Spark delta itself does not have a catalog; mlsql manages delta tables by configuring a root storage path for them (the `-streaming.datalake.path` config). Due to user habits, we still need to retain directory mode for using delta.
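
A minimal sketch of that resolution, assuming the managed layout `<datalake.path>/<db>/<table>`; the function name and layout are illustrative assumptions, not Byzer's actual implementation:

    // Illustrative resolver: a bare "db.table" reference maps under the managed
    // root (table mode), while an absolute path is used as-is (directory mode).
    def resolveDeltaPath(datalakeRoot: Option[String], ref: String): String =
      datalakeRoot match {
        case Some(root) if !ref.startsWith("/") =>
          val Array(db, table) = ref.split("\\.", 2) // assumes a "db.table" reference
          s"${root.stripSuffix("/")}/$db/$table"
        case _ => ref // directory mode: ref is already a filesystem path
      }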

Consider behavior in the following situations:

  1. Enable `-streaming.datalake.path`, and then use directory mode normally.

    load delta.`/tmp/workflow` where mode="path" as t1;
    save t1 as delta.`/tmp/t1` where mode="path";

    Expected: works normally.

  2. Enable `-streaming.datalake.path`, and then use directory mode without the `mode` parameter.

    load delta.`/tmp/workflow` as t1;
    save t1 as delta.`/tmp/t1`;

    Expected: give an error message and suggest that the user switch to table mode or add the `mode` parameter explicitly.

  3. Enable `-streaming.datalake.path`, and save a delta table under `streaming.datalake.path`.

    set streaming.datalake.path=/mlsql/delta

    save t1 as delta.`/mlsql/delta/xxx` where mode="path";

    Expected: give an error message and suggest that the user switch to table mode or choose another path (see the sketch below).
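
A minimal sketch of the scenario-3 check, rejecting path-mode writes that land inside the managed root; the function name is an illustrative assumption:

    // Reject directory-mode ("path") writes that fall inside the managed datalake
    // root, since such tables would bypass table-mode bookkeeping.
    def validatePathModeTarget(datalakeRoot: String, target: String): Unit = {
      val root = datalakeRoot.stripSuffix("/") + "/"
      require(!target.startsWith(root),
        s"'$target' is inside -streaming.datalake.path; " +
          "use table mode (delta.`db.table`) or choose a path outside the datalake root.")
    }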

lwz9103 commented 3 years ago

Test Evidence

  1. Enable `-streaming.datalake.path`, and then use directory mode normally.

*(screenshot)*

  2. Enable `-streaming.datalake.path`, and then use directory mode without the `mode` parameter.

*(screenshot)*

  3. Enable `-streaming.datalake.path`, and save a delta table under `streaming.datalake.path`.

*(screenshot)*