databricks / koalas

Koalas: pandas API on Apache Spark
Apache License 2.0

Does Koalas support reading hive table by default? #2194

Open amznero opened 2 years ago

amznero commented 2 years ago

Hi,

I'm trying to use Koalas to load a Hive table on a remote cluster. https://koalas.readthedocs.io/en/latest/reference/io.html#spark-metastore-table says I can use the ks.read_table API to read a Spark metastore table, but the call fails:

import databricks.koalas as ks

koalas_df = ks.read_table("xxx.yyy")

Error log:

AnalysisException: "Table or view not found: `xxx`.`yyy`;;\n'UnresolvedRelation `xxx`.`yyy`\n"

However, I can load the same table successfully by going through PySpark, pandas, and PyArrow directly.

Some snippets:

from pyspark.sql import SparkSession

# Build the session with Hive support enabled
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
# Use Arrow to speed up the Spark-to-pandas conversion
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

spark_df = spark.read.table("xxx")
pandas_df = spark_df.toPandas()
...

I also checked the source: https://github.com/databricks/koalas/blob/e971d6f37ede45297bbf9d509ae2a7b51717f322/databricks/koalas/namespace.py#L556

read_table uses default_session() (created without any extra options) to load the table, and default_session() never sets the enableHiveSupport option:

https://github.com/databricks/koalas/blob/e971d6f37ede45297bbf9d509ae2a7b51717f322/databricks/koalas/utils.py#L433-L456

So I'm a little confused about ks.read_table: where does it load tables from? Does it only see the default Spark warehouse, not the Hive metastore?