cryeo / sparksql-magic

Spark SQL magic command for Jupyter notebooks
MIT License

SparkSQL simple way of reading CSV as local Spark Dataframes #4

Open CarloNicolini opened 2 years ago

CarloNicolini commented 2 years ago

Dear @cryeo,

I really like your library, as it makes it possible to integrate SQL syntax directly into cells; that's a nice piece of work!

However, I would like to hear from you what the best way is to read local .csv files into Spark DataFrames by means of the SparkSQL syntax, without creating a local Hive database. I've noticed that the two folders metastore_db and spark-warehouse are always produced in the notebook's folder when I create a table; do they act as a local database?

When I run this cell:

%%sparksql
CREATE TABLE IF NOT EXISTS account
USING csv
OPTIONS (
    path 'data.csv',
    header true,
    inferSchema true,
    sep ','
);

I get this warning in the Jupyter notebook:

21/12/20 09:03:12 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
21/12/20 09:03:12 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
21/12/20 09:03:16 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
21/12/20 09:03:16 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore myuser@localhost
21/12/20 09:03:16 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
21/12/20 09:03:25 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider csv. Persisting data source table `default`.`account` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
21/12/20 09:03:26 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
21/12/20 09:03:27 WARN HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
21/12/20 09:03:27 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
21/12/20 09:03:27 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
-chgrp: 'Users\Domain' does not match expected pattern for group
Usage: hadoop fs [generic options]
    [-appendToFile <localsrc> ... <dst>]
    [-cat [-ignoreCrc] <src> ...]
    [-checksum <src> ...]
    [-chgrp [-R] GROUP PATH...]
    [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
    [-chown [-R] [OWNER][:[GROUP]] PATH...]
    [-copyFromLocal [-f] [-p] [-l] [-d] [-t <thread count>] <localsrc> ... <dst>]
    [-copyToLocal [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
    [-count [-q] [-h] [-v] [-t [<storage type>]] [-u] [-x] [-e] <path> ...]
    [-cp [-f] [-p | -p[topax]] [-d] <src> ... <dst>]
    [-createSnapshot <snapshotDir> [<snapshotName>]]
    [-deleteSnapshot <snapshotDir> <snapshotName>]
    [-df [-h] [<path> ...]]
    [-du [-s] [-h] [-v] [-x] <path> ...]
    [-expunge]
    [-find <path> ... <expression> ...]
    [-get [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
    [-getfacl [-R] <path>]
    [-getfattr [-R] {-n name | -d} [-e en] <path>]
    [-getmerge [-nl] [-skip-empty-file] <src> <localdst>]
    [-head <file>]
    [-help [cmd ...]]
    [-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [-e] [<path> ...]]
    [-mkdir [-p] <path> ...]
    [-moveFromLocal <localsrc> ... <dst>]
    [-moveToLocal <src> <localdst>]
    [-mv <src> ... <dst>]
    [-put [-f] [-p] [-l] [-d] <localsrc> ... <dst>]
    [-renameSnapshot <snapshotDir> <oldName> <newName>]
    [-rm [-f] [-r|-R] [-skipTrash] [-safely] <src> ...]
    [-rmdir [--ignore-fail-on-non-empty] <dir> ...]
    [-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
    [-setfattr {-n name [-v value] | -x name} <path>]
    [-setrep [-R] [-w] <rep> <path> ...]
    [-stat [format] <path> ...]
    [-tail [-f] <file>]
    [-test -[defsz] <path>]
    [-text [-ignoreCrc] <src> ...]
    [-touch [-a] [-m] [-t TIMESTAMP ] [-c] <path> ...]
    [-touchz <path> ...]
    [-truncate [-w] <length> <path> ...]
    [-usage [cmd ...]]

Generic options supported are:
-conf <configuration file>        specify an application configuration file
-D <property=value>               define a value for a given property
-fs <file:///|hdfs://namenode:port> specify default filesystem URL to use, overrides 'fs.defaultFS' property from configurations.
-jt <local|resourcemanager:port>  specify a ResourceManager
-files <file1,...>                specify a comma-separated list of files to be copied to the map reduce cluster
-libjars <jar1,...>               specify a comma-separated list of jar files to be included in the classpath
-archives <archive1,...>          specify a comma-separated list of archives to be unarchived on the compute machines

The general command line syntax is:
command [genericOptions] [commandOptions]

Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...

I do understand that I don't have Apache Hive installed, but isn't it possible to simply read the CSV file as a Spark DataFrame without all these warnings? Doing it with the PySpark API is much easier: a spark.read.csv('myfile.csv') suffices, and no local databases are created.

Wh1isper commented 1 year ago

@CarloNicolini Hi, I have developed a similar module with support for the latest Spark server-client mode; maybe you can try it out:

Wh1isper/sparglim#sql-magic

Install with:

pip install sparglim["magic"]

It supports the SQL statements you need 🎉

%%sql
CREATE TABLE tb_people
USING json
OPTIONS (path "/path/to/file.json");
SHOW TABLES;