aws / sagemaker-spark-container

The SageMaker Spark Container is a Docker image used to run data processing workloads with the Spark framework on Amazon SageMaker.
Apache License 2.0

support Spark 3.2 and EMR 6.7 #98

Closed xiaoxshe closed 2 years ago

xiaoxshe commented 2 years ago

Issue #, if available:

Description of changes:

The motivation for this task is that Spark >= 3.2 adds the pandas API on Apache Spark (covering roughly 90% of the pandas API), which makes Spark easier to adopt for pandas users and unifies the small-data and big-data APIs (learn more here). Spark 3.2 also completes the ANSI SQL compatibility mode to simplify migration of SQL workloads, productionizes adaptive query execution to speed up Spark SQL at runtime, and introduces the RocksDB state store to make state processing more scalable.
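As an aside, the Spark 3.2 features listed above are opt-in (or tunable) through standard Spark configuration keys; a minimal sketch of enabling them from PySpark, assuming a local Spark >= 3.2 installation, might look like:

```python
# Illustrative config sketch (not part of this PR): turning on the
# Spark 3.2 features mentioned above via real Spark config keys.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # ANSI SQL compatibility mode
    .config("spark.sql.ansi.enabled", "true")
    # Adaptive query execution (enabled by default in 3.2)
    .config("spark.sql.adaptive.enabled", "true")
    # RocksDB-backed state store for Structured Streaming
    .config(
        "spark.sql.streaming.stateStore.providerClass",
        "org.apache.spark.sql.execution.streaming.state."
        "RocksDBStateStoreProvider",
    )
    .getOrCreate()
)
```

The pandas API on Spark from the same release is available as `import pyspark.pandas as ps`, letting pandas-style code run on the Spark engine.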

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

xiaoxshe commented 2 years ago

running test-unit

```
test/unit/test_bootstrapper.py::test_recursive_deserialize_user_configuration PASSED
test/unit/test_bootstrapper.py::test_site_multiple_classifications PASSED
test/unit/test_bootstrapper.py::test_env_classification PASSED
test/unit/test_bootstrapper.py::test_copy_aws_jars PASSED
test/unit/test_bootstrapper.py::test_bootstrap_smspark_submit PASSED
test/unit/test_bootstrapper.py::test_bootstrap_history_server PASSED
test/unit/test_bootstrapper.py::test_wait_for_hadoop PASSED
test/unit/test_bootstrapper.py::test_copy_cluster_config PASSED
test/unit/test_bootstrapper.py::test_start_hadoop_daemons_on_primary PASSED
test/unit/test_bootstrapper.py::test_start_hadoop_daemons_on_worker PASSED
test/unit/test_bootstrapper.py::test_spark_standalone_primary PASSED
test/unit/test_bootstrapper.py::test_set_regional_configs PASSED
test/unit/test_bootstrapper.py::test_set_regional_configs_empty PASSED
test/unit/test_bootstrapper.py::test_get_regional_configs_cn PASSED
test/unit/test_bootstrapper.py::test_get_regional_configs_gov PASSED
test/unit/test_bootstrapper.py::test_get_regional_configs_us PASSED
test/unit/test_bootstrapper.py::test_get_regional_configs_missing_region PASSED
test/unit/test_bootstrapper.py::test_load_processing_job_config PASSED
test/unit/test_bootstrapper.py::test_load_processing_job_config_fallback PASSED
test/unit/test_bootstrapper.py::test_load_instance_type_info PASSED
test/unit/test_bootstrapper.py::test_set_yarn_spark_resource_config PASSED
test/unit/test_bootstrapper.py::test_set_yarn_spark_resource_config_fallback PASSED
test/unit/test_bootstrapper.py::test_get_yarn_spark_resource_config PASSED
test/unit/test_cli.py::test_submit[missing APP arg should fail] PASSED
test/unit/test_cli.py::test_submit[invalid spark options should fail] PASSED
test/unit/test_cli.py::test_submit[happy path should pass] PASSED
test/unit/test_cli.py::test_submit[valid spark option should pass] PASSED
test/unit/test_cli.py::test_submit[single local jar should pass0] PASSED
test/unit/test_cli.py::test_submit[list of local jars should pass0] PASSED
test/unit/test_cli.py::test_submit[s3 url to jar should pass0] PASSED
test/unit/test_cli.py::test_submit[s3a url to jar should pass0] PASSED
test/unit/test_cli.py::test_submit[multiple s3 urls to jar should pass0] PASSED
test/unit/test_cli.py::test_submit[mixed s3 urls to jars and local paths should pass0] PASSED
test/unit/test_cli.py::test_submit[relative paths should fail0] PASSED
test/unit/test_cli.py::test_submit[nonexistent paths should fail0] PASSED
test/unit/test_cli.py::test_submit[directory with no files should fail0] PASSED
test/unit/test_cli.py::test_submit[single local jar should pass1] PASSED
test/unit/test_cli.py::test_submit[list of local jars should pass1] PASSED
test/unit/test_cli.py::test_submit[s3 url to jar should pass1] PASSED
test/unit/test_cli.py::test_submit[s3a url to jar should pass1] PASSED
test/unit/test_cli.py::test_submit[multiple s3 urls to jar should pass1] PASSED
test/unit/test_cli.py::test_submit[mixed s3 urls to jars and local paths should pass1] PASSED
test/unit/test_cli.py::test_submit[relative paths should fail1] PASSED
test/unit/test_cli.py::test_submit[nonexistent paths should fail1] PASSED
test/unit/test_cli.py::test_submit[directory with no files should fail1] PASSED
test/unit/test_cli.py::test_submit[single local jar should pass2] PASSED
test/unit/test_cli.py::test_submit[list of local jars should pass2] PASSED
test/unit/test_cli.py::test_submit[s3 url to jar should pass2] PASSED
test/unit/test_cli.py::test_submit[s3a url to jar should pass2] PASSED
test/unit/test_cli.py::test_submit[multiple s3 urls to jar should pass2] PASSED
test/unit/test_cli.py::test_submit[mixed s3 urls to jars and local paths should pass2] PASSED
test/unit/test_cli.py::test_submit[relative paths should fail2] PASSED
test/unit/test_cli.py::test_submit[nonexistent paths should fail2] PASSED
test/unit/test_cli.py::test_submit[directory with no files should fail2] PASSED
test/unit/test_cli.py::test_submit[quotes are handled correctly] PASSED
test/unit/test_config.py::test_core_site_xml PASSED
test/unit/test_config.py::test_hadoop_env_sh PASSED
test/unit/test_config.py::test_hadoop_log4j PASSED
test/unit/test_config.py::test_hive_env PASSED
test/unit/test_config.py::test_hive_log4j PASSED
test/unit/test_config.py::test_hive_exec_log4j PASSED
test/unit/test_config.py::test_hive_site PASSED
test/unit/test_config.py::test_spark_defaults_conf PASSED
test/unit/test_config.py::test_spark_env PASSED
test/unit/test_config.py::test_spark_log4j_properties PASSED
test/unit/test_config.py::test_spark_hive_site PASSED
test/unit/test_config.py::test_spark_metrics_properties PASSED
test/unit/test_config.py::test_yarn_env PASSED
test/unit/test_config.py::test_yarn_size PASSED
test/unit/test_errors.py::test_algorithm_error PASSED
test/unit/test_errors.py::test_exit PASSED
test/unit/test_history_server_cli.py::test_run_history_server PASSED
test/unit/test_history_server_cli.py::test_submit[When arguments are set, should be passed job manager] PASSED
test/unit/test_history_server_utils.py::test_config_history_server_with_env_variable spark.history.fs.logDirectory=s3://bucket/spark-events PASSED
test/unit/test_history_server_utils.py::test_config_history_server_without_env_variable PASSED
test/unit/test_history_server_utils.py::test_start_history_server PASSED
test/unit/test_nginx_utils.py::test_start_nginx PASSED
test/unit/test_nginx_utils.py::test_write_nginx_default_conf PASSED
test/unit/test_nginx_utils.py::test_write_nginx_default_conf_without_domain_name PASSED
test/unit/test_nginx_utils.py::test_copy_nginx_default_conf PASSED
test/unit/test_spark_event_logs_publisher.py::test_run_with_event_log_dir PASSED
test/unit/test_spark_event_logs_publisher.py::test_run_with_spark_events_s3_uri PASSED
test/unit/test_status.py::test_status_app PASSED
test/unit/test_status.py::test_status_server PASSED
test/unit/test_status.py::test_status_map_one_host PASSED
test/unit/test_status.py::test_status_map_multiple_hosts PASSED
test/unit/test_status.py::test_status_map_propagate_errors PASSED
test/unit/test_status.py::test_status_map_http_error PASSED
test/unit/test_waiter.py::test_waiter PASSED
test/unit/test_waiter.py::test_waiter_timeout PASSED
test/unit/test_waiter.py::test_waiter_pred_fn_errors PASSED
```

xiaoxshe commented 2 years ago

make install-container-library

No known security vulnerabilities found.

cixuuz commented 2 years ago

Hi, thanks for upgrading Spark to 3.2. I'd like to confirm which minor version of Spark this is. Is it Spark 3.2.1? Thanks!