
Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Flink: Make Hadoop an optional dependency #7332

Open Fokko opened 1 year ago

Fokko commented 1 year ago

Feature Request / Improvement

While playing around with pyflink, I noticed that the Hadoop dependency is required even when using the REST catalog:

➜  ~ python3.9                             
Python 3.9.16 (main, Dec  7 2022, 10:06:04) 
[Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> 
>>> from pyflink.datastream import StreamExecutionEnvironment
>>> 
>>> env = StreamExecutionEnvironment.get_execution_environment()
>>> iceberg_flink_runtime_jar = "/Users/fokkodriesprong/Desktop/iceberg/flink/v1.17/flink-runtime/build/libs/iceberg-flink-runtime-1.17-1.3.0-SNAPSHOT.jar"
>>> 
>>> env.add_jars("file://{}".format(iceberg_flink_runtime_jar))
>>> 
>>> from pyflink.table import StreamTableEnvironment
>>> 
>>> table_env = StreamTableEnvironment.create(env)
>>> 
>>> table_env.execute_sql("""
... CREATE CATALOG tabular WITH (
...     'type'='iceberg', 
...     'catalog-type'='rest',
...     'uri'='https://api.tabular.io/ws',
...     'credential'='t-tcEe4Ihp4eM:pyTlx_4ayKV7N54gXuBmMotVFLU'
... )
... """)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/homebrew/lib/python3.9/site-packages/pyflink/table/table_environment.py", line 837, in execute_sql
    return TableResult(self._j_tenv.executeSql(stmt))
  File "/opt/homebrew/lib/python3.9/site-packages/py4j/java_gateway.py", line 1322, in __call__
    return_value = get_return_value(
  File "/opt/homebrew/lib/python3.9/site-packages/pyflink/util/exceptions.py", line 146, in deco
    return f(*a, **kw)
  File "/opt/homebrew/lib/python3.9/site-packages/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o23.executeSql.
: java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
    at org.apache.iceberg.flink.FlinkCatalogFactory.clusterHadoopConf(FlinkCatalogFactory.java:211)
    at org.apache.iceberg.flink.FlinkCatalogFactory.createCatalog(FlinkCatalogFactory.java:139)
    at org.apache.flink.table.factories.FactoryUtil.createCatalog(FactoryUtil.java:414)
    at org.apache.flink.table.api.internal.TableEnvironmentImpl.createCatalog(TableEnvironmentImpl.java:1466)
    at org.apache.flink.table.api.internal.TableEnvironmentImpl.executeInternal(TableEnvironmentImpl.java:1212)
    at org.apache.flink.table.api.internal.TableEnvironmentImpl.executeSql(TableEnvironmentImpl.java:765)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at org.apache.flink.api.python.shaded.py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at org.apache.flink.api.python.shaded.py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
    at org.apache.flink.api.python.shaded.py4j.Gateway.invoke(Gateway.java:282)
    at org.apache.flink.api.python.shaded.py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at org.apache.flink.api.python.shaded.py4j.commands.CallCommand.execute(CallCommand.java:79)
    at org.apache.flink.api.python.shaded.py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
    at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
    at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
    ... 17 more

When using a Hadoop or Hive catalog this dependency makes perfect sense, but it would be nice to make it optional when using the REST catalog.
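The stack trace shows the failure originates in FlinkCatalogFactory.clusterHadoopConf, which unconditionally constructs a Hadoop Configuration even for catalogs that never use it. One way the hard dependency could be relaxed (a sketch under assumptions, not the actual FlinkCatalogFactory code; the class and method names below are hypothetical) is to probe for Hadoop's Configuration class via reflection and only instantiate it when it is actually on the classpath:

```java
// Sketch: resolve Hadoop's Configuration class lazily so it is only
// required when a catalog actually needs it. Names are illustrative,
// not Iceberg's real API.
public class OptionalHadoop {

  private static final String HADOOP_CONF_CLASS = "org.apache.hadoop.conf.Configuration";

  /** Returns true when Hadoop is present on the classpath. */
  public static boolean hadoopAvailable() {
    try {
      // initialize=false: just check visibility, don't run static initializers
      Class.forName(HADOOP_CONF_CLASS, false, OptionalHadoop.class.getClassLoader());
      return true;
    } catch (ClassNotFoundException e) {
      return false;
    }
  }

  /**
   * Reflectively instantiates a Hadoop Configuration, or returns null when
   * Hadoop is absent; a REST catalog could treat null as "no Hadoop conf".
   */
  public static Object clusterHadoopConfOrNull() {
    if (!hadoopAvailable()) {
      return null;
    }
    try {
      return Class.forName(HADOOP_CONF_CLASS).getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Hadoop is on the classpath but could not be instantiated", e);
    }
  }

  public static void main(String[] args) {
    // Without hadoop-common on the classpath this prints false / null
    // instead of throwing NoClassDefFoundError as in the trace above.
    System.out.println("hadoop available: " + hadoopAvailable());
    System.out.println("conf: " + clusterHadoopConfOrNull());
  }
}
```

With a guard like this, a REST catalog could skip Hadoop entirely, while Hadoop and Hive catalogs would fail with a clear message only when they genuinely need the missing class.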

Query engine

Flink

Fokko commented 1 year ago

@rdblue Any chance we can get this in? The Parquet 1.13.1 release is just shy of one binding vote: https://lists.apache.org/thread/0yokbmfcbhz76ftjbxktwxfo5vrt57od

sfsf9797 commented 1 year ago

Hi, I am encountering this issue as well. Is there any temporary workaround I can use to make it work?

github-actions[bot] commented 9 months ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] commented 8 months ago

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.