awslabs / python-deequ

Python API for Deequ
Apache License 2.0

Issue in running Metric Repository from Jupyter notebook hosted on JupyterHub #5

Closed naveencha closed 3 years ago

naveencha commented 3 years ago

I ran the code below in a Jupyter Notebook hosted in a JupyterHub environment running as a container:

from pydeequ.repository import *
from pydeequ.analyzers import *

metrics_file = FileSystemMetricsRepository.helper_metrics_file(spark, 'metrics1.json')
repository = FileSystemMetricsRepository(spark, metrics_file)
key_tags = {'tag': 'pydeequ hello world'}
resultKey = ResultKey(spark, ResultKey.current_milli_time(), key_tags)

analysisResult = AnalysisRunner(spark) \
    .onData(df_spark) \
    .addAnalyzer(Size()) \
    .useRepository(repository) \
    .saveOrAppendResult(resultKey) \
    .run()

Expected behavior: This works fine on my local machine; it should create a repository file and store the Size value in it.

Actual output: An error is displayed on the page.

Py4JJavaError: An error occurred while calling o62.run. : java.net.NoRouteToHostException: No Route to Host from ip----/... to ip---**-***.us-west-2.compute.internal:8020 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see: http://wiki.apache.org/hadoop/NoRouteToHost at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792) at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:758) at org.apache.hadoop.ipc.Client.call(Client.java:1479) at org.apache.hadoop.ipc.Client.call(Client.java:1412) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) at com.sun.proxy.$Proxy26.getFileInfo(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy27.getFileInfo(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2108) at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305) at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301) at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317) at org.apache.hadoop.fs.FileSystem.isFile(FileSystem.java:1452) at com.amazon.deequ.repository.fs.FileSystemMetricsRepository$.readFromFileOnDfs(FileSystemMetricsRepository.scala:203) at com.amazon.deequ.repository.fs.FileSystemMetricsRepositoryMultipleResultsLoader.get(FileSystemMetricsRepository.scala:130) at com.amazon.deequ.repository.fs.FileSystemMetricsRepository.loadByKey(FileSystemMetricsRepository.scala:67) at com.amazon.deequ.analyzers.runners.AnalysisRunner$$anonfun$saveOrAppendResultsIfNecessary$1$$anonfun$apply$1.apply(AnalysisRunner.scala:225) at com.amazon.deequ.analyzers.runners.AnalysisRunner$$anonfun$saveOrAppendResultsIfNecessary$1$$anonfun$apply$1.apply(AnalysisRunner.scala:223) at scala.Option.foreach(Option.scala:257) at com.amazon.deequ.analyzers.runners.AnalysisRunner$$anonfun$saveOrAppendResultsIfNecessary$1.apply(AnalysisRunner.scala:223) at com.amazon.deequ.analyzers.runners.AnalysisRunner$$anonfun$saveOrAppendResultsIfNecessary$1.apply(AnalysisRunner.scala:222) at scala.Option.foreach(Option.scala:257) at com.amazon.deequ.analyzers.runners.AnalysisRunner$.saveOrAppendResultsIfNecessary(AnalysisRunner.scala:222) at com.amazon.deequ.analyzers.runners.AnalysisRunner$.doAnalysisRun(AnalysisRunner.scala:199) at com.amazon.deequ.analyzers.runners.AnalysisRunBuilder.run(AnalysisRunBuilder.scala:101) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at 
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) Caused by: java.net.NoRouteToHostException: No route to host at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495) at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:614) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:712) at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528) at org.apache.hadoop.ipc.Client.call(Client.java:1451) ... 40 more
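The NoRouteToHostException targets the cluster's HDFS namenode (port 8020), which suggests the bare path 'metrics1.json' is being resolved against Hadoop's fs.defaultFS rather than the driver's local disk. The helper below is purely illustrative (it is not a pydeequ or Hadoop API); it only sketches that resolution rule:

```python
from urllib.parse import urlparse

def resolves_to_local(path, default_fs="hdfs://namenode:8020"):
    """Illustrative helper (not a pydeequ/Hadoop API): a path without an
    explicit scheme is resolved against the cluster's fs.defaultFS, so a
    bare 'metrics1.json' ends up on HDFS, not on the driver's local disk.
    The default_fs value here is a hypothetical example."""
    scheme = urlparse(path).scheme or urlparse(default_fs).scheme
    return scheme == "file"

print(resolves_to_local("metrics1.json"))              # → False: resolves to HDFS
print(resolves_to_local("file:///tmp/metrics1.json"))  # → True: stays on local disk
```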


Additional context: This works fine on a local standalone Jupyter instance.

gucciwang commented 3 years ago

Hello! Can you provide a little more information on the environment you're running it in? We have tested and confirmed that SageMaker Notebook instances run it without a problem, but we have not tested JupyterHub running as a container.

Furthermore, could you try running this metrics repository example in your current environment to help determine whether it's an environment issue?

My guess, based on your description of running in a "JupyterHub environment as a container," is that the container has restrictions that prevent creating the local Hadoop cache for the run. Perhaps you can try an alternate metrics repository, the InMemoryMetricsRepository. Here's an example of how you'd run it from your code:

from pydeequ.repository import *
from pydeequ.analyzers import *

repository = InMemoryMetricsRepository(spark)
key_tags = {'tag': 'pydeequ hello world'}
resultKey = ResultKey(spark, ResultKey.current_milli_time(), key_tags)

analysisResult = AnalysisRunner(spark) \
    .onData(df_spark) \
    .addAnalyzer(Size()) \
    .useRepository(repository) \
    .saveOrAppendResult(resultKey) \
    .run()
naveencha commented 3 years ago

InMemoryMetricsRepository is working fine. FileSystemMetricsRepository is still causing problems: it stores the metrics JSON file at the /tmp/ location, but it is not able to read the repository back when using it.
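If the path resolution described above is the cause, one possible workaround (a sketch, untested in this environment, and assuming FileSystemMetricsRepository accepts a fully qualified URI) is to pass an explicit file:// URI instead of relying on helper_metrics_file, so the path cannot resolve to an unreachable namenode:

```python
import os

# Hypothetical workaround sketch: build an explicit file:// URI so the
# repository path is not resolved against fs.defaultFS. The /tmp path
# mirrors where the file reportedly lands; adjust as needed.
metrics_file = "file://" + os.path.abspath("/tmp/metrics1.json")
print(metrics_file)  # → file:///tmp/metrics1.json
# The actual pydeequ call would then be (requires a running SparkSession):
# repository = FileSystemMetricsRepository(spark, metrics_file)
```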

gucciwang commented 3 years ago

Can you provide more details on what code is failing? I am unsure how you are trying to "read" the repository.

In the FileSystemMetricsRepository tutorial, we read results from a repository like so after an analysis run:

analysisResult_metRep = repository.load() \
                            .before(ResultKey.current_milli_time()) \
                            .getSuccessMetricsAsDataFrame()

analysisResult_metRep.show()
gucciwang commented 3 years ago

Closing due to inactivity -- reopen if you still need help! :)