Azure / MachineLearningNotebooks

Python notebooks with ML and deep learning examples with Azure Machine Learning Python SDK | Microsoft
https://docs.microsoft.com/azure/machine-learning/service/
MIT License

Key not found "ADLSGen2" when using `to_spark_dataframe` #1503

Open · malthe opened this issue 3 years ago

malthe commented 3 years ago

I'm creating a dataset directly using a URL (relying on identity-based access):

from azureml.core import Dataset

dataset = Dataset.Tabular.from_parquet_files("https://<account>.dfs.core.windows.net/<path>")

(This prompts my browser to start a login process.)

While dataset.to_pandas_dataframe() works fine, dataset.to_spark_dataframe() fails with the following Java traceback:

: java.util.NoSuchElementException: key not found: ADLSGen2
    at scala.collection.MapLike.default(MapLike.scala:235)
    at scala.collection.MapLike.default$(MapLike.scala:234)
    at scala.collection.AbstractMap.default(Map.scala:63)
    at scala.collection.MapLike.apply(MapLike.scala:144)
    at scala.collection.MapLike.apply$(MapLike.scala:143)
    at scala.collection.AbstractMap.apply(Map.scala:63)
    at com.microsoft.dprep.io.StreamInfoFileSystem$.toFileSystemPath(StreamInfoFileSystem.scala:68)
    at com.microsoft.dprep.execution.Storage$.expandHdfsPath(Storage.scala:37)
    at com.microsoft.dprep.execution.executors.GetFilesExecutor$.$anonfun$getFiles$1(GetFilesExecutor.scala:18)
    at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
    at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
    at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
    at com.microsoft.dprep.execution.executors.GetFilesExecutor$.getFiles(GetFilesExecutor.scala:12)
    at com.microsoft.dprep.execution.LariatDataset$.getFiles(LariatDataset.scala:32)
    at com.microsoft.dprep.execution.PySparkExecutor.getFiles(PySparkExecutor.scala:225)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.base/java.lang.Thread.run(Thread.java:834)

This is using "com.microsoft.ml.spark:mmlspark_2.12:1.0.0-rc3-62-25d40cff-SNAPSHOT" and PySpark 3.1.2.

What might cause this error?

The Java code is called from a generated Python module, which shows where the "ADLSGen2" key comes from; judging from the traceback, StreamInfoFileSystem.toFileSystemPath looks that handler name up in a map and fails because "ADLSGen2" is not among its keys:

# ...
lds0 = jex.getFiles(
    [{"searchPattern":"https://<account>.dfs.core.windows.net/<path>",
       "handler":"ADLSGen2",
       "arguments":{"credential":""}
    }], 
    secrets
)
# ...
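For what it's worth, a possible stopgap for data small enough to fit in driver memory is to materialize the dataset with pandas and hand the result to Spark. This is only a sketch; it assumes an active SparkSession and that the identity-based access used by to_pandas_dataframe() keeps working:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Materialize on the driver via pandas (this path works with identity-based
# access), then convert to a Spark DataFrame; only viable for small data.
pdf = dataset.to_pandas_dataframe()
sdf = spark.createDataFrame(pdf)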
ynpandey commented 3 years ago

As of June 2021, dataset.to_spark_dataframe() does not support ADLS Gen2. This support will be available in the near future.
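Until that support lands, a possible alternative is to bypass the Dataset abstraction and have Spark read the Parquet files directly over ABFS. This is only a sketch: it assumes the cluster has the Hadoop ABFS driver available and that storage credentials (or a passthrough identity) are configured on the Spark session; <container>, <account>, and <path> are placeholders.

# Read the Parquet files directly with Spark, bypassing
# dataset.to_spark_dataframe(); requires the abfss:// filesystem and
# storage credentials to be configured on the cluster.
df = spark.read.parquet("abfss://<container>@<account>.dfs.core.windows.net/<path>")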