Azure / MachineLearningNotebooks

Python notebooks with ML and deep learning examples with Azure Machine Learning Python SDK | Microsoft
https://docs.microsoft.com/azure/machine-learning/service/
MIT License

OutputFileDataset requires libfuse as dependency? #1203

Open BSofo opened 4 years ago

BSofo commented 4 years ago

I recently upgraded azureml-sdk to 1.16.0 and was running my PythonScriptStep (roll.py) on Spark with a dataset input, but I kept getting errors about libfuse.

70_driver_log.txt

Logging warning in history service: ERROR:: Dataset  failed. . Exception Details:Traceback (most recent call last):
  File "/mnt/batch/tasks/shared/LS_root/jobs/avadevitsmlsvc/azureml/4ef30c7f-cf30-48a3-9d0c-a0e6882958ef/mounts/workspaceblobstore/azureml/4ef30c7f-cf30-48a3-9d0c-a0e6882958ef/azureml-setup/context_managers.py", line 385, in __enter__
    self.datasets.__enter__()
  File "/azureml-envs/azureml_29037f86d36e2b5b4f047e45f790fdb9/lib/python3.6/site-packages/azureml/data/context_managers.py", line 108, in __enter__
    self._mount_or_download(key, data_configuration)
  File "/azureml-envs/azureml_29037f86d36e2b5b4f047e45f790fdb9/lib/python3.6/site-packages/azureml/data/context_managers.py", line 175, in _mount_or_download
    self._mount_readonly(name, dataset, target_path)
  File "/azureml-envs/azureml_29037f86d36e2b5b4f047e45f790fdb9/lib/python3.6/site-packages/azureml/data/context_managers.py", line 193, in _mount_readonly
    mount_options = dataprep_fuse().MountOptions(free_space_required=free_space_required)
  File "/azureml-envs/azureml_29037f86d36e2b5b4f047e45f790fdb9/lib/python3.6/site-packages/azureml/data/_dataprep_helper.py", line 46, in dataprep_fuse
    import azureml.dataprep.fuse.dprepfuse as _dprep_fuse
  File "/azureml-envs/azureml_29037f86d36e2b5b4f047e45f790fdb9/lib/python3.6/site-packages/azureml/dataprep/fuse/dprepfuse.py", line 4, in <module>
    from ._filecache import FileCache
  File "/azureml-envs/azureml_29037f86d36e2b5b4f047e45f790fdb9/lib/python3.6/site-packages/azureml/dataprep/fuse/_filecache.py", line 9, in <module>
    from .vendor.fuse import FuseOSError
  File "/azureml-envs/azureml_29037f86d36e2b5b4f047e45f790fdb9/lib/python3.6/site-packages/azureml/dataprep/fuse/vendor/fuse.py", line 115, in <module>
    raise EnvironmentError('Unable to find libfuse')
OSError: Unable to find libfuse

Here is my spark-requirements.txt file:

azureml-dataprep[blobfuse]==2.4.0
fusepy==3.0.1
pyspark==2.4.4
pandas==0.25.3
pip==19.3.1
numpy==1.18.1

These are the packages in my local environment.

And here is a snippet of my pipeline with the spark config and PythonScriptStep:

# Environment set up for PySpark Compute
spark_env = plf.get_environment(env_name='spark_env',
                                req_path=os.path.join(os.getcwd(), 'compute/aml_config/spark-requirements.txt'),
                                enable_docker=True,
                                docker_base_image='microsoft/mmlspark:0.16',
                                url=index_url + ' ' + mlpackage_req)

# use pyspark framework
spark_run_config = RunConfiguration(framework="pyspark")
spark_run_config.environment = spark_env

roll_step = PythonScriptStep(
    name='roll.py',
    script_name='roll.py',
    arguments=['--input_dir', joined_data,
               '--output_dir', rolled_data,
               '--script_dir', '.',
               '--min_date', '2015-06-30',
               '--pct_rank', 'True'],
    compute_target=compute_target_spark,
    inputs=[joined_data],
    outputs=[rolled_data],
    runconfig=spark_run_config,
    source_directory=os.path.join(os.getcwd(), 'compute', 'roll'),
    allow_reuse=pipeline_reuse
)
lostmygithubaccount commented 4 years ago

I believe you'll need to update libfuse in the base docker image or otherwise install it in the environment you're using.
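A rough sketch of what that could look like, assuming the azureml-core Environment API and Ubuntu-style package names (the exact packages, e.g. fuse vs. libfuse-dev, may differ for your base image):

from azureml.core import Environment

# Hypothetical Dockerfile that extends the existing base image and installs
# libfuse so the dataset mount code can find the native library at run time.
dockerfile = """
FROM microsoft/mmlspark:0.16
RUN apt-get update && apt-get install -y --no-install-recommends fuse libfuse-dev && rm -rf /var/lib/apt/lists/*
"""

spark_env = Environment(name="spark_env")
spark_env.docker.enabled = True
spark_env.docker.base_image = None          # build from the Dockerfile instead of a fixed base image
spark_env.docker.base_dockerfile = dockerfile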

dataders commented 4 years ago

@lostmygithubaccount thanks for jumping in! What's weird is that we've been using MMLSpark's Docker image, microsoft/mmlspark:0.16, as the base image for a PythonScriptStep that reads from and writes to PipelineData for about a year now. But now that we're trying to use OutputFileDatasetConfig, our base docker image requires libfuse? I thought it was installed automatically as part of the run setup... cc: @MayMSFT

lostmygithubaccount commented 4 years ago

I see - so it sounds like PipelineData did not require libfuse, but the new OutputFileDatasetConfig does. I'll let @MayMSFT and the data team comment.

dataders commented 4 years ago

@rongduan-zhu

rongduan-zhu commented 4 years ago

@swanderz the reason we need libfuse is that we mount the dataset inside the user's container. We are working on a new architecture to remove this restriction, but until then libfuse is needed in the docker image in order for mount to work.

Looking at the stack trace above, it looks like the input is also a dataset and its mode is set to mount. One way to avoid this issue is to use download mode for inputs and upload mode for outputs. The downside is that this won't work if all your data doesn't fit on disk, and if the job doesn't complete successfully, no data will be uploaded.
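A minimal sketch of that workaround (the workspace, datastore, and path names below are illustrative, not taken from the original pipeline):

from azureml.core import Workspace, Dataset
from azureml.data import OutputFileDatasetConfig

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Input: download the files to the node's local disk instead of FUSE-mounting them.
joined_data = (Dataset.File.from_files(path=(datastore, 'joined/'))
               .as_named_input('joined_data')
               .as_download())

# Output: the script writes to a local path; the contents are uploaded to the
# datastore only after the run completes successfully.
rolled_data = (OutputFileDatasetConfig(name='rolled_data',
                                       destination=(datastore, 'rolled/'))
               .as_upload(overwrite=True))

The step script then reads and writes plain local paths, so no libfuse is required in the image.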

meyetman commented 4 years ago

Closing this issue #please-close.

lostmygithubaccount commented 3 years ago

reopen per new policy - is this fixed?