Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

unstructured-ingest s3 command causes Fsspec.Downloader.download_config.download_dir to be None #3101

Open tuvalusoftware opened 1 month ago

tuvalusoftware commented 1 month ago

Running the command:

unstructured-ingest \
   s3 \
   --remote-url s3://anticythera/ \
   --anonymous \
   --output-dir /Users/anticythera/PycharmProjects/scientificProject/data/ \
   --num-processes 2

causes an error:

ERROR: /Users/anticythera/.cache/unstructured/ingest/pipeline/index/8485948ff856.json: [download] unsupported operand type(s) for /: 'NoneType' and 'PosixPath'

This results from Fsspec.Downloader.download_config.download_dir being None.

I am running macOS 14.5.
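For anyone trying to reproduce this outside the pipeline, the failure mode can be sketched in a few lines of plain Python. This is only an illustration, not the library's code: `download_dir`, `rel_path`, and the `resolve_download_path` helper with its temp-directory fallback are hypothetical names I made up, and the actual fix in unstructured may look different.

```python
import tempfile
from pathlib import Path
from typing import Optional

# Stand-ins for the values involved in the failing expression (hypothetical names).
download_dir: Optional[Path] = None   # what download_config.download_dir appears to be
rel_path = Path("example.pdf")        # placeholder for source_identifiers.rel_path

# Joining None with a Path via "/" raises the same TypeError seen in the report.
try:
    download_dir / rel_path
except TypeError as exc:
    error_message = str(exc)
    print(error_message)


def resolve_download_path(download_dir: Optional[Path], rel_path: Path) -> Path:
    """Join rel_path onto download_dir, falling back to a temp directory.

    The fallback location is an assumption made for this sketch only.
    """
    if download_dir is None:
        base = Path(tempfile.gettempdir()) / "unstructured-downloads"
    else:
        base = Path(download_dir)
    return base / rel_path


print(resolve_download_path(None, rel_path))
```

The point of the helper is only that a None check before the `/` join avoids the crash; where the default directory should come from is a design decision for the maintainers.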

Stack Trace:

2024-05-26 06:55:10,617 MainProcess INFO     Calling DownloadStep with 1 docs
INFO: Calling DownloadStep with 1 docs
2024-05-26 06:55:10,617 MainProcess INFO     processing content async
INFO: processing content async
2024-05-26 06:55:10,619 MainProcess ERROR    Exception raised while running download
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.11/site-packages/unstructured/ingest/v2/pipeline/interfaces.py", line 97, in run_async
    return await self._run_async(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.11/site-packages/unstructured/ingest/v2/pipeline/steps/download.py", line 84, in _run_async
    download_path = self.process.get_download_path(file_data=file_data)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.11/site-packages/unstructured/ingest/v2/processes/connectors/fsspec/fsspec.py", line 240, in get_download_path
    self.download_config.download_dir / Path(file_data.source_identifiers.rel_path)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
TypeError: unsupported operand type(s) for /: 'NoneType' and 'PosixPath'
ERROR: Exception raised while running download
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.11/site-packages/unstructured/ingest/v2/pipeline/interfaces.py", line 97, in run_async
    return await self._run_async(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.11/site-packages/unstructured/ingest/v2/pipeline/steps/download.py", line 84, in _run_async
    download_path = self.process.get_download_path(file_data=file_data)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.11/site-packages/unstructured/ingest/v2/processes/connectors/fsspec/fsspec.py", line 240, in get_download_path
    self.download_config.download_dir / Path(file_data.source_identifiers.rel_path)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
TypeError: unsupported operand type(s) for /: 'NoneType' and 'PosixPath'
2024-05-26 06:55:10,622 MainProcess ERROR    1 failed documents:
ERROR: 1 failed documents:
2024-05-26 06:55:10,622 MainProcess ERROR    /Users/anticythera/.cache/unstructured/ingest/pipeline/index/8485948ff856.json: [download] unsupported operand type(s) for /: 'NoneType' and 'PosixPath'
ERROR: /Users/anticythera/.cache/unstructured/ingest/pipeline/index/8485948ff856.json: [download] unsupported operand type(s) for /: 'NoneType' and 'PosixPath'

MthwRobinson commented 1 month ago

Thanks @tuvalusoftware - we'll take a look at this as soon as we're able.