aws-samples / amazon-textract-textractor

Analyze documents with Amazon Textract and generate output in multiple formats.
Apache License 2.0

Cryptic CLI error in SageMaker Studio (and probably other role-based environments?) #352

Open athewsey opened 5 months ago

athewsey commented 5 months ago

Hi team,

I was surprised to find today that the commands below fail in the default Python notebook kernel of a SageMaker Studio JupyterLab space, even though the notebook's IAM execution role has all the necessary permissions:

%pip install amazon-textract-textractor

!textractor start-document-analysis \
    --features LAYOUT --features TABLES \
    --s3-upload-path {s3_upload_uri} \
    --s3-output-path {s3_output_uri} \
    data/my-cool-document.pdf

Actual behaviour

When neither --region-name nor --profile-name is set, the CLI sets the profile to "default", which causes the error below:

Traceback (most recent call last):
  File "/opt/conda/bin/textractor", line 8, in <module>
    sys.exit(textractor_cli())
  File "/opt/conda/lib/python3.10/site-packages/textractor/cli/cli.py", line 347, in textractor_cli
    extractor = Textractor(
  File "/opt/conda/lib/python3.10/site-packages/textractor/textractor.py", line 90, in __init__
    self.session = boto3.session.Session(profile_name=self.profile_name)
  File "/opt/conda/lib/python3.10/site-packages/boto3/session.py", line 90, in __init__
    self._setup_loader()
  File "/opt/conda/lib/python3.10/site-packages/boto3/session.py", line 131, in _setup_loader
    self._loader = self._session.get_component('data_loader')
  File "/opt/conda/lib/python3.10/site-packages/botocore/session.py", line 802, in get_component
    return self._components.get_component(name)
  File "/opt/conda/lib/python3.10/site-packages/botocore/session.py", line 1140, in get_component
    self._components[name] = factory()
  File "/opt/conda/lib/python3.10/site-packages/botocore/session.py", line 199, in <lambda>
    lambda: create_loader(self.get_config_variable('data_path')),
  File "/opt/conda/lib/python3.10/site-packages/botocore/session.py", line 323, in get_config_variable
    return self.get_component('config_store').get_config_variable(
  File "/opt/conda/lib/python3.10/site-packages/botocore/configprovider.py", line 465, in get_config_variable
    return provider.provide()
  File "/opt/conda/lib/python3.10/site-packages/botocore/configprovider.py", line 671, in provide
    value = provider.provide()
  File "/opt/conda/lib/python3.10/site-packages/botocore/configprovider.py", line 761, in provide
    scoped_config = self._session.get_scoped_config()
  File "/opt/conda/lib/python3.10/site-packages/botocore/session.py", line 422, in get_scoped_config
    raise ProfileNotFound(profile=profile_name)
botocore.exceptions.ProfileNotFound: The config profile (default) could not be found

Expected behaviour

In this environment the AWS_REGION environment variable is set automatically, but no AWS CLI profiles are configured. I suggest a better default behaviour would be to auto-discover the region from environment variables when present (e.g. os.environ.get("AWS_REGION")) and leave the profile alone?
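To make the suggestion concrete, here is a rough sketch of how the session arguments could be resolved. It is not the library's actual code: `resolve_session_kwargs` is a hypothetical helper name, and the fallback to `AWS_DEFAULT_REGION` is my own assumption on top of the `AWS_REGION` lookup proposed above. The idea is to pass `profile_name` only when the user gave one, so boto3 can fall back to its normal credential chain (e.g. the execution role).

```python
import os
from typing import Mapping, Optional


def resolve_session_kwargs(
    profile_name: Optional[str] = None,
    region_name: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> dict:
    """Build keyword arguments for boto3.session.Session (hypothetical helper).

    `env` defaults to os.environ; it is a parameter only to ease testing.
    """
    env = os.environ if env is None else env
    kwargs = {}

    # Only forward a profile the user explicitly asked for. Forcing
    # "default" fails with ProfileNotFound when no profiles are configured.
    if profile_name:
        kwargs["profile_name"] = profile_name

    # An explicit region wins; otherwise look at the environment.
    # AWS_DEFAULT_REGION is an assumed extra fallback, not part of the
    # original suggestion.
    region = region_name or env.get("AWS_REGION") or env.get("AWS_DEFAULT_REGION")
    if region:
        kwargs["region_name"] = region

    return kwargs
```

The result would then be passed along as `boto3.session.Session(**resolve_session_kwargs(...))`, so that an environment with only `AWS_REGION` set gets a region and no profile.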

Belval commented 5 months ago

I agree that this is counter-intuitive. We could do the same thing for extractor = Textractor().