frictionlessdata / frictionless-py

Data management framework for Python that provides functionality to describe, extract, validate, and transform tabular data
https://framework.frictionlessdata.io
MIT License

Cannot read datapackage from s3 #1596

Open barbuz opened 11 months ago

barbuz commented 11 months ago

Overview

I want to use Frictionless data packages to provide metadata about some collections hosted on S3, but I run into errors when reading these files. I can load the data as a Resource, and I can even validate it against a local table schema, but loading the data package itself fails with the following error:

>>> pak = frictionless.Package('s3://rimrep-data-public-development/csiro-seltmp-baseline-surveys-jul22/datapackage.json')
Traceback (most recent call last):
  File "/home/leo/miniconda3/lib/python3.10/site-packages/frictionless/metadata.py", line 306, in metadata_retrieve
    response = session.get(descriptor, stream=True)
  File "/home/leo/miniconda3/lib/python3.10/site-packages/requests/sessions.py", line 600, in get
    return self.request("GET", url, **kwargs)
  File "/home/leo/miniconda3/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/leo/miniconda3/lib/python3.10/site-packages/requests/sessions.py", line 695, in send
    adapter = self.get_adapter(url=request.url)
  File "/home/leo/miniconda3/lib/python3.10/site-packages/requests/sessions.py", line 792, in get_adapter
    raise InvalidSchema(f"No connection adapters were found for {url!r}")
requests.exceptions.InvalidSchema: No connection adapters were found for 's3://rimrep-data-public-development/csiro-seltmp-baseline-surveys-jul22/datapackage.json'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/leo/miniconda3/lib/python3.10/site-packages/frictionless/package/factory.py", line 38, in __call__
    cls.from_descriptor(source, basepath=basepath, **options),  # type: ignore
  File "/home/leo/miniconda3/lib/python3.10/site-packages/frictionless/metadata.py", line 162, in from_descriptor
    descriptor = cls.metadata_retrieve(descriptor)
  File "/home/leo/miniconda3/lib/python3.10/site-packages/frictionless/metadata.py", line 324, in metadata_retrieve
    raise FrictionlessException(Error(note=note)) from exception
frictionless.exception.FrictionlessException: [package-error] The data package has an error: cannot retrieve metadata "s3://rimrep-data-public-development/csiro-seltmp-baseline-surveys-jul22/datapackage.json" because "No connection adapters were found for 's3://rimrep-data-public-development/csiro-seltmp-baseline-surveys-jul22/datapackage.json'"
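
For context, the first traceback ends inside requests, which by default only mounts connection adapters for the http:// and https:// prefixes, so any other scheme raises InvalidSchema before a network request is made. A minimal sketch of that prefix lookup, using only the standard library (`has_default_adapter` is a hypothetical helper for illustration, not part of requests or frictionless):

```python
# Prefixes that a default requests.Session mounts adapters for.
DEFAULT_PREFIXES = ("https://", "http://")


def has_default_adapter(url: str) -> bool:
    """Return True if a default requests session could handle this URL.

    Hypothetical helper that mirrors the case-insensitive prefix match
    requests performs when choosing a connection adapter.
    """
    return url.lower().startswith(DEFAULT_PREFIXES)


s3_url = (
    "s3://rimrep-data-public-development/"
    "csiro-seltmp-baseline-surveys-jul22/datapackage.json"
)
print(has_default_adapter(s3_url))
print(has_default_adapter("https://framework.frictionlessdata.io"))
```

This suggests the package descriptor is being fetched through plain requests rather than through an S3-aware loader, which would explain why the same bucket works when read as a Resource.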

I have also tried opening a local copy of the datapackage with its resource path pointing to s3://rimrep-data-public-development/csiro-seltmp-baseline-surveys-jul22/data.parquet/part.0.parquet, but then the validation fails with:

>>> pak.validate()
{'valid': False,
 'stats': {'tasks': 1, 'errors': 1, 'warnings': 0, 'seconds': 0.057},
 'warnings': [],
 'errors': [],
 'tasks': [{'name': 'data',
            'type': 'table',
            'valid': False,
            'place': 's3://rimrep-data-public-development/csiro-seltmp-baseline-surveys-jul22/data.parquet/part.0.parquet',
            'labels': [],
            'stats': {'errors': 1, 'warnings': 0, 'seconds': 0.026},
            'warnings': [],
            'errors': [{'type': 'source-error',
                        'title': 'Source Error',
                        'description': 'Data reading error because of not '
                                       'supported or inconsistent contents.',
                        'message': 'The data source has not supported or has '
                                   'inconsistent contents: '
                                   's3://rimrep-data-public-development/csiro-seltmp-baseline-surveys-jul22/data.parquet/part.0.parquet',
                        'tags': [],
                        'note': 's3://rimrep-data-public-development/csiro-seltmp-baseline-surveys-jul22/data.parquet/part.0.parquet'}]}]}

Finally, I experimented with the CLI and hit the same errors there. Validating the remote data against a local tableschema.json file worked, but when the table schema was also hosted on S3, I got the error: "No connection adapters were found for 's3://rimrep-data-public-development/csiro-seltmp-baseline-surveys-jul22/tableschema.json'".

All the files used here should be public, so you can try replicating the issue. Please let me know if I'm doing something wrong or if this is an actual bug.
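
A possible workaround while s3:// descriptors cannot be retrieved: because the files are public, the same objects can usually be addressed over HTTPS, a scheme the metadata loader does go through requests for. The sketch below assumes a publicly readable bucket whose name contains no dots and uses the virtual-hosted-style S3 URL form. `s3_to_https` is a hypothetical helper, and the resulting URL has not been verified against this bucket.

```python
from urllib.parse import quote, urlsplit


def s3_to_https(url: str) -> str:
    """Map s3://bucket/key to a virtual-hosted-style HTTPS URL.

    Assumption: the object is publicly readable, so no request signing
    is needed. Hypothetical helper, not part of frictionless or boto3.
    """
    parts = urlsplit(url)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"not an s3:// URL: {url!r}")
    # Percent-encode the object key but keep "/" as the path separator.
    key = quote(parts.path.lstrip("/"), safe="/")
    return f"https://{parts.netloc}.s3.amazonaws.com/{key}"


descriptor_url = s3_to_https(
    "s3://rimrep-data-public-development/"
    "csiro-seltmp-baseline-surveys-jul22/datapackage.json"
)
print(descriptor_url)
```

The converted URL could then be passed to frictionless.Package(...) in place of the s3:// one. Whether resource paths inside the descriptor resolve correctly from there would still need checking.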

PeterBaker0 commented 10 months ago

We have a similar use case.

I replicated this issue and tried various combinations, but couldn't get it to resolve correctly.

Does the AWS plugin expose everything needed to validate a whole data package, or does it only work at the Resource level, as in the guide here? https://framework.frictionlessdata.io/docs/schemes/aws.html

roll commented 9 months ago

Thanks for reporting!