aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0
3.92k stars, 699 forks

Moving feature specific dependencies to optional installs in 3.0? #1983

Closed jmahlik closed 1 year ago

jmahlik commented 1 year ago

Is your idea related to a problem? Please describe.

There are quite a few issues related to dependencies with the current installation setup. In our organization, the dependency conflicts have reached the point where we are considering removing aws-sdk-pandas as a dependency and re-implementing some of the functionality.

I (and likely others) am hoping this situation can be improved upon, because aws-sdk-pandas IS an awesome library! Happy to help make this happen.

Main issues

  1. Heavy load of transitive dependencies
    • It seems many users might be using one or two connectors/services, but with the current setup, they are getting dependencies for a lot of unused functionality.
    • Longer download times, even if one is only using a single connector/service such as S3
    • Larger attack surface to manage (see below)
  2. Quite restrictive on the dependency versions
    • Prevents users from updating
    • Users who follow the best practice of pinning or restricting dependencies at the application level are unable to get bug fixes to critical packages like numpy
    • I understand there are reasons for restricting to a degree, but the current setup is overly restrictive. Combined with the current release cadence, it creates a major roadblock
    • Maybe consider releasing daily or weekly if it's imperative the dependency restrictions not be relaxed

Describe the solution you'd like

With the upcoming 3.0 release, I wonder if it would be possible to comb through the dependencies and move the connector/service-specific ones into per-service extras, i.e., everything for redshift goes in a "redshift" extra, everything for lakeformation goes in a "lakeformation" extra, etc. A major release would be the perfect time for this, IMO.

Dask has a pretty good setup/example of how to handle this and educate users about extras. Maybe consider doing something similar for aws-sdk-pandas?

https://docs.dask.org/en/stable/install.html#pip
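To make the proposal concrete, here is a minimal sketch of how core requirements and per-service extras might be laid out in a setup.py-style mapping. The extra names and the assignment of packages to services are assumptions inferred from the pipdeptree output further down in this issue, not the project's actual packaging. Version ranges for the extras are copied from that output; the core list deliberately omits version ranges.

```python
# Hypothetical split of dependencies into a small core plus per-service extras.
# Extra names and the package-to-service mapping are illustrative assumptions.

CORE_REQUIRES = ["boto3", "numpy", "pandas", "pyarrow"]

EXTRAS_REQUIRE = {
    "redshift": ["redshift-connector>=2.0.889,<2.1.0"],
    "mysql": ["pymysql>=1.0.0,<2.0.0"],
    "postgres": ["pg8000>=1.20.0,<2.0.0"],
    "gremlin": ["gremlinpython>=3.5.2,<4.0.0"],
    "opensearch": ["opensearch-py>=1,<3", "requests-aws4auth>=1.1.1,<2.0.0"],
    "excel": ["openpyxl>=3.0.0,<3.1.0"],
}

# Convenience extra that restores the "install everything" behaviour.
EXTRAS_REQUIRE["all"] = sorted(
    {req for reqs in EXTRAS_REQUIRE.values() for req in reqs}
)


def requirements_for(*extras):
    """Return the requirement strings an install with the given extras would request."""
    reqs = list(CORE_REQUIRES)
    for name in extras:
        reqs.extend(EXTRAS_REQUIRE[name])
    return reqs


print(requirements_for("redshift"))
```

With a layout like this, a user who only needs one service would request just that extra (for example `pip install "awswrangler[redshift]"`, where the extra name is hypothetical), and everyone else would get only the core packages.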

Additionally, I am asking for the dependency restrictions to be relaxed so users can incorporate bug fixes, mostly around pandas, pyarrow, boto and numpy. It should be up to the user to pin and/or upgrade. If downstream users aren't following best practice, educate them on how to.
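As a rough illustration of why a tight upper bound blocks bug fixes, the sketch below checks a version against a range using plain tuple comparison. It is a simplification that only handles plain `X.Y.Z` release strings; real specifier handling follows PEP 440 and is implemented by the `packaging` library. The first range is the numpy requirement shown in the pipdeptree output in this issue; the looser range is a hypothetical alternative, not a recommendation from the maintainers.

```python
def parse(version):
    """Turn a plain release string such as '1.23.4' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def allowed(version, lower, upper, upper_inclusive):
    """Check lower <= version <= upper (or < upper when upper_inclusive is False)."""
    v = parse(version)
    if v < parse(lower):
        return False
    return v <= parse(upper) if upper_inclusive else v < parse(upper)


# Range taken from the pipdeptree output: numpy >=1.21.0,<=1.23.4
print(allowed("1.23.5", "1.21.0", "1.23.4", upper_inclusive=True))   # False

# A looser, hypothetical range: numpy >=1.21.0,<2.0.0
print(allowed("1.23.5", "1.21.0", "2.0.0", upper_inclusive=False))   # True
```

Under the pinned range, a later patch release such as 1.23.5 is rejected until the library itself publishes a new version, which is where the release cadence becomes the bottleneck.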

Transitives

On a fresh venv I end up with the following 52 packages from installing awswrangler with no extras. For comparison, a venv with only the "core" requirements (pandas, numpy, pyarrow, boto3) installed has 16 packages. Not all of the 52 are needed for interacting with an individual service.

pip list

aenum              3.1.11
aiohttp            3.8.1
aiosignal          1.3.1
asn1crypto         1.5.1
async-timeout      4.0.2
attrs              22.2.0
awswrangler        2.19.0
backoff            2.2.1
beautifulsoup4     4.11.2
boto3              1.26.62
botocore           1.29.62
certifi            2022.12.7
charset-normalizer 2.1.1
decorator          5.1.1
et-xmlfile         1.1.0
frozenlist         1.3.3
gremlinpython      3.6.2
idna               3.4
isodate            0.6.1
jmespath           1.0.1
jsonpath-ng        1.5.3
lxml               4.9.2
multidict          6.0.4
nest-asyncio       1.5.6
numpy              1.23.4
openpyxl           3.0.10
opensearch-py      2.1.1
packaging          23.0
pandas             1.5.1
pg8000             1.29.4
pip                22.3.1
pipdeptree         2.3.3
ply                3.11
progressbar2       4.2.0
pyarrow            10.0.1
PyMySQL            1.0.2
python-dateutil    2.8.2
python-utils       3.4.5
pytz               2022.7.1
redshift-connector 2.0.910
requests           2.28.2
requests-aws4auth  1.2.1
s3transfer         0.6.0
scramp             1.4.4
setuptools         65.6.3
six                1.16.0
soupsieve          2.3.2.post1
urllib3            1.26.14
wheel              0.38.4
yarl               1.8.2

Output from pipdeptree.

awswrangler==2.19.0
  - backoff [required: >=1.11.1,<3.0.0, installed: 2.2.1]
  - boto3 [required: >=1.24.11,<2.0.0, installed: 1.26.62]
    - botocore [required: >=1.29.62,<1.30.0, installed: 1.29.62]
      - jmespath [required: >=0.7.1,<2.0.0, installed: 1.0.1]
      - python-dateutil [required: >=2.1,<3.0.0, installed: 2.8.2]
        - six [required: >=1.5, installed: 1.16.0]
      - urllib3 [required: >=1.25.4,<1.27, installed: 1.26.14]
    - jmespath [required: >=0.7.1,<2.0.0, installed: 1.0.1]
    - s3transfer [required: >=0.6.0,<0.7.0, installed: 0.6.0]
      - botocore [required: >=1.12.36,<2.0a.0, installed: 1.29.62]
        - jmespath [required: >=0.7.1,<2.0.0, installed: 1.0.1]
        - python-dateutil [required: >=2.1,<3.0.0, installed: 2.8.2]
          - six [required: >=1.5, installed: 1.16.0]
        - urllib3 [required: >=1.25.4,<1.27, installed: 1.26.14]
  - botocore [required: >=1.27.11,<2.0.0, installed: 1.29.62]
    - jmespath [required: >=0.7.1,<2.0.0, installed: 1.0.1]
    - python-dateutil [required: >=2.1,<3.0.0, installed: 2.8.2]
      - six [required: >=1.5, installed: 1.16.0]
    - urllib3 [required: >=1.25.4,<1.27, installed: 1.26.14]
  - gremlinpython [required: >=3.5.2,<4.0.0, installed: 3.6.2]
    - aenum [required: >=1.4.5,<4.0.0, installed: 3.1.11]
    - aiohttp [required: >=3.8.0,<=3.8.1, installed: 3.8.1]
      - aiosignal [required: >=1.1.2, installed: 1.3.1]
        - frozenlist [required: >=1.1.0, installed: 1.3.3]
      - async-timeout [required: >=4.0.0a3,<5.0, installed: 4.0.2]
      - attrs [required: >=17.3.0, installed: 22.2.0]
      - charset-normalizer [required: >=2.0,<3.0, installed: 2.1.1]
      - frozenlist [required: >=1.1.1, installed: 1.3.3]
      - multidict [required: >=4.5,<7.0, installed: 6.0.4]
      - yarl [required: >=1.0,<2.0, installed: 1.8.2]
        - idna [required: >=2.0, installed: 3.4]
        - multidict [required: >=4.0, installed: 6.0.4]
    - isodate [required: >=0.6.0,<1.0.0, installed: 0.6.1]
      - six [required: Any, installed: 1.16.0]
    - nest-asyncio [required: Any, installed: 1.5.6]
  - jsonpath-ng [required: >=1.5.3,<2.0.0, installed: 1.5.3]
    - decorator [required: Any, installed: 5.1.1]
    - ply [required: Any, installed: 3.11]
    - six [required: Any, installed: 1.16.0]
  - numpy [required: >=1.21.0,<=1.23.4, installed: 1.23.4]
  - openpyxl [required: >=3.0.0,<3.1.0, installed: 3.0.10]
    - et-xmlfile [required: Any, installed: 1.1.0]
  - opensearch-py [required: >=1,<3, installed: 2.1.1]
    - certifi [required: Any, installed: 2022.12.7]
    - requests [required: >=2.4.0,<3.0.0, installed: 2.28.2]
      - certifi [required: >=2017.4.17, installed: 2022.12.7]
      - charset-normalizer [required: >=2,<4, installed: 2.1.1]
      - idna [required: >=2.5,<4, installed: 3.4]
      - urllib3 [required: >=1.21.1,<1.27, installed: 1.26.14]
    - urllib3 [required: >=1.21.1,<2, installed: 1.26.14]
  - pandas [required: >=1.2.0,<=1.5.1,<2.0.0,!=1.5.0, installed: 1.5.1]
    - numpy [required: >=1.21.0, installed: 1.23.4]
    - python-dateutil [required: >=2.8.1, installed: 2.8.2]
      - six [required: >=1.5, installed: 1.16.0]
    - pytz [required: >=2020.1, installed: 2022.7.1]
  - pg8000 [required: >=1.20.0,<2.0.0, installed: 1.29.4]
    - python-dateutil [required: >=2.8.2, installed: 2.8.2]
      - six [required: >=1.5, installed: 1.16.0]
    - scramp [required: >=1.4.3, installed: 1.4.4]
      - asn1crypto [required: >=1.5.1, installed: 1.5.1]
  - progressbar2 [required: >=4.0.0,<5.0.0, installed: 4.2.0]
    - python-utils [required: >=3.0.0, installed: 3.4.5]
  - pyarrow [required: >=2.0.0,<10.1.0, installed: 10.0.1]
    - numpy [required: >=1.16.6, installed: 1.23.4]
  - pymysql [required: >=1.0.0,<2.0.0, installed: 1.0.2]
  - redshift-connector [required: >=2.0.889,<2.1.0, installed: 2.0.910]
    - beautifulsoup4 [required: >=4.7.0,<5.0.0, installed: 4.11.2]
      - soupsieve [required: >1.2, installed: 2.3.2.post1]
    - boto3 [required: >=1.9.201,<2.0.0, installed: 1.26.62]
      - botocore [required: >=1.29.62,<1.30.0, installed: 1.29.62]
        - jmespath [required: >=0.7.1,<2.0.0, installed: 1.0.1]
        - python-dateutil [required: >=2.1,<3.0.0, installed: 2.8.2]
          - six [required: >=1.5, installed: 1.16.0]
        - urllib3 [required: >=1.25.4,<1.27, installed: 1.26.14]
      - jmespath [required: >=0.7.1,<2.0.0, installed: 1.0.1]
      - s3transfer [required: >=0.6.0,<0.7.0, installed: 0.6.0]
        - botocore [required: >=1.12.36,<2.0a.0, installed: 1.29.62]
          - jmespath [required: >=0.7.1,<2.0.0, installed: 1.0.1]
          - python-dateutil [required: >=2.1,<3.0.0, installed: 2.8.2]
            - six [required: >=1.5, installed: 1.16.0]
          - urllib3 [required: >=1.25.4,<1.27, installed: 1.26.14]
    - botocore [required: >=1.12.201,<2.0.0, installed: 1.29.62]
      - jmespath [required: >=0.7.1,<2.0.0, installed: 1.0.1]
      - python-dateutil [required: >=2.1,<3.0.0, installed: 2.8.2]
        - six [required: >=1.5, installed: 1.16.0]
      - urllib3 [required: >=1.25.4,<1.27, installed: 1.26.14]
    - lxml [required: >=4.6.5, installed: 4.9.2]
    - packaging [required: Any, installed: 23.0]
    - pytz [required: >=2020.1, installed: 2022.7.1]
    - requests [required: >=2.23.0,<3.0.0, installed: 2.28.2]
      - certifi [required: >=2017.4.17, installed: 2022.12.7]
      - charset-normalizer [required: >=2,<4, installed: 2.1.1]
      - idna [required: >=2.5,<4, installed: 3.4]
      - urllib3 [required: >=1.21.1,<1.27, installed: 1.26.14]
    - scramp [required: >=1.2.0,<1.5.0, installed: 1.4.4]
      - asn1crypto [required: >=1.5.1, installed: 1.5.1]
    - setuptools [required: Any, installed: 65.6.3]
  - requests-aws4auth [required: >=1.1.1,<2.0.0, installed: 1.2.1]
    - requests [required: Any, installed: 2.28.2]
      - certifi [required: >=2017.4.17, installed: 2022.12.7]
      - charset-normalizer [required: >=2,<4, installed: 2.1.1]
      - idna [required: >=2.5,<4, installed: 3.4]
      - urllib3 [required: >=1.21.1,<1.27, installed: 1.26.14]
    - six [required: Any, installed: 1.16.0]
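
To see which direct dependencies account for most of the transitives, the tree above can be summarised with a small script. This is a minimal sketch that assumes the two-space indentation and `- name [required: ..., installed: ...]` line shape shown in the output above; `SAMPLE` is a short excerpt of that output, and `transitive_counts` is a hypothetical helper name.

```python
import re

# Short excerpt of the pipdeptree output above, used as sample input.
SAMPLE = """\
awswrangler==2.19.0
  - openpyxl [required: >=3.0.0,<3.1.0, installed: 3.0.10]
    - et-xmlfile [required: Any, installed: 1.1.0]
  - pg8000 [required: >=1.20.0,<2.0.0, installed: 1.29.4]
    - python-dateutil [required: >=2.8.2, installed: 2.8.2]
      - six [required: >=1.5, installed: 1.16.0]
    - scramp [required: >=1.4.3, installed: 1.4.4]
      - asn1crypto [required: >=1.5.1, installed: 1.5.1]
  - pymysql [required: >=1.0.0,<2.0.0, installed: 1.0.2]
"""

# Matches "<indent>- <package name> [" and captures the indent and the name.
LINE = re.compile(r"^(?P<indent> *)- (?P<name>[A-Za-z0-9._-]+) \[")


def transitive_counts(tree_text):
    """Map each direct dependency to the set of packages listed beneath it."""
    result = {}
    current = None
    for line in tree_text.splitlines():
        match = LINE.match(line)
        if not match:
            continue  # root line, or anything that is not a dependency entry
        depth = len(match.group("indent")) // 2
        name = match.group("name")
        if depth == 1:
            current = name
            result[current] = set()
        elif current is not None:
            result[current].add(name)
    return result


for dep, children in transitive_counts(SAMPLE).items():
    print(f"{dep}: {len(children)} transitive package(s)")
```

Running the same function over the full tree would show how many packages each connector brings in on its own, which is the information needed to decide what belongs in an extra.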

jaidisido commented 1 year ago

Thank you @jmahlik, I agree with your points. PR #1992 should address this as part of release 3.0

jmahlik commented 1 year ago

> Thank you @jmahlik, I agree with your points. PR #1992 should address this as part of release 3.0

Awesome, I just ran our internal test suites against the release-3.0.0 branch, and this completely resolves all the issues and passes without breakage :). Even on Python 3.11.

menaitm commented 1 year ago

Hi @jaidisido, this optional dependency is the last blocker for Python 3.11 support. When will this be available on PyPI as a 3.0.0rc release?

jmahlik commented 1 year ago

@menaitm I did the workaround from https://github.com/aws/aws-sdk-pandas/issues/1714#issuecomment-1435156127 on 2.x and it worked wonderfully. Pin gremlinpython==3.6.3rc1, assuming you aren't actually using it.