cthoyt / pystow

👜 Easily pick a place to store data for your Python code.
https://pystow.readthedocs.io
MIT License

syncing an upstream gzip file with an expanded local version #45

Closed · cmungall closed this issue 2 years ago

cmungall commented 2 years ago

pystow has methods for syncing a gzipped file from a URL and dynamically opening it.
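
For reference, the current flow for a gzipped file looks roughly like this (the module key here is just a placeholder; ensure_open_gz keeps the file gzipped on disk and decompresses on the fly):

import pystow

url = "https://s3.amazonaws.com/bbop-sqlite/hp.db.gz"
# downloads hp.db.gz into ~/.data/oaklib/ if missing, then opens it
# through gzip, so the bytes come out decompressed
with pystow.ensure_open_gz("oaklib", url=url) as file:
    magic = file.read(16)  # a SQLite file starts with b"SQLite format 3\x00"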

But if my upstream file is a gzipped SQLite database (e.g. https://s3.amazonaws.com/bbop-sqlite/hp.db.gz), then I need it to be decompressed in my ~/.data folder before I make a connection to it (the same may hold for things like OWL).

I can obviously do this trivially, but this would require introspecting paths and would seem to defeat the point of having an abstraction layer.
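
The manual version would be something like the following, which is exactly the kind of path introspection I'd rather not do in every caller:

import gzip
import shutil
import sqlite3

import pystow

url = "https://s3.amazonaws.com/bbop-sqlite/hp.db.gz"
gz_path = pystow.ensure("oaklib", "sqlite", url=url)  # path to the cached hp.db.gz
db_path = gz_path.with_suffix("")  # hp.db.gz -> hp.db
if not db_path.is_file():
    with gzip.open(gz_path, "rb") as src, db_path.open("wb") as dst:
        shutil.copyfileobj(src, dst)
conn = sqlite3.connect(db_path)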

For now I am putting duplicative .db and .db.gz files on S3 and only using the former with pystow, but I would like to migrate away from distributing the uncompressed versions.

What I am imagining is:

import sqlite3
import pystow

url = 'https://s3.amazonaws.com/bbop-sqlite/hp.db.gz'
# decompress=True is the proposed new flag
path = pystow.ensure('oaklib', 'sqlite', url=url, decompress=True)
conn = sqlite3.connect(f'file:{path}', uri=True)

Does that make sense?

As an aside, it may also be useful to have specific ensure methods for sqlite and/or sqlalchemy the same way you have for pandas.
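
(For context, the pandas helpers have the shape I'd want for SQLite; the URL here is just a placeholder:)

import pystow

# downloads the file if missing and parses it straight into a DataFrame
df = pystow.ensure_csv("example", url="https://example.org/data.tsv")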

cthoyt commented 2 years ago

There are several different requests here; it would be helpful to have separate discussions for each of:

  1. Auto-decompression
  2. Ensure + sqlite.connect (done in #46; see the sketch after this list)
  3. Ensure + some SQLAlchemy functionality (not clear what you're asking for)
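
For the second point, #46 adds something whose usage looks roughly like this (using the uncompressed .db that's currently on S3, since this helper doesn't decompress):

import pystow

url = "https://s3.amazonaws.com/bbop-sqlite/hp.db"
# ensures the database file is downloaded, then yields a sqlite3.Connection
with pystow.ensure_open_sqlite("oaklib", url=url) as conn:
    rows = conn.execute("SELECT * FROM entailed_edge LIMIT 10").fetchall()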

I'd suggest taking a look at a place where I already implemented something like this for ChEMBL's tar.gz'd SQLite dump: https://github.com/cthoyt/chembl-downloader/blob/d7100ba316f496ee4c36a2a684a2d9434391eb9c/src/chembl_downloader/api.py#L164-L183. Unfortunately, connecting to a gzipped file from SQLite is a paid-only feature; otherwise, it would be great to read them directly.
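
The pattern there, condensed and lightly adapted (the module key and archive layout here are assumptions, not ChEMBL's actual ones):

import sqlite3
import tarfile
from contextlib import closing, contextmanager

import pystow

@contextmanager
def connect_targz_sqlite(url: str):
    """Ensure a tar.gz'd SQLite dump is downloaded and extracted, then connect."""
    archive_path = pystow.ensure("example", url=url)
    extract_dir = archive_path.parent / archive_path.name.removesuffix(".tar.gz")
    # extract once, next to the download; later calls reuse the directory
    if not extract_dir.is_dir():
        with tarfile.open(archive_path) as tar:
            tar.extractall(extract_dir)
    # assume the archive contains exactly one .db file
    db_path = next(extract_dir.rglob("*.db"))
    with closing(sqlite3.connect(db_path)) as conn:
        yield conn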

There's also already https://pystow.readthedocs.io/en/latest/api/pystow.ensure_untar.html, so I guess I could make an analog for gzip.
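
For reference, ensure_untar is used like this (placeholder URL; if I read the docs right, it returns the directory the archive was extracted into):

import pystow

# downloads foo.tar.gz if missing and extracts it next to the download
directory = pystow.ensure_untar("example", url="https://example.org/foo.tar.gz")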

cmungall commented 2 years ago

Let's do anything SQLite-specific in another issue.

Something analogous to ensure_untar would be great. I'll take a look at the chembl-downloader later.

cthoyt commented 2 years ago

Yeah, okay. I think the solution is to have an ensure_gunzip function and then wrap the ensure-SQLite and ensure_gunzip functions together to get what you wanted.
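
Sketching the double wrap (assuming an ensure_gunzip that downloads the .gz and returns the decompressed path, per the above; not the final API):

import sqlite3
from contextlib import closing, contextmanager

import pystow

@contextmanager
def ensure_open_sqlite_gz(*keys, url: str):
    # the proposed ensure_gunzip: download the .gz if missing, decompress it
    # next to the download, and return the path of the decompressed file
    path = pystow.ensure_gunzip(*keys, url=url)
    with closing(sqlite3.connect(path)) as conn:
        yield conn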

cthoyt commented 2 years ago

@cmungall a solution is now available; it works like this:

import pandas as pd

import pystow

if __name__ == "__main__":
    sql = "SELECT * FROM entailed_edge LIMIT 10"
    url = "https://s3.amazonaws.com/bbop-sqlite/hp.db.gz"
    # downloads and caches hp.db.gz, decompresses it on first use, and
    # yields a sqlite3 connection to the decompressed database
    with pystow.ensure_open_sqlite_gz("test", url=url) as conn:
        df = pd.read_sql(sql, conn)
    print(df)
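
If I follow the implementation, this is the ensure_gunzip + connect composition sketched above: the decompressed hp.db is presumably cached alongside the download, so subsequent calls skip both the download and the decompression.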