cmungall closed this issue 2 years ago
There are several different requests here, it would be helpful to have separate discussions for each of:
- Auto-decompression
- Ensure + sqlite.connect (done in #46)
- Ensure + some SQLAlchemy functionality (not clear what you're asking for)
I'd suggest taking a look at a place where I already implemented something like this for ChEMBL's tar.gz'd SQLite dump: https://github.com/cthoyt/chembl-downloader/blob/d7100ba316f496ee4c36a2a684a2d9434391eb9c/src/chembl_downloader/api.py#L164-L183. Unfortunately connecting to a gzipped file from sqlite is a paid-only feature, otherwise it would be great to read them directly.
There's also already https://pystow.readthedocs.io/en/latest/api/pystow.ensure_untar.html, so I guess I could make an analog for gzip
Let's do anything SQLite-specific in another issue.
Something analogous to ensure_untar would be great. I'll take a look at the chembl downloader later.
On Mon, Jul 11, 2022 at 8:16 PM Charles Tapley Hoyt wrote:
There are several different requests here, it would be helpful to have separate discussions for each of:
- Auto-decompression
- Ensure + sqlite.connect
- Ensure + some SQLAlchemy functionality (not clear what you're asking for)
I'd suggest taking a look at a place where I already implemented something like this for ChEMBL's tar.gz'd SQLite dump: https://github.com/cthoyt/chembl-downloader/blob/d7100ba316f496ee4c36a2a684a2d9434391eb9c/src/chembl_downloader/api.py#L164-L183. Unfortunately connecting to a gzipped file from sqlite is a paid-only feature, otherwise it would be great to read them directly.
There's also already https://pystow.readthedocs.io/en/latest/api/pystow.ensure_untar.html, so I guess I could make an analog for gzip
Yeah, okay, I think the solution is to have an ensure_gunzip function and then double wrap the ensure sqlite and ensure_gunzip functions to get what you wanted.
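To make the "double wrap" idea concrete, here is a minimal standard-library sketch. It is not pystow's implementation: `gunzip_once` and `open_sqlite_gz` are hypothetical names, and the sketch assumes the .gz file is already on disk (in pystow the download step would come first).

```python
import gzip
import shutil
import sqlite3
from contextlib import closing, contextmanager
from pathlib import Path
from typing import Iterator


def gunzip_once(gz_path: Path) -> Path:
    """Hypothetical stand-in for ensure_gunzip: decompress next to the .gz file."""
    db_path = gz_path.with_suffix("")  # "hp.db.gz" -> "hp.db"
    if not db_path.exists():
        with gzip.open(gz_path, "rb") as src, open(db_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return db_path


@contextmanager
def open_sqlite_gz(gz_path: Path) -> Iterator[sqlite3.Connection]:
    """Outer wrapper: make sure the plain file exists, then hand out a connection."""
    db_path = gunzip_once(gz_path)
    # closing() guarantees conn.close(); sqlite3's own context manager only
    # commits or rolls back and leaves the connection open.
    with closing(sqlite3.connect(db_path)) as conn:
        yield conn
```

The inner function only deals with files and the outer one only with connections, so the same decompression step could be reused for other formats.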
@cmungall solution is now available like:
import pandas as pd
import pystow

if __name__ == "__main__":
    sql = "SELECT * FROM entailed_edge LIMIT 10"
    url = "https://s3.amazonaws.com/bbop-sqlite/hp.db.gz"
    with pystow.ensure_open_sqlite_gz("test", url=url) as conn:
        df = pd.read_sql(sql, conn)
    print(df)
pystow has methods for syncing with a gzipped file from a URL and dynamically opening it, but if my upstream file is a gzipped sqlite database (e.g. https://s3.amazonaws.com/bbop-sqlite/hp.db.gz), then I need it to be uncompressed in my ~/.data folder before I make a connection to it (the same may hold for things like OWL).
I can obviously do this trivially, but this would require introspecting paths and would seem to defeat the point of having an abstraction layer.
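For reference, the manual route might look roughly like the sketch below. The helper name `gunzip_alongside` is hypothetical, and the sketch assumes the compressed file has already been downloaded and its path is known, which is exactly the path introspection the abstraction layer is meant to hide.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_alongside(gz_path: Path) -> Path:
    """Decompress ``name.db.gz`` to ``name.db`` in the same directory, once."""
    if gz_path.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {gz_path}")
    out_path = gz_path.with_suffix("")  # strips only the trailing ".gz"
    if out_path.exists():
        return out_path  # already decompressed on an earlier call
    # Write to a temporary name first so an interrupted run never leaves a
    # truncated database behind under the final name.
    tmp_path = out_path.with_name(out_path.name + ".part")
    with gzip.open(gz_path, "rb") as src, open(tmp_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    tmp_path.replace(out_path)
    return out_path
```

With pystow, `gz_path` would be the path returned by `pystow.ensure("test", url=url)`, and the result could be passed to `sqlite3.connect`.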
For now I am putting duplicate .db and .db.gz files on S3, and only using the former with pystow, but I would like to migrate away from distributing the uncompressed versions.
What I am imagining is:
Does that make sense?
As an aside, it may also be useful to have specific ensure methods for sqlite and/or sqlalchemy the same way you have for pandas.