DerwenAI / kglab

Graph Data Science: an abstraction layer in Python for building knowledge graphs, integrated with popular graph libraries – atop Pandas, NetworkX, RAPIDS, RDFlib, pySHACL, PyVis, morph-kgc, pslpython, pyarrow, etc.
https://derwen.ai/docs/kgl/
MIT License
565 stars, 64 forks

access anonymous / public AWS S3 object #197

Open fils opened 2 years ago

fils commented 2 years ago

With Dask I can do:

df = dd.read_parquet('s3://bucket/key', storage_options={'anon': True})

and it works for a public bucket or object on AWS S3.

Trying:

kg = kglab.KnowledgeGraph()

kg.load_parquet('s3://bucket/key', storage_options={'anon': True})

returns: NoCredentialsError: Unable to locate credentials

I'm curious what the right way is to pass the anon=True option so that no credentials are needed.

ceteri commented 2 years ago

Great point @fils !

Would it work to wrap these S3 URLs within some of the other libraries for working with them? In the load_parquet method there's support for using:

Although I haven't had a really good use case yet to test with for AWS – much of our testing is on GCP at the moment.

FWIW, we tried to integrate pathy as well, although we ran into some installation problems. If that would work better, we could revisit pathy?

fils commented 2 years ago

@ceteri I likely lack the depth of experience to suggest a path. :)

What little I do know makes me think fsspec sounds interesting, if only because I am learning Dask and there seems to be a relationship there.
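On the Dask relationship: Dask resolves remote URLs through fsspec, and for s3:// URLs the storage_options dict is handed to s3fs, which is where anon=True takes effect. A minimal sketch of using fsspec directly follows. The function names storage_options_for and read_public_parquet are hypothetical, and the sketch assumes fsspec, s3fs, pandas, and a parquet engine are installed. It returns a pandas DataFrame and does not touch kglab.

```python
from urllib.parse import urlparse


def storage_options_for(url: str) -> dict:
    """Hypothetical helper: anonymous access for s3:// URLs, no options otherwise."""
    if urlparse(url).scheme in ("s3", "s3a"):
        return {"anon": True}
    return {}


def read_public_parquet(url: str):
    """Hypothetical helper: read a parquet file into pandas via fsspec."""
    # Imported lazily so storage_options_for() works without these packages.
    import fsspec        # needs s3fs installed to handle s3:// URLs
    import pandas as pd  # needs pyarrow or fastparquet for parquet

    # Extra keyword arguments to fsspec.open() go to the filesystem backend;
    # for s3fs, anon=True means the request is made without credentials.
    with fsspec.open(url, mode="rb", **storage_options_for(url)) as f:
        return pd.read_parquet(f)
```

For example, read_public_parquet('s3://bucket/key') would attempt an unsigned read of that object, assuming the bucket allows public access.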

I could sidestep this rather easily in many ways. Crudely, I could simply pull down the parquet file and load it locally, or just use my credentials. Anonymous AWS access is perhaps an edge case, given the issues it could raise for a data provider's wallet.
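The crude "pull it down and load locally" route can be sketched with the standard library alone, since a publicly readable S3 object can be fetched over plain HTTPS without signing. This is only a sketch under assumptions: the bucket allows anonymous GET, the bucket name contains no dots, and the global s3.amazonaws.com endpoint serves the bucket (some regions may need a regional endpoint). The helper names s3_to_https, download_public_object, and load_public_parquet are hypothetical; the only kglab call used is load_parquet with a local path.

```python
import shutil
import tempfile
import urllib.request
from urllib.parse import quote, urlparse


def s3_to_https(s3_url: str) -> str:
    """Map s3://bucket/key to a virtual-hosted-style HTTPS URL."""
    parsed = urlparse(s3_url)
    if parsed.scheme != "s3" or not parsed.netloc or not parsed.path.strip("/"):
        raise ValueError(f"expected s3://bucket/key, got {s3_url!r}")
    key = quote(parsed.path.lstrip("/"))  # percent-encode, keep '/' separators
    return f"https://{parsed.netloc}.s3.amazonaws.com/{key}"


def download_public_object(s3_url: str) -> str:
    """Download a public object to a temporary file and return its local path."""
    with urllib.request.urlopen(s3_to_https(s3_url)) as resp, \
            tempfile.NamedTemporaryFile(suffix=".parquet", delete=False) as tmp:
        shutil.copyfileobj(resp, tmp)
        return tmp.name  # caller is responsible for deleting the file


def load_public_parquet(s3_url: str):
    """Fetch the object anonymously, then load the local copy into kglab."""
    import kglab  # imported lazily; only needed for this last step

    kg = kglab.KnowledgeGraph()
    kg.load_parquet(download_public_object(s3_url))
    return kg
```

This keeps credentials out of the picture entirely, at the cost of a full download before loading, which is reasonable for the small exploratory data sets described here.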

Our use case is that it might be nice to allow people to explore with some small data without any need for credentials and we have to be using AWS S3... so here we are.

Anonymous access for kg.load_parquet could have its uses. If you have suggestions on a path for now, I'd take any guidance.