fsspec / filesystem_spec

A specification that python filesystems should adhere to.
BSD 3-Clause "New" or "Revised" License

Reading Files from Git LFS Repo #1438

Open johko opened 1 year ago

johko commented 1 year ago

Hey,

I'm trying to read files from a GitHub repo that uses Git LFS (https://github.com/openai/dalle3-eval-samples/tree/main), but I only get the LFS pointers to the large files (the images in the repo) instead of the binaries themselves.

Is there any way of reading these files from an LFS repo with fsspec?

My current testing code is:

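import fsspec

# Dict-like view of the repo: keys are file paths, values are file contents as bytes
github_repo = fsspec.get_mapper("github://openai:dalle3-eval-samples@main")
for file_name in github_repo:
    # For LFS-tracked files this returns the pointer text, not the image bytes
    file = github_repo[file_name]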

martindurant commented 1 year ago

We don't have such an integration. I don't know how LFS works in detail, but I imagine it's not too complex, if you would like to implement it. I know of some git-based data services that already integrate with fsspec (dvc, lakefs, xet, maybe others).

johko commented 1 year ago

Thanks for the really quick response, @martindurant.

I have to admit I don't know much about the inner workings of LFS myself. But it sounds like a fun project to investigate, so if I find the time I'll implement it 🙂

martindurant commented 1 year ago

It doesn't look terrible:

>>> print(github_repo["t2i_compbench/sdxl/complex_val/The%20black%20camera%20was%20next%20to%20the%20white%20tripod._000160.png"].decode())
version https://git-lfs.github.com/spec/v1
oid sha256:ac618aaf4f05d1a938323f4d37d78877d03b5afb2d4f04af183f298d60e33b55
size 1133715
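
For anyone picking this up, here is a rough sketch of how the pointer above could be resolved to the real bytes. It is not an existing fsspec feature. It assumes the server implements the standard Git LFS batch API at `<repo_url>.git/info/lfs/objects/batch` with the "basic" transfer, and that the repo is public so no auth is needed. The helper names (`parse_lfs_pointer`, `build_batch_request`, `fetch_lfs_object`) are made up for illustration, and error handling (e.g. a per-object "error" entry in the reply) is left out.

```python
import json
import urllib.request

LFS_SPEC_PREFIX = b"version https://git-lfs.github.com/spec/v1"
LFS_MEDIA_TYPE = "application/vnd.git-lfs+json"


def parse_lfs_pointer(data: bytes):
    """Return a dict with algo/oid/size if data is an LFS pointer, else None."""
    if not data.startswith(LFS_SPEC_PREFIX):
        return None  # ordinary file content, not a pointer
    # Each pointer line is "<key> <value>"
    fields = dict(
        line.split(" ", 1) for line in data.decode().splitlines() if " " in line
    )
    algo, _, digest = fields["oid"].partition(":")
    return {"algo": algo, "oid": digest, "size": int(fields["size"])}


def build_batch_request(repo_url: str, pointer: dict) -> urllib.request.Request:
    """Build the batch API request asking where to download one object."""
    body = json.dumps(
        {
            "operation": "download",
            "transfers": ["basic"],
            "objects": [{"oid": pointer["oid"], "size": pointer["size"]}],
        }
    ).encode()
    return urllib.request.Request(
        f"{repo_url}.git/info/lfs/objects/batch",
        data=body,
        headers={"Accept": LFS_MEDIA_TYPE, "Content-Type": LFS_MEDIA_TYPE},
        method="POST",
    )


def fetch_lfs_object(repo_url: str, pointer: dict) -> bytes:
    """Resolve a pointer to the real file bytes (needs network access)."""
    with urllib.request.urlopen(build_batch_request(repo_url, pointer)) as resp:
        reply = json.load(resp)
    action = reply["objects"][0]["actions"]["download"]
    # The server may ask for extra headers on the download request
    download = urllib.request.Request(
        action["href"], headers=action.get("header", {})
    )
    with urllib.request.urlopen(download) as resp:
        return resp.read()
```

With the mapper from the first post, usage would look roughly like `pointer = parse_lfs_pointer(github_repo[file_name])`, followed by `fetch_lfs_object("https://github.com/openai/dalle3-eval-samples", pointer)` whenever `pointer` is not `None`. Checking the downloaded bytes against the `sha256` oid would be a sensible extra step.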