aai-institute / lakefs-spec

An fsspec implementation for the lakeFS project.
http://lakefs-spec.org/
Apache License 2.0

TensorFlow and LakeFS #257

Open willianbeneducci opened 5 months ago

willianbeneducci commented 5 months ago

What is the motivation and/or use case?

There is a very nice tutorial on how to use Pandas with the lakeFS File System to read a *.csv file directly from a repository. This is very useful.
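For reference, something like the following sketch works for tabular data (repository, branch, and file name are placeholders); lakefs-spec registers the lakefs:// protocol with fsspec, which pandas picks up automatically:

# Minimal sketch; "my-repo", "main", and "data.csv" are placeholder names.
import pandas as pd

# lakefs-spec registers the lakefs:// protocol with fsspec,
# so pandas can stream the CSV straight from the repository.
df = pd.read_csv("lakefs://my-repo/main/data.csv")
print(df.head())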

I am looking for something similar to work with TensorFlow, particularly for images. I checked TensorFlow IO, and it seems to me that nothing has been implemented for this so far.

This would be a very interesting feature to have, since lakeFS works well with huge datasets. We could benefit from this "connection" and avoid downloading everything to a local/cloud machine to train models.

How can we implement this feature?

Ideally, a TensorFlow plugin or an addition to lakefs-spec would be awesome. Thanks for all the good work so far!

AdrianoKF commented 5 months ago

Thanks for your suggestion, @willianbeneducci! Could you describe your workflow in a bit more detail?

From a look at the available options, lakeFS-spec might not be the right place for low-level I/O operations from TF (since TF plugins are implemented in C++). For tabular data, there might be an option to use Arrow datasets, since PyArrow already supports fsspec internally.
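For example, something along these lines might work (untested sketch; the quickstart path below is just for illustration):

# Hedged sketch: PyArrow datasets accept fsspec filesystems, so a lakeFS
# filesystem instance can be passed in directly. The path is a placeholder.
import pyarrow.dataset as ds
from lakefs_spec import LakeFSFileSystem

fs = LakeFSFileSystem()  # config auto-discovered from ~/.lakectl.yaml
dataset = ds.dataset("quickstart/main/lakes.parquet", format="parquet", filesystem=fs)
table = dataset.to_table()  # or iterate lazily via dataset.to_batches()
print(table.num_rows)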

willianbeneducci commented 5 months ago

Sure @AdrianoKF ! First of all, thanks for your reply!

I started with an architecture and script similar to the one described in this tutorial: ML Data Version Control and Reproducibility at Scale. That script lets you either work locally (downloading data from lakeFS) or use distributed computing with Databricks.

Working locally is fine when you have a small dataset, because your computer probably has enough space to download the data. However, when you have tons of data (~40 GB in my case), this is a little restrictive. So the Databricks solution sounds OK (although I do not have access to it); in Databricks you can read data with something like this:

# Note - This example uses static strings instead of parameters for an easier read
df = spark.read.parquet("lakefs://learn-lakefs-repo01/main/product-reviews/")
df.show()

This is very similar to the example on lakefs-spec:

import pandas as pd
data = pd.read_parquet("lakefs://quickstart/main/lakes.parquet")
print(data.head())

If I understand that correctly, we are accessing the data rather than downloading it all to the machine first.

This is what I am trying to do with TensorFlow: instantiate a connection, get a small amount of data with a generator, train on this first batch, then repeat the process.

So far I have been able to connect and view data with:

# Load and show image directly from LakeFS

from lakefs_spec import LakeFSFileSystem
from PIL import Image
import io

import matplotlib.pyplot as plt

REPO, BRANCH, FILE = "<REPO_NAME>", "<EXPERIMENT_NAME>", "<FILE_NAME>"
repo_path = f"{REPO}/{BRANCH}/{FILE}"

# This will auto-discover config from ~/.lakectl.yaml
fs = LakeFSFileSystem()  

# Open the file using fsspec (binary mode, since it is an image)
with fs.open(repo_path, "rb") as f:
    # Read the image bytes
    image_bytes = f.read()

# Open the image using PIL
image = Image.open(io.BytesIO(image_bytes))

# Display the image using matplotlib
plt.imshow(image)
plt.show()

Not sure if this is the most efficient way to do that, and I am still wondering how to plug this into TensorFlow.

TLDR: TensorFlow bad, lakeFS goes brr

Please feel free to ask more questions regarding this problem.

Thanks!

nicholasjng commented 5 months ago

Hey! I collected some thoughts on this while checking out TFIO, and thought I'd share them once I had formed an opinion.

I think the way forward for data sourcing that is as efficient as possible is zip archives. Anything TF-related strikes me as unusually complicated, which is why I would like to direct your attention to this snippet first:

# also needs requests and aiohttp installed because of the HTTP download below.
import fsspec
import numpy as np


def load_mnist() -> dict[str, np.ndarray]:
    mnist: dict[str, np.ndarray] = {}

    baseurl = "http://yann.lecun.com/exdb/mnist/"

    for key, file in [
        ("x_train", "train-images-idx3-ubyte.gz"),
        ("x_test", "t10k-images-idx3-ubyte.gz"),
        ("y_train", "train-labels-idx1-ubyte.gz"),
        ("y_test", "t10k-labels-idx1-ubyte.gz"),
    ]:
        with fsspec.open(baseurl + file, compression="gzip") as f:
            if key.startswith("x"):
                # images: skip the 16-byte IDX header, reshape to 28x28 pixels
                mnist[key] = np.frombuffer(f.read(), np.uint8, offset=16).reshape((-1, 28, 28))
            else:
                # labels: skip the 8-byte IDX header
                mnist[key] = np.frombuffer(f.read(), np.uint8, offset=8)

    return mnist

This code is enough to load MNIST directly from Yann LeCun's website into a NumPy/JAX array. You'll notice that it just downloads the file via HTTP, opens it in GZIP compression mode, and proceeds to read straight from the resulting buffer under the hood of fsspec's HTTP file type.

If this seems like a toy example, it's actually not: When checking what TFIO does to load MNIST into TF tensors, you'll see a nice similarity:

https://github.com/tensorflow/io/blob/d10cf2514b408853c4bb6c855b97008c6c62d924/tensorflow_io/python/ops/mnist_dataset_ops.py#L59-L69

(Basically, the HTTP request is abstracted away, and tf.io.decode_raw replaces np.frombuffer. It also instantiates the images one-by-one in the two .map() calls below the snippet I linked.)
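To make the parallel concrete, here is a small, untested sketch of how tf.io.decode_raw can stand in for np.frombuffer on the same IDX-formatted bytes (the 16-byte header offset and 28x28 shape are taken from the MNIST snippet above):

# Illustrative sketch: decode raw IDX image bytes into a TF tensor,
# mirroring the np.frombuffer call above (offset/shape are MNIST-specific).
import tensorflow as tf

def decode_idx_images(raw_bytes: bytes) -> tf.Tensor:
    pixels = tf.io.decode_raw(raw_bytes[16:], tf.uint8)  # skip the 16-byte header
    return tf.reshape(pixels, (-1, 28, 28))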

Piecing these two together, I would proceed as follows (if your workflow allows it, of course); a rough sketch follows after the list:

a) Zip up batches of images into a gzip archive.
b) Push/pull these using the LakeFSFileSystem (the narrower the dtype, the better; in the case of MNIST, you can go as low as uint8).
c) Roll a custom subclass of tf.data.Dataset as in the link above for MNIST, replacing the offsets and shapes found therein with those relevant to your image sizes.
d) Instruct it to decode the raw buffers obtained by e.g. file = LakeFSFileSystem.open(rpath, "rb"). (You might need to access a private member for this; the IDE is your friend.)
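Here is a rough, untested sketch of steps a), b), and d); the repository, branch, file name, and image shape are placeholder assumptions:

# Untested sketch; repository, branch, file name, and image shape (H, W)
# are all assumptions, not from an actual setup.
import gzip

import numpy as np
import tensorflow as tf
from lakefs_spec import LakeFSFileSystem

H, W = 28, 28  # adjust to your image size
fs = LakeFSFileSystem()  # config auto-discovered from ~/.lakectl.yaml

# a)/b) Zip up a batch of uint8 images and push the archive to lakeFS.
batch = np.zeros((1024, H, W), dtype=np.uint8)  # stand-in for real image data
with fs.open("my-repo/main/batches/images-000.gz", "wb") as f:
    f.write(gzip.compress(batch.tobytes()))

# d) Pull the archive back and decode the raw buffer into a TF dataset.
with fs.open("my-repo/main/batches/images-000.gz", "rb") as f:
    raw = gzip.decompress(f.read())
images = np.frombuffer(raw, dtype=np.uint8).reshape((-1, H, W))

ds = (
    tf.data.Dataset.from_tensor_slices(images)
    .map(lambda img: tf.cast(img, tf.float32) / 255.0)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)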

Is that enough to get you started? Let me know if you have questions. If you're interested in a complete example, I might try my hand at implementing one when I get time, but I'm probably busy for at least this week.

willianbeneducci commented 5 months ago

Thanks a lot @nicholasjng! This is more than enough. It is a very compact solution indeed and might work very well with zip files. However, I do not think I will be able to store data in this format, since at the lakeFS level it is useful to be able to see some of the images in the UI for explanation and debugging purposes.

I checked this TensorFlow question on StackOverflow about the image generator, and it gave me an idea.

Looking at the TensorFlow v2.1.5 source code, you can see that there is a method called "_get_batches_of_transformed_samples" which implements the following:

(...)
img = image_utils.load_img(
    filepaths[j],
    color_mode=self.color_mode,
    target_size=self.target_size,
    interpolation=self.interpolation,
    keep_aspect_ratio=self.keep_aspect_ratio,
)
(...)

It uses PIL to load images into memory. I changed it a little bit to:

# MODIFIED
# This will auto-discover config from ~/.lakectl.yaml
fs = LakeFSFileSystem()
# Open the file using fsspec and read the image bytes
with fs.open(filepaths[j], "rb") as f:
    image_bytes = f.read()
# Open the image using PIL
img = Image.open(io.BytesIO(image_bytes))
# Resize
img = img.resize(self.target_size, Image.Resampling.NEAREST)

Since it is not good practice to rewrite library methods, I wrote a custom generator that uses method overriding instead.

Something like:

1 - Create a class "MyBatchFromFilesMixin" and override "_get_batches_of_transformed_samples".
2 - Create a class "MyDataFrameIterator" that inherits from "MyBatchFromFilesMixin" and
     "tf.keras.preprocessing.image.Iterator".
3 - Create a "CustomDataGen" that inherits from "tf.keras.preprocessing.image.ImageDataGenerator".

Step 3 uses what is done in Step 2, and Step 2 uses what is done in Step 1.
We do not need to change a line of TensorFlow source code. A rough skeleton is sketched below.
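Here it is, untested; the class name is made up and it is heavily simplified (it only builds the image batch and skips the labels and augmentation that the original Keras method also handles):

# Rough, untested skeleton of the override described above.
import io

import numpy as np
import tensorflow as tf
from lakefs_spec import LakeFSFileSystem
from PIL import Image


class LakeFSDataFrameIterator(tf.keras.preprocessing.image.DataFrameIterator):
    """DataFrameIterator that reads images from lakeFS instead of local disk."""

    def _get_batches_of_transformed_samples(self, index_array):
        fs = LakeFSFileSystem()  # config auto-discovered from ~/.lakectl.yaml
        batch_x = np.zeros((len(index_array),) + self.image_shape, dtype=self.dtype)
        for i, j in enumerate(index_array):
            # Fetch the image bytes from lakeFS and decode them with PIL.
            with fs.open(self.filepaths[j], "rb") as f:
                img = Image.open(io.BytesIO(f.read()))
            img = img.resize(self.target_size, Image.Resampling.NEAREST)
            batch_x[i] = np.asarray(img, dtype=self.dtype)
        return batch_x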

This worked but... kind of hard to maintain over time.

What do you think about it? Do you think this will cause performance issues or any other concern?

nicholasjng commented 5 months ago

That's very clean, but I imagine that the repeated I/O cost of opening single images will quickly eat into your performance.

If you insist on single paths instead of zipped image batches, I guess you could prefetch some of the paths on iterator construction.
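For example, a hedged sketch of hiding that latency with tf.data; the paths, image size, and batch size are placeholders:

# Hedged sketch, not a drop-in replacement: paths, image size, and batch
# size are placeholder assumptions.
import io

import numpy as np
import tensorflow as tf
from lakefs_spec import LakeFSFileSystem
from PIL import Image

IMG_SIZE = (28, 28)  # placeholder
fs = LakeFSFileSystem()
paths = ["my-repo/main/images/0001.png", "my-repo/main/images/0002.png"]  # placeholders

def load(path: bytes) -> np.ndarray:
    # Fetch a single image from lakeFS and decode it with PIL.
    with fs.open(path.decode(), "rb") as f:
        img = Image.open(io.BytesIO(f.read())).convert("L").resize(IMG_SIZE)
    return np.asarray(img, dtype=np.uint8)

ds = (
    tf.data.Dataset.from_tensor_slices(paths)
    .map(
        lambda p: tf.ensure_shape(tf.numpy_function(load, [p], tf.uint8), IMG_SIZE),
        num_parallel_calls=tf.data.AUTOTUNE,  # overlap several lakeFS reads
    )
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)  # fetch the next batch while training on the current one
)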

willianbeneducci commented 5 months ago

Yes, you are right. Maybe instantiating it outside the loop would be better, but still not optimal.

I was thinking about the workflow with lakeFS and the lakectl local command, and maybe it would be better to keep everything local to make compatibility with TensorFlow, Keras, OpenCV, etc. easier. This tutorial shows, for example, that when using lakectl local, a file is created for Git to keep track of the data commit: lakeFS + Github + MLflow

"But Git will version control the “.lakefs_ref.yaml” file created/updated by “lakectl local” commands. “.lakefs_ref.yaml” file includes the lakeFS source/branch information and commit ID. This way code as well as commit information about data will be kept together in Git repo. "

So, even if I plug directly into the repo using lakefs-spec, later on, to version data or to train models using other libraries, I will need to put in much more effort than just using lakectl local.

Do you agree? I am not sure whether the maintenance and compatibility issues I am pointing out here are clear.