deep-diver / semantic-segmentation-ml-pipeline

Machine Learning Pipeline for Semantic Segmentation with TensorFlow Extended (TFX) and various GCP products
https://blog.tensorflow.org/2023/01/end-to-end-pipeline-for-segmentation-tfx-google-cloud-hugging-face.html
Apache License 2.0

feat: add tfrecord creation utilities. #2

Closed · sayakpaul closed this 2 years ago

sayakpaul commented 2 years ago

@deep-diver after the review I will do:

Closes #1.

deep-diver commented 2 years ago

It looks like some pre-processing is going on, such as data type conversion, normalization, and transposing. I think those operations should also be applied to the input images when performing inference.

We could write similar code as part of the SavedModel when defining the model signature. However, it could also be a good idea to do these operations with TensorFlow Transform in the Transform component. TensorFlow Transform builds a graph, and that graph is attached to the model graph, so we can minimize the training/serving skew problem.

In this project, we don't have to use the Transform component, but I just wanted to know your thoughts. Maybe we can add this feature after the project is done.
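To make this concrete, here is a minimal sketch of the kind of `preprocessing_fn` the Transform component would call (the feature keys and the exact scaling are placeholders, not the code in this PR):

```python
import tensorflow as tf

# Hypothetical feature keys; the real keys depend on how the TFRecords are serialized.
IMAGE_KEY = "image"
LABEL_KEY = "label"


def preprocessing_fn(inputs):
    """Called by the TFX Transform component (via TensorFlow Transform).

    The ops here become a TF graph that is attached to the model graph,
    so the exact same preprocessing runs at training and serving time.
    """
    outputs = {}
    # Scale raw uint8 pixels to [0, 1] floats.
    outputs[IMAGE_KEY] = tf.cast(inputs[IMAGE_KEY], tf.float32) / 255.0
    outputs[LABEL_KEY] = inputs[LABEL_KEY]
    return outputs
```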

sayakpaul commented 2 years ago

@deep-diver I hear you!

One of the many reasons people use TFRecords to store their datasets is to eliminate repetitive operations applied during data loading. Here, those repetitive operations are data scaling and transposition, which is why I applied only those.
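Roughly, the serialization step looks like this (a sketch, not the exact code in this PR; the feature keys and shapes are placeholders):

```python
import numpy as np
import tensorflow as tf


def _bytes_feature(value: bytes) -> tf.train.Feature:
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))


def serialize_example(image: np.ndarray, label: np.ndarray) -> bytes:
    # The "repetitive" ops are applied once, up front: scale to [0, 1] as float32.
    image = (image / 255.0).astype("float32")
    example = tf.train.Example(
        features=tf.train.Features(
            feature={
                "image": _bytes_feature(tf.io.serialize_tensor(image).numpy()),
                "label": _bytes_feature(tf.io.serialize_tensor(label).numpy()),
            }
        )
    )
    return example.SerializeToString()


# Dummy data purely for illustration.
image = np.random.randint(0, 256, size=(128, 128, 3), dtype=np.uint8)
label = np.random.randint(0, 3, size=(128, 128, 1), dtype=np.uint8)

with tf.io.TFRecordWriter("train.tfrecord") as writer:
    writer.write(serialize_example(image, label))
```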

Also, it's often a good idea to serialize TFRecords in a separate pipeline. As mentioned in the script, it's usually done using Apache Beam and Dataflow. For small datasets like ours, it's not a problem. But separating this part from the rest of the pipeline is usually beneficial for real-world large datasets.
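For a larger dataset, that same serialization step could run as a Beam pipeline (and on Dataflow just by switching the runner). A rough sketch, reusing `serialize_example` and the dummy data from the snippet above:

```python
import apache_beam as beam


def to_tf_example(item: dict) -> bytes:
    # Reuse serialize_example() from the sketch above so the serialization-time
    # preprocessing lives in exactly one place.
    return serialize_example(item["image"], item["label"])


# Placeholder input; in practice this would come from a real source
# (file patterns, a database export, etc.).
raw_items = [{"image": image, "label": label}]

# Passing Dataflow pipeline options here is what scales this out on GCP.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "CreateItems" >> beam.Create(raw_items)
        | "Serialize" >> beam.Map(to_tf_example)
        | "WriteTFRecords" >> beam.io.WriteToTFRecord("output/train")
    )
```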

Using TF Transform is definitely a good idea (there's a nice notebook here), and I am all for using it. But I guess we need to balance maintainability and performance here. If we are already doing part of the preprocessing during TFRecord serialization, we can essentially eliminate it from the actual data loading utilities, which can lead to performance benefits.
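On the loading side, the parse function then only has to deserialize, with no scaling or transposition in the hot path (again a sketch; the keys and dtypes mirror the serialization snippet above and are assumptions):

```python
import tensorflow as tf

_FEATURE_SPEC = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.string),
}


def parse_fn(serialized: tf.Tensor):
    parsed = tf.io.parse_single_example(serialized, _FEATURE_SPEC)
    # No scaling/casting here -- that already happened at serialization time.
    image = tf.io.parse_tensor(parsed["image"], out_type=tf.float32)
    label = tf.io.parse_tensor(parsed["label"], out_type=tf.uint8)
    return image, label


dataset = (
    tf.data.TFRecordDataset(["train.tfrecord"])
    .map(parse_fn, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(8)
    .prefetch(tf.data.AUTOTUNE)
)
```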

Let me know if anything is unclear.

deep-diver commented 2 years ago

thanks for the long explanation!

I think we don't necessarily need to focus on the Transform part for now. But since Transform leverages TFT, and TFT is designed to work with Apache Beam (hence Dataflow as well), it is worth exploring after we're done with the initial cycle (i.e., from training to deployment without TFT/Transform).

One clarification question: so, do we not use TFRecords in real-world projects? Or do we still use them to get the initial version of the model in the lab, and then use Apache Beam to handle the huge amounts of data from clients after deployment?

sayakpaul commented 2 years ago

TFRecords are very much the go-to in the TF ecosystem when you're working in a production environment.