Closed by sayakpaul 2 years ago
It looks like some pre-processing is going on, such as data type conversion, normalization, and transposing. I think those operations should also be applied to the input images when performing inference.
We could write similar code as part of the SavedModel when defining the model signature. However, it could be a good idea to do these operations with TensorFlow Transform in the Transform component. It builds a graph that gets attached to the model graph, so we can minimize the training/serving skew problem.
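To make the first option concrete, here is a minimal sketch of attaching the preprocessing to the model graph through a custom serving signature. The tiny linear model, the `(32, 32, 3)` input shape, and the export path are all illustrative assumptions, not details from this project:

```python
import tensorflow as tf

# Hypothetical sketch: bake dtype conversion and scaling into the exported
# SavedModel via a serving signature, so clients send raw uint8 images and
# the same preprocessing runs at training and serving time.
class Classifier(tf.Module):
    def __init__(self):
        super().__init__()
        self.w = tf.Variable(tf.zeros([32 * 32 * 3, 10]), name="w")

    @tf.function(input_signature=[tf.TensorSpec([None, 32, 32, 3], tf.uint8)])
    def serve(self, images):
        # Preprocessing lives inside the exported graph, which is what
        # minimizes training/serving skew.
        x = tf.cast(images, tf.float32) / 255.0
        x = tf.reshape(x, [-1, 32 * 32 * 3])
        return {"logits": tf.matmul(x, self.w)}

module = Classifier()
tf.saved_model.save(
    module, "/tmp/model_with_preproc",
    signatures={"serving_default": module.serve})
```

A deployed client would then call the `serving_default` signature with raw `uint8` image batches, with no client-side preprocessing needed.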
In this project, we don't have to use the Transform component; I just wanted to know your thoughts. Maybe we can add this feature after the project is done.
@deep-diver I hear you!
One of the many reasons people use TFRecords to store their datasets is to eliminate the repetitive operations applied during data loading. Here, those repetitive operations include data scaling and transposition. Hence, I only applied those.
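As a sketch of what "applying those operations once, during serialization" could look like, here is a hypothetical pair of write/read helpers. Feature names, shapes, and the HWC-to-CHW transpose are illustrative assumptions, not the project's actual schema:

```python
import numpy as np
import tensorflow as tf

def serialize_example(image_uint8, label):
    # One-time preprocessing, applied while writing the TFRecord:
    # dtype conversion, [0, 1] scaling, and an HWC -> CHW transpose.
    image = image_uint8.astype(np.float32) / 255.0
    image = np.transpose(image, (2, 0, 1))
    feature = {
        "image": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[image.tobytes()])),
        "label": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[int(label)])),
    }
    return tf.train.Example(
        features=tf.train.Features(feature=feature)).SerializeToString()

def parse_example(record):
    # The training-time loader only deserializes; the repetitive scaling
    # and transposing steps are already baked into the records.
    spec = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }
    parsed = tf.io.parse_single_example(record, spec)
    image = tf.reshape(
        tf.io.decode_raw(parsed["image"], tf.float32), (3, 32, 32))
    return image, parsed["label"]
```

The trade-off is record size (float32 bytes instead of uint8), in exchange for a leaner `tf.data` input pipeline.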
Also, it's often a good idea to serialize TFRecords in a separate pipeline. As mentioned in the script, it's usually done using Apache Beam and Dataflow. For small datasets like ours, it's not a problem. But separating this part from the rest of the pipeline is usually beneficial for real-world large datasets.
Using TF Transform is definitely a good idea (a nice notebook here), and I am all for using it. But I guess we need to balance maintainability and performance here. If we already do part of the preprocessing during TFRecord serialization, we can eliminate it from the actual data-loading utilities, which can lead to performance benefits.
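For reference, the TF Transform route would center on a `preprocessing_fn`. The sketch below uses only plain TF ops and hypothetical feature names ("image", "label"); in a real pipeline, `tensorflow_transform` would trace this function into a transform graph that the Transform component attaches to the SavedModel:

```python
import tensorflow as tf

# Hypothetical TFT preprocessing_fn. The "_xf" suffix is a common naming
# convention for transformed features, not a requirement of the library.
def preprocessing_fn(inputs):
    # Same dtype conversion and scaling as the data-loading code, but
    # recorded once in a graph shared by training and serving.
    image = tf.cast(inputs["image"], tf.float32) / 255.0
    return {"image_xf": image, "label_xf": inputs["label"]}
```

Because it is a plain function of tensors, it can also be unit-tested directly without running the full pipeline.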
Let me know if anything is unclear.
Thanks for the long explanation!
I think we don't necessarily need to focus on the Transform part for now. But since Transform leverages TFT, and TFT is designed to work with Apache Beam (and hence Dataflow as well), it is worth exploring after we're done with the initial cycle (from training to deployment without TFT/Transform).
One clarifying question: do we not use TFRecords in real-world projects? Or do we still use them to build the initial version of the model in the lab, and then use Apache Beam to handle the huge amounts of data from clients after deployment?
TFRecords are very much the go-to format in the TF ecosystem when you're working in a production environment.
@deep-diver after the review I will do:
Closes #1.