Open ivyleavedtoadflax opened 4 years ago
Note to be able to restart after an interrupted spot instance training, you must check to see whether a checkpoint exists or not. We must also send the final data to the model, we can't rely on a final processing step (e.g. tokenisation, etc.) to be done in the container, it must receive the final data.
Need to pass environment variables to sagemaker containers to define MLFlow server