aws / sagemaker-inference-toolkit

Serve machine learning models within a 🐳 Docker container using 🧠 Amazon SageMaker.
Apache License 2.0

How to overwrite batch transform output in S3 #68

Open BaoshengHeTR opened 3 years ago

BaoshengHeTR commented 3 years ago

I could not find documentation on overwriting batch transform output. If I run the same batch transform job multiple times over time, how should I set up the transformer so it overwrites the output results (i.e., I do not change the output_path)?

chuyang-deng commented 3 years ago

Hi @BaoshengHeTR, are you using the Python SDK? If so, if you use the same output_path (https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/transformer.py#L59) across multiple runs, the results will be stored in the same location in S3.
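
For reference, a minimal sketch of reusing the same output location with the SDK's `Transformer` class linked above. The model name, instance type, and S3 locations are placeholders; note that re-running the job writes into the same prefix but does not delete objects that are already there.

```python
from sagemaker.transformer import Transformer

# Placeholder model name, instance type, and S3 locations -- substitute your own.
transformer = Transformer(
    model_name="my-model",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/batch-output/",  # same prefix on every run
)

# Each run writes its results under the same S3 prefix above.
transformer.transform(
    data="s3://my-bucket/batch-input/",
    content_type="text/csv",
    split_type="Line",
)
transformer.wait()
```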

BaoshengHeTR commented 3 years ago

> Hi @BaoshengHeTR, are you using the Python SDK? If so, if you use the same output_path (https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/transformer.py#L59) across multiple runs, the results will be stored in the same location in S3.

Yes. Doing it that way appends the new results to the old ones, right? So can we set up an overwrite mode? Like in Spark, we have write.mode("overwrite").

haoransh commented 3 years ago

Any update on this? I also need an overwrite mode, especially when the input S3 path is the output of a Spark job.

matiassciencenow commented 3 years ago

Same issue here. It would be ideal to be able to overwrite previous results from batch inference jobs instead of appending to them, and to have the same feature for processing jobs.

melaniemoy commented 3 years ago

Throwing in another vote for this functionality. We had to modify our Airflow task to clean the output location before starting the prediction task, but it would be nicer to be able to use something like .mode("overwrite") instead.
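
For anyone adopting the same workaround, a minimal sketch of that cleanup step using plain boto3 (this is not part of the SageMaker SDK, and the bucket and prefix names are placeholders):

```python
import boto3

def clear_s3_prefix(bucket: str, prefix: str) -> None:
    """Delete every object under the given S3 prefix so the next run starts clean."""
    s3 = boto3.resource("s3")
    s3.Bucket(bucket).objects.filter(Prefix=prefix).delete()

# Example: wipe the batch transform output location before launching the job.
clear_s3_prefix("my-bucket", "batch-output/")
```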