Azure / MachineLearningNotebooks

Python notebooks with ML and deep learning examples with Azure Machine Learning Python SDK | Microsoft
https://docs.microsoft.com/azure/machine-learning/service/
MIT License

Dataset version is not updating even though contents have changed #1132

Open chengyu-liu-cs opened 4 years ago

chengyu-liu-cs commented 4 years ago

I have been bothered by this a lot recently. The observation is that neither the version of the registered Dataset nor its last-modified timestamp changes when I re-run the pipeline, even though the underlying contents have been updated and are the latest. What's more, when I check previous versions, their contents have changed too: they are not the old data but the latest data, i.e. the content is identical to the latest version. I have checked several older versions and they all match the latest version.

I have another data pipeline where several datasets are registered, and for that pipeline the dataset version changes as expected. Comparing the two pipelines, I found a possible reason: in the working pipeline the data source paths change on every run and there is no data processing step, while in the pipeline above there is a processing step whose output is temporarily stored in the default Azure blob storage, always under the same path and name. It seems Azure ML only checks the source path: if the path is unchanged, it ignores content or timestamp changes. I am not sure whether this is by design or a bug, but it is clearly inconvenient. The version does not change even though the content has changed, and I am not able to check historical versions because they all show the same data.

azureml sdk version: 1.12.0
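
For reference, a minimal sketch of the workaround I am describing: write the processed output to a run-specific path before registering, so that the source path (and therefore the version) actually changes. The dataset name `processed-data`, the local file `outputs/processed.csv`, and the use of the default datastore are placeholders, not what the real pipeline uses:

```python
from datetime import datetime
from azureml.core import Dataset, Run

# Running inside a pipeline step; names/paths below are placeholders.
run = Run.get_context()
ws = run.experiment.workspace
datastore = ws.get_default_datastore()

# Upload the processed file under a path that is unique per run instead of
# overwriting the same blob every time, so the Dataset source path changes.
target_path = f"processed/{datetime.utcnow():%Y%m%d-%H%M%S}"
datastore.upload_files(files=["outputs/processed.csv"],
                       target_path=target_path,
                       overwrite=True)

# Registering from the new path produces a new Dataset version, and older
# versions keep pointing at their own, now-untouched blobs.
dataset = Dataset.Tabular.from_delimited_files(
    path=(datastore, f"{target_path}/processed.csv"))
dataset = dataset.register(workspace=ws,
                           name="processed-data",
                           create_new_version=True)
print(dataset.name, dataset.version)
```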

MayMSFT commented 4 years ago

This is by design. Here is a doc that explains how dataset versioning currently works: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-version-track-datasets

We are actively working on a feature that leverages the new blob versioning capability to enable recovering historical versions. Stay tuned!
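
To make the by-design behavior concrete, here is a small sketch (the dataset name `processed-data` and the assumption that it is a TabularDataset are placeholders): each registration with `create_new_version=True` adds a version entry, but every version is only a reference to its source path, so reading an old version returns whatever currently sits at that path.

```python
from azureml.core import Dataset, Workspace

ws = Workspace.from_config()

# Each register(..., create_new_version=True) adds a new version entry,
# but a version is a reference to its source path, not a data snapshot.
latest = Dataset.get_by_name(ws, name="processed-data", version="latest")
v1 = Dataset.get_by_name(ws, name="processed-data", version=1)
print("latest:", latest.version, "oldest:", v1.version)

# Because versions are references, loading version 1 reads whatever files
# currently exist at version 1's source path. If the pipeline overwrote
# that path in place, version 1 now shows the overwritten (latest) content.
df_v1 = v1.to_pandas_dataframe()  # for a FileDataset use .download() or .mount()
```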

chengyu-liu-cs commented 4 years ago

The Dataset concept is designed for tracking data used for ML training and easing data scientists' work. The current design effectively forces users to manipulate the data paths that back a registered Dataset. That design benefits the people who manage data (e.g. data engineers or Data Factory users who need to control data flows), but not data scientists. As data scientists, we are only interested in the datasets used to train models, not in their sources, and right now that is not possible. Maybe something like the Delta Lake concept could be applied to or developed for datasets as well. Looking forward to improvements.

chengyu-liu-cs commented 4 years ago

Regarding data, a "retention" feature is quite often required, so naturally I am also expecting dataset retention based on rules, for example unregistering datasets older than 2 years or keeping only the latest 10 versions. If a Dataset is just a reference to a data path, how is Azure going to provide retention on dataset versions?
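
One way to approximate such a rule today, until there is first-class support, is sketched below. It is a hedged example: the `created` tag, the dataset name, and the two-year cutoff are assumptions, and the 1.x SDK only exposes `unregister_all_versions()`, not per-version removal.

```python
from datetime import datetime, timedelta
from azureml.core import Dataset, Workspace

ws = Workspace.from_config()
name = "processed-data"  # placeholder dataset name

# At registration time, stamp each version with a creation-date tag so that
# age-based retention rules can be evaluated later, e.g.:
#   ds.register(workspace=ws, name=name,
#               tags={"created": datetime.utcnow().isoformat()},
#               create_new_version=True)

latest = Dataset.get_by_name(ws, name=name, version="latest")
cutoff = datetime.utcnow() - timedelta(days=2 * 365)

expired = []
for v in range(1, latest.version + 1):
    ds = Dataset.get_by_name(ws, name=name, version=v)
    created = (ds.tags or {}).get("created")
    if created and datetime.fromisoformat(created) < cutoff:
        expired.append(v)

print("versions past the 2-year retention window:", expired)
# azureml-core only exposes unregister_all_versions(); deleting individual
# versions (and the blobs behind them) still has to be handled manually.
```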

meyetman commented 3 years ago

Retention is provided at the Azure Storage level; Datasets then respect and follow the settings configured there. As @MayMSFT said, stay tuned for future version updates!

please-close

lostmygithubaccount commented 3 years ago

any updates on this?

meyetman commented 3 years ago

@lostmygithubaccount Versioning enhancements are still on our roadmap. https://docs.microsoft.com/en-us/azure/machine-learning/how-to-version-track-datasets contains our most up-to-date recommendation for versioning.