Uploading and registering a dataset overwrites the previous versions

lefaivre commented 4 years ago

I have a dataset that I want to register, like so:

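from azureml.core import Dataset, Workspace

# Assumption: the workspace is loaded from a local config.json
workspace = Workspace.from_config()

# Upload the local 'data' folder to the default datastore
datastore = workspace.get_default_datastore()
datastore.upload(src_dir='data', target_path='data')

# Create a tabular dataset that references the uploaded file, then register it
dataset = Dataset.Tabular.from_delimited_files(datastore.path('data/iris.csv'))
dataset = dataset.register(
    workspace=workspace,
    name='data',
    description='training dataset',
    create_new_version=True
)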

Then say I make some changes to my data and try to upload and re-register the data, like so:

datastore.upload(src_dir='data', target_path='data', overwrite=True)
dataset = dataset.register(
    workspace=workspace,
    name='data',
    description='I added a column',
    create_new_version=True
)

When I try to retrieve different versions, like so:

dataset1 = Dataset.get_by_name(workspace, name='data', version=1)
dataset2 = Dataset.get_by_name(workspace, name='data', version=2)

Both datasets are the same as the newest version: the old version has been overwritten, both when I inspect it in the AzureML UI and when I read in the data using the above code. Is this expected behaviour?

Thanks a lot!

lostmygithubaccount commented 4 years ago

This is expected: you are overwriting the data stored in the datastore when you upload it to the same path.

Instead, you could write v1 to data/v1 and v2 to data/v2.
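
A minimal sketch of that suggestion, assuming the same azureml-core SDK calls used in the original post. The helper names `versioned_path` and `upload_and_register` are hypothetical, and the `data/v1`, `data/v2` folder layout is only a convention: Azure assigns the registered version number itself, so the folder number lines up with it only if versions are registered in order.

```python
def versioned_path(base: str, version: int) -> str:
    """Return a version-specific folder such as 'data/v2' (hypothetical layout)."""
    return f"{base.rstrip('/')}/v{version}"


def upload_and_register(workspace, version: int, description: str):
    """Upload ./data to its own folder and register it as a new dataset version."""
    # Imported lazily so the path helper works without azureml-core installed.
    from azureml.core import Dataset

    datastore = workspace.get_default_datastore()
    target = versioned_path("data", version)

    # Each upload goes to its own folder, so earlier uploads are left untouched.
    datastore.upload(src_dir="data", target_path=target)

    dataset = Dataset.Tabular.from_delimited_files(
        datastore.path(f"{target}/iris.csv")
    )
    return dataset.register(
        workspace=workspace,
        name="data",
        description=description,
        create_new_version=True,
    )
```

With this layout, each registered version references a different folder, so retrieving `version=1` later should still read the original file.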

lefaivre commented 4 years ago

Thanks a lot for your response. One question: how come I am able to modify a version of the data manually through the UI without it changing all the other versions? Also, when I call Dataset.get_by_name(workspace, name='data', version=1), what is the purpose of version=1 and version=2?

Edit: I do see what you are saying. As you've stated, the data needs to be in separate folders. So I assume there is some one-to-one mapping of the actual data version to a folder v1, v2, etc.

MayMSFT commented 4 years ago

Hi @lefaivre,

A Dataset is a pointer to data in your storage. Here is an article that describes how versioning works: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-version-track-datasets

What do you want to modify in the UI?
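
The pointer behaviour can be illustrated with a toy model. This is not the Azure ML implementation: plain dictionaries stand in for blob storage and the dataset registry, and every function name here is made up for illustration. It only shows why two registered versions that reference the same path read back the same contents once that path is overwritten.

```python
# Toy model only: dicts stand in for blob storage and the dataset registry.
storage = {}   # path -> file contents
registry = {}  # (dataset name, version) -> path, i.e. the "pointer"


def upload(path, contents):
    """Write contents to a path, replacing whatever was stored there."""
    storage[path] = contents


def register(name, path):
    """Record a new version of a dataset that references the given path."""
    version = 1 + sum(1 for (n, _) in registry if n == name)
    registry[(name, version)] = path
    return version


def read(name, version):
    """Follow the pointer for a version and return the current contents."""
    return storage[registry[(name, version)]]


upload("data/iris.csv", "a,b")
register("data", "data/iris.csv")    # version 1 -> data/iris.csv

upload("data/iris.csv", "a,b,c")     # same path, so the old contents are gone
register("data", "data/iris.csv")    # version 2 -> data/iris.csv

print(read("data", 1))  # a,b,c
print(read("data", 2))  # a,b,c
```

Both versions exist in the registry, but since each stores only a path, neither keeps a copy of the data as it was at registration time.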

chengyu-liu-cs commented 4 years ago

Is there a plan to change the "Dataset" implementation so that it is more than a pointer to storage? It might be a very simple implementation on the Azure side, but it is really inconvenient to use. In effect, the current implementation forces users to manage the paths where the data lives. It is very common for people to do some processing, write to the same path, and register the dataset, expecting Azure to take care of everything.

meyetman commented 4 years ago

Hi @chengyu-liu-cs, we have two features in public preview right now that may work for you.

The first is uploading a Pandas Dataframe. It will upload the dataframe, save it to Azure Storage, and then register a dataset (with a new version if needed) for you, all in one step. If you change the base dataframe and upload a new version, both the original and the new version of the dataset will be saved, without changing the datapath.
https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory?view=azure-ml-py#register-pandas-dataframe-dataframe--target--name--description-none--tags-none--show-progress-true-

The second feature offers the same functionality as above, but for Spark Dataframes.
https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory?view=azure-ml-py#register-spark-dataframe-dataframe--target--name--description-none--tags-none--show-progress-true-
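
A rough sketch of how the Pandas feature might be called, based on the `register_pandas_dataframe(dataframe, target, name, description, tags, show_progress)` signature in the linked reference. The wrapper name `register_dataframe_version` is hypothetical, and passing the target as a `(datastore, 'data')` tuple is an assumption about one of the accepted target forms. Since the feature is in preview, check the linked documentation before relying on this.

```python
def register_dataframe_version(workspace, dataframe, name="data", description=None):
    """Upload a pandas DataFrame and register it as a dataset in one call."""
    # Imported lazily; requires azureml-core to be installed when called.
    from azureml.core import Dataset

    datastore = workspace.get_default_datastore()

    # Assumption: target may be given as a (datastore, relative path) tuple.
    return Dataset.Tabular.register_pandas_dataframe(
        dataframe=dataframe,
        target=(datastore, "data"),
        name=name,
        description=description,
    )
```

Calling this once with the original dataframe and again after adding a column would, per the description above, keep both versions retrievable through `Dataset.get_by_name(workspace, name='data', version=...)`.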

I have entered the overall versioning ask as a feature request.

Thanks! #please-close

Azure / MachineLearningNotebooks

Uploading and registering a dataset overwrites the previous versions #1156