This is expected - you are overwriting the data stored in the datastore when you upload it to the same path.
Instead, you could write v1 to data/v1 and v2 to data/v2.
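A minimal sketch of that layout, assuming the azureml-core SDK, the workspace's default datastore, and illustrative local folder names:

```python
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Upload each data version to its own folder in the datastore.
datastore.upload(src_dir='./local_data_v1', target_path='data/v1', overwrite=True)
datastore.upload(src_dir='./local_data_v2', target_path='data/v2', overwrite=True)

# Register each folder as a new version of the same dataset name.
ds_v1 = Dataset.File.from_files(path=(datastore, 'data/v1'))
ds_v1.register(workspace=ws, name='data', create_new_version=True)

ds_v2 = Dataset.File.from_files(path=(datastore, 'data/v2'))
ds_v2.register(workspace=ws, name='data', create_new_version=True)
```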
Thanks a lot for your response. Just a question: how come I am able to modify a version of the data manually through the UI and it doesn't change all versions, then? Also, when I call Dataset.get_by_name(workspace, name='data', version=1), what is the purpose of version=1 and version=2?
Edit: I do see what you are saying. As you've stated, the data needs to be in separate folders, so I assume there is a one-to-one mapping of each data version to a folder (v1, v2, etc.).
Hi @lefaivre,
A Dataset is a pointer to data in your storage. Here is an article that describes how versioning works: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-version-track-datasets
What do you want to modify in the UI?
Is there a plan to update/modify the Dataset implementation so it is more than a pointer to storage? It might be a very simple implementation for you at Azure, but it is really inconvenient to use. The current implementation effectively forces users to manipulate the paths where the data lives. It is very common for people to just do their processing, write to the same path, and register the dataset, expecting Azure to take care of everything.
Hi @chengyu-liu-cs, we have two features in public preview right now that may work for you.
The first is registering a pandas DataFrame. It uploads the dataframe, saves it to Azure Storage, and registers a dataset (with a new version if needed) all in one step. If you change the base dataframe and upload it again, both the original and the new version of the dataset are kept, without changing the datapath: https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory?view=azure-ml-py#register-pandas-dataframe-dataframe--target--name--description-none--tags-none--show-progress-true-
The second feature offers the same functionality for Spark DataFrames: https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory?view=azure-ml-py#register-spark-dataframe-dataframe--target--name--description-none--tags-none--show-progress-true-
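A minimal sketch of the pandas path, assuming azureml-core with the preview TabularDatasetFactory method; the dataframe contents, datastore path, and dataset name below are illustrative:

```python
import pandas as pd
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

df = pd.DataFrame({'id': [1, 2, 3], 'value': [10.0, 20.0, 30.0]})

# Uploads the dataframe to the datastore and registers it as a tabular
# dataset in one step; re-running with a changed dataframe registers a new
# dataset version without overwriting the earlier one.
dataset = Dataset.Tabular.register_pandas_dataframe(
    dataframe=df,
    target=(datastore, 'data'),   # illustrative datastore path
    name='data',                  # illustrative dataset name
    show_progress=True,
)
```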
I have entered the overall versioning ask as a feature request.
Thanks! #please-close
I have some Dataset objects that I want to register. Then say I make some changes to my data and try to upload and re-register the data to the same path (a sketch of this flow follows below):
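This is not the original snippet, but a minimal sketch of the described flow, assuming azureml-core, the workspace's default datastore, and an illustrative local folder:

```python
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Initial upload and registration.
datastore.upload(src_dir='./local_data', target_path='data', overwrite=True)
dataset = Dataset.File.from_files(path=(datastore, 'data'))
dataset.register(workspace=ws, name='data', create_new_version=True)

# ...modify the local files, then upload to the SAME path and re-register.
datastore.upload(src_dir='./local_data', target_path='data', overwrite=True)
dataset = Dataset.File.from_files(path=(datastore, 'data'))
dataset.register(workspace=ws, name='data', create_new_version=True)
```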
When I try to retrieve the different versions, like so (sketch below):
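A sketch of the retrieval calls, following the Dataset.get_by_name pattern mentioned earlier in the thread:

```python
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()

ds_v1 = Dataset.get_by_name(ws, name='data', version=1)
ds_v2 = Dataset.get_by_name(ws, name='data', version=2)

# Both versions point at the same datastore path, so they resolve to the
# same (latest) files when the underlying data was overwritten in place.
```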
The datasets are the same as the newest version (i.e., the old version has been overwritten, both when I inspect it in the Azure ML UI and when I try to read in the data using the above code). Is this expected behaviour?
Thanks a lot!