Azure / azureml-sdk-for-r

Azure Machine Learning SDK for R
https://azure.github.io/azureml-sdk-for-r/

Strange data type for a dataset variable using 'create_tabular_dataset_from_delimited_files' #366

Open lucazav opened 4 years ago

lucazav commented 4 years ago

I'm importing the Kaggle cars dataset from an Azure Blob Storage.

library(azuremlsdk)
library(dplyr)

dstore <- get_datastore(ws, datastore_name = 'ml_data_cool__data')
path <- data_path(dstore, 'car-features-and-msrp/car-features-data.csv')
car_ds <- create_tabular_dataset_from_delimited_files(path = path)
car_prices_tbl <- load_dataset_into_data_frame(car_ds) %>%
  as_tibble()

Looking at the inferred data types, I can see a strange list data type for the variable "Engine Fuel Type":


I also tried to use the following code:

car_ds <- create_tabular_dataset_from_delimited_files(path = path,
                                                      set_column_types = reticulate::dict("Engine Fuel Type" = data_type_string()))

But I get the same result.

Is this a bug? If not, how can I avoid getting a list for that variable?

harshbangad commented 4 years ago

I encounter the same issue when I pull in data from an Azure SQL database or a CSV file from Data Lake. Some of the factor columns come back with a list data type. For now I have to unlist them manually in the R script every time, during both training and scoring, and do the type conversions there.

I also tried converting the column types on the dataset within the studio, but then we cannot load the existing dataset into RStudio within Azure, as that is a separate ongoing bug.
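For reference, the manual unlisting described above can be sketched roughly like this (a minimal sketch using base R and dplyr; the column names and data are illustrative, not from the actual dataset):

```r
library(dplyr)

# Flatten any list columns back to atomic character vectors before
# training/scoring. Each list element is assumed to hold at most one value.
fix_list_columns <- function(df) {
  df %>%
    mutate(across(where(is.list),
                  ~ vapply(.x, function(v) {
                      if (length(v) == 0 || is.null(v[[1]])) NA_character_
                      else as.character(v[[1]])
                    }, character(1))))
}

# Example: a data frame where one column came back as a list
cars <- tibble::tibble(
  Make = c("BMW", "Audi"),
  `Engine Fuel Type` = list("premium unleaded", "regular unleaded")
)
fixed <- fix_list_columns(cars)
```

After this, `fixed[["Engine Fuel Type"]]` is a plain character vector, and the usual `as.factor()` / `as.numeric()` conversions can be applied as needed.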

harshbangad commented 4 years ago

@lucazav - Did you find a workaround for this? It is creating issues for me during scoring as well.

jakeatmsft commented 4 years ago

I am having the same issue right now when loading a Parquet dataset into a data frame: one of the columns is pulled in as a list when it is expected to be in Date format.
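One way to recover the expected type after loading is to flatten the list column and parse it with `as.Date()` (a hedged sketch; it assumes each list element holds a single ISO-format date string, which may not match the actual Parquet contents):

```r
# Convert a list column back into a proper Date column after loading.
# Illustrative data: each element is one date string, possibly missing.
dates_list <- list("2020-01-31", "2020-02-29", NULL)

to_date <- function(col) {
  as.Date(vapply(col, function(v) {
    if (length(v) == 0) NA_character_ else as.character(v[[1]])
  }, character(1)))
}

fixed_dates <- to_date(dates_list)
```

Empty or `NULL` elements become `NA` rather than causing an error, which matters when the upstream data has missing dates.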

jakeatmsft commented 4 years ago

I have also noticed that the bug does not occur consistently across environments: on my compute instance I can load the data frame with the correct data types, but when I submit the same job to a compute cluster the data types are loaded incorrectly.