Azure / azureml-sdk-for-r

Azure Machine Learning SDK for R
https://azure.github.io/azureml-sdk-for-r/

Unable to Download Files from File Dataset - Error: C stack usage 160275245380 is too close to the limit #402

Open buswrecker opened 3 years ago

buswrecker commented 3 years ago

Describe the bug: Unable to download files to a compute instance using download_from_file_dataset().

> library(azuremlsdk)
> ws <- load_workspace_from_config()
> ojdata <- get_dataset_by_name(name = "diabetesfiles", workspace = ws)
> download_from_file_dataset(ojdata, target_path = 'sampleData', overwrite=T)
Error: C stack usage  160275245380 is too close to the limit
> download_from_file_dataset(ojdata, target_path = 'sampleData', overwrite=T)

Screenshots: attached in the original issue.


nihil0 commented 3 years ago

I have the same error, but in my case the datastore is an Azure SQL DB. I am creating a tabular dataset from that datastore and trying to read it into a data frame. Interestingly, this works fine in Python, and I believe the R package calls the same Python code. For now I am using a workaround where I download the files locally and read them into R data frames:

library(readr)
ds <- get_dataset_by_name(ws, name = "traindata-test")
ds$to_csv_files()$download("./data/traindata")  # materialize the tabular dataset as CSV files and download them
df <- read_csv("./data/traindata/part-00000")

I am trying to understand why I run into a C stack error when I try to do the conversion from a TabularDataset to an R data frame using load_dataset_into_data_frame().
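
For reference, the failing conversion boils down to just the following (a minimal sketch, reusing the dataset name from the workaround above):

library(azuremlsdk)

ws <- load_workspace_from_config()
ds <- get_dataset_by_name(ws, name = "traindata-test")

# This is the call that dies with "Error: C stack usage ... is too close to the limit"
df <- load_dataset_into_data_frame(ds)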

buswrecker commented 3 years ago

I have used load_dataset_into_data_frame() where the source data is in a SQL DB and the dataset is a TabularDataset; it works fine up to about 10k rows and 200 columns. I could imagine this becoming problematic with larger datasets and potentially hitting C stack errors.
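
If size is indeed the trigger, one thing worth trying is cutting the dataset down before the conversion. A sketch, assuming the underlying Python TabularDataset take() method is reachable through the reticulate-backed dataset object (the dataset name is a placeholder):

my_dataset <- get_dataset_by_name(ws, name = "my-tabular-dataset")
small_dataset <- my_dataset$take(10000L)  # keep only the first 10k records
my_data <- load_dataset_into_data_frame(small_dataset)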

zac-at-incycle commented 3 years ago

I am encountering the same error when trying to load a dataset on a compute cluster.

When run on an ML compute instance via RStudio, the code below runs fine.

When executed as part of a pipeline in an RScriptStep on an ML compute cluster with the same VM SKU as the RStudio compute instance, it throws the C stack error:

my_data <- load_dataset_into_data_frame(my_dataset)

zac-at-incycle commented 3 years ago

Also: attempting to use $to_csv_files()$download(...) as mentioned by @nihil0 is not working for me. It caused the same 'C stack' error when run on a compute instance in RStudio.

zac-at-incycle commented 3 years ago

For anyone else blocked by the same issue, I was able to work around it by downloading files directly from the Datasource and not using a Dataset at all.

Instead of

my_dataset <- get_dataset_by_name(aml_workspace, my_dataset_name)
my_data <- load_dataset_into_data_frame(my_dataset)  # 'C stack' error thrown here when running on a compute cluster

I used

input_datastore <- get_datastore(aml_workspace, "input_data")
download_from_datastore(datastore = input_datastore, target_path = "./input_data", overwrite = TRUE)
my_data <- read.csv("./input_data/my_data.csv")

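Applied to the file-dataset scenario from the original post, the same datastore-level workaround would look roughly like this (the datastore name and blob prefix below are assumptions, not taken from the issue):

library(azuremlsdk)

ws <- load_workspace_from_config()
blob_store <- get_datastore(ws, "workspaceblobstore")  # assumed datastore backing "diabetesfiles"

# Pull the raw files straight from the datastore, bypassing the Dataset API
download_from_datastore(datastore = blob_store,
                        target_path = "sampleData",
                        prefix = "diabetes",  # assumed folder prefix on the datastore
                        overwrite = TRUE)

files <- list.files("sampleData", recursive = TRUE, full.names = TRUE)
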
zac-at-incycle commented 3 years ago

Also seeing the same error when attempting to use the get_model() method.

zac-at-incycle commented 3 years ago

I'm now seeing the same error after deploying a previously working R script and RScriptStep into a new workspace. Same code and cluster VM SKU, but the new workspace consistently throws the 'C stack' error.

jakeatmsft commented 3 years ago

Blocked by the same issue when mounting a file dataset.

jakeatmsft commented 3 years ago

Blocked by the same issue when mounting a file dataset. I printed out the Cstack info and it looks OK. At this point I am unable to download from either a datastore (the workaround above) or a dataset. Is there another workaround? This is blocking a client CI/CD pipeline.

print(Cstack_info())
download_from_datastore(datastore='x', path='y', prefix='z', overwrite=TRUE)

output:
      size    current  direction eval_depth
   7969177      88448          1         11
Error: C stack usage  870311906868 is too close to the limit
Execution halted
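
As a side note, Cstack_info() is base R and returns the stack size, current usage, growth direction, and evaluation depth, so it can be logged at the top of a script step to capture what limits the failing process actually reports; a minimal sketch:

# Log what R believes about its own C stack (base R, no extra packages needed)
info <- Cstack_info()
message("C stack size: ", info["size"],
        " | current usage: ", info["current"],
        " | eval depth: ", info["eval_depth"])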

pourmoayed commented 3 years ago

I also get the same C stack error when trying to get some data from the workspace datastore using a simple SQL query:

qry_str <- "SELECT * FROM ws_sql_view"
dataset_obj <- ws %>%
  get_datastore("isf_db") %>%
  reticulate::tuple(qry_str) %>%
  python_sdk$core$dataset$Dataset$Tabular$from_sql_query()

The code fails on the last line, where we use the Python module directly from R to get the dataset object (i.e., python_sdk$core$dataset$Dataset$Tabular$from_sql_query()):

Error: C stack usage  403877116004 is too close to the limit
Execution halted

Any news on a possible solution to this issue?

zac-at-incycle commented 3 years ago

As mentioned in my earlier comment, I started seeing this error after deploying a previously working R script and RScriptStep into a new workspace; same code and cluster VM SKU, but the new workspace consistently threw the 'C stack' error.

For anyone else blocked by this issue, I discovered that the difference between the two workspaces mentioned above was that one used a datastore that accessed blob storage via an account key and the other used a datastore that accessed blob storage with a SAS token.

Attempting to use the datastore with a SAS token from the R SDK triggered the 'C stack' error. Using the datastore with an account key did not.
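
In case it helps anyone reproduce the distinction, registering the blob datastore with an account key rather than a SAS token would look roughly like this (all names and the key variable below are placeholders, not the actual values):

library(azuremlsdk)

ws <- load_workspace_from_config()

# Register the container with an account key; this is the configuration that
# did not trigger the 'C stack' error (placeholder names and key)
register_azure_blob_container_datastore(
  workspace = ws,
  datastore_name = "input_data",
  container_name = "input-data",
  account_name = "mystorageaccount",
  account_key = Sys.getenv("STORAGE_ACCOUNT_KEY"))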

jpe316 commented 3 years ago

Tabular dataset support in the R SDK and RScriptStep are experimental and we will not be triaging issues for them at this time; please do not take a dependency on them.

We will follow up on the file datasets issue with a recommended approach.