DiskFrame / disk.frame

Fast Disk-Based Parallelized Data Manipulation Framework for Larger-than-RAM Data

Error in tools::file_path_as_absolute(attr(df, "path")) #384

Open cpainsight opened 2 years ago

cpainsight commented 2 years ago


I'm new to R and disk.frame. I began using the package 6 months ago to process a 10GB CSV file. It worked perfectly, but about a week ago the error in the title started to appear and I can no longer run my code. Nothing in my code has changed, and I had been running it for a while without any issues. Here is part of my code:

Setting up the CPU for parallel processing

setup_disk.frame()
options(future.globals.maxSize = Inf)

Conversion of CSV file into disk frame to allow for parallel processing

csv_to_disk.frame(
  infile = "C:/Users/Ricardo Torres/OneDrive/FreeAgent Drive Back-up/My Documents/IHA/Radiation Therapy & Cancer Institute/2021/Cancer Centers/mmm/Claims/Claims2.csv",
  outdir = "C:/Users/Ricardo Torres/OneDrive/FreeAgent Drive Back-up/My Documents/IHA/Radiation Therapy & Cancer Institute/2021/Cancer Centers/mmm/Claims/Claims.df"
)

Loading the Disk Frame into a data object

utilization <- disk.frame("C:/Users/Ricardo Torres/OneDrive/FreeAgent Drive Back-up/My Documents/IHA/Radiation Therapy & Cancer Institute/2021/Cancer Centers/mmm/Claims/Claims.df/")

util_byyear <- utilization %>%
  srckeep(c('ToDate', 'Billed', 'Allowed', 'Deduct', 'Copay', 'Paid')) %>%
  mutate(y = lubridate::year(lubridate::mdy(ToDate))) %>%
  group_by(y) %>%
  summarize(
    billed = sum(Billed),
    allowed = sum(Allowed),
    deductibles = sum(Deduct),
    copays = sum(Copay),
    paid = sum(Paid)
  ) %>%
  collect()

When I run this last part of the code, I'm getting the following error:

Error in tools::file_path_as_absolute(attr(df, "path")) : file 'C:/Users/Ricardo Torres/OneDrive/FreeAgent Drive Back-up/My Documents/IHA/Radiation Therapy & Cancer Institute/2021/Cancer Centers/mmm/Claims/Claims.df/' does not exist

xiaodaigh commented 2 years ago

I can't replicate your exact message, but I did find a bug and have fixed it. I have pushed the fix to CRAN, so it should be available soon.

I would encourage you to try out {arrow} instead of {disk.frame} and you can convert disk.frame parquet to be used in arrow with the following function.

cpainsight commented 2 years ago

Still getting the same error message, but now it adds a warning message:

Error in tools::file_path_as_absolute(attr(df, "path")) :
  file 'C:/Users/Ricardo Torres/OneDrive/FreeAgent Drive Back-up/My Documents/IHA/Radiation Therapy & Cancer Institute/2021/Cancer Centers/mmm/Claims/Claims.df/' does not exist
In addition: Warning message:
In collect.summarized_disk.frame(.) :
  These columns that appear in the group-by and summarise does not appear in the original data set: sum, y. This set of action is too hard for disk.frame to figure out the srckeep automatically, you must do the srckeep manually.

xiaodaigh commented 2 years ago

What does dir.exists("C:/Users/Ricardo Torres/OneDrive/FreeAgent Drive Back-up/My Documents/IHA/Radiation Therapy & Cancer Institute/2021/Cancer Centers/mmm/Claims/Claims.df/") say?

Are you able to move it off OneDrive and try again? I wonder if there's an issue with that.

cpainsight commented 2 years ago

dir.exists("C:/Users/Ricardo Torres/OneDrive/FreeAgent Drive Back-up/My Documents/IHA/Radiation Therapy & Cancer Institute/2021/Cancer Centers/mmm/Claims/Claims.df/")
[1] TRUE

xiaodaigh commented 2 years ago

I suspect this is either a base R issue or an issue with OneDrive.

Have you tried moving the data off OneDrive and running it from there?

cpainsight commented 2 years ago

I just tested moving it off OneDrive and it worked. You are right, the issue is with OneDrive, though I wonder what exactly it is. All my other projects work fine with files located on OneDrive, and so did this one until two or three weeks ago.
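For anyone who lands here with the same error, the workaround above (moving the disk.frame folder off OneDrive) can be sketched roughly as follows. This is only an illustration under assumptions: `copy_to_local()` is a hypothetical helper that is not part of disk.frame, the demo uses temporary folders and a placeholder file rather than the real claims data, and the final `disk.frame()` call is left commented out so the snippet runs with base R alone.

```r
# Hypothetical helper (not part of disk.frame): copy a disk.frame folder
# into a local directory that is not managed by a sync client.
copy_to_local <- function(src, dest_root) {
  # Drop any trailing slash or backslash before using the path.
  src <- sub("[/\\\\]+$", "", src)
  stopifnot(dir.exists(src))
  dir.create(dest_root, recursive = TRUE, showWarnings = FALSE)
  # With recursive = TRUE the whole folder is copied into dest_root.
  ok <- file.copy(src, dest_root, recursive = TRUE)
  stopifnot(all(ok))
  file.path(dest_root, basename(src))
}

# Demo with throwaway folders standing in for the synced and local locations.
src <- file.path(tempfile("synced_"), "Claims.df")
dir.create(src, recursive = TRUE)
writeLines("placeholder", file.path(src, "1.fst"))

local_path <- copy_to_local(paste0(src, "/"), dest_root = tempfile("local_"))
print(list.files(local_path))

# Then point disk.frame at the local copy (requires the disk.frame package):
# utilization <- disk.frame::disk.frame(local_path)
```

Copying rather than moving keeps the original folder in place until the local copy has been confirmed to work.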

xiaodaigh commented 2 years ago


Maybe the above function is doing something wrong as well, so it could be a base R problem.
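One way to narrow down whether base R is involved is to compare how `dir.exists()` and `file.exists()` see the same path, with and without the trailing slash. As far as I can tell, `tools::file_path_as_absolute()` relies on `file.exists()` and raises the "does not exist" error when it returns FALSE, and the base R documentation notes that directory names passed to `file.exists()` should not end in a slash on Windows. The sketch below is a diagnostic only: `check_path()` is a hypothetical helper, the demo folder is a temporary stand-in for `Claims.df`, and this is not a confirmed explanation of the issue, since the same path reportedly worked until a few weeks earlier.

```r
# Hypothetical helper: report how base R checks see a path,
# both as given and with trailing slashes removed.
check_path <- function(path) {
  trimmed <- sub("[/\\\\]+$", "", path)
  candidates <- c(path, trimmed)
  data.frame(
    path = candidates,
    dir_exists = dir.exists(candidates),
    file_exists = file.exists(candidates),
    stringsAsFactors = FALSE
  )
}

# Demo on a temporary folder; replace with the real disk.frame path to test it.
demo_dir <- file.path(tempdir(), "Claims.df")
dir.create(demo_dir, showWarnings = FALSE)

res <- check_path(paste0(demo_dir, "/"))
print(res)
```

If `file_exists` is FALSE for the path with the trailing slash but TRUE for the trimmed one, passing the folder path to `disk.frame()` without the trailing slash would be worth trying.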