IDEMSInternational / R-Instat

A statistics software package powered by R
http://r-instat.org/
GNU General Public License v3.0

Load and manipulate wide files in R-Instat more efficiently? #7161

Open rdstern opened 2 years ago

rdstern commented 2 years ago

ZZHR62FL.zip

This is an SPSS file that reads fine into R-Instat. The zip is about 3 MB, but it is 50 MB as an SPSS .sav file.

It has about 6000 records (rows) and over 7000 variables (columns).

When reading it into R-Instat there is initially the usual screen saying that it is taking some time. Then that stops, as I assume it is getting to the last step, which is to load the data into the R-Instat data book. However, this last step takes a long time and it looks as though R-Instat has frozen.

I gather from @dannyparsons that this step can be done much more efficiently and that this should be quite easy to implement. Possibly we should also keep the waiting screen visible during this last step. I am used to it taking (say) 10-20 seconds to complete and that is fine. I have not timed it, but suggest it is at least 10 minutes currently.

We might also wish to consider whether there is a maximum number of variables that we allow? Perhaps this can - like other limits and defaults - be changed in the Tools > Options dialogue. There isn't a problem in R with having (say) 1 million variables, perhaps by transposing a data frame with a million rows. There is a problem loading these into a dialog, and perhaps also into a grid? The maximum number of columns in ReoGrid is 32,768. This may not apply to us, as we still only show a small subset, but it might also be a sensible limit for R-Instat. ReoGrid also has a limit of 1 million rows, and that certainly doesn't apply to us (I have tried our new grid system with 3 million rows).

dannyparsons commented 2 years ago

Tagging @Patowhiz instead of assigning.

Patowhiz commented 2 years ago

@rdstern @dannyparsons I imported the same data using R-Instat functions in RStudio, and below is a benchmark comparison of the two pieces of software on my machine.

Import commands:

ZZHR62FL <- rio::import(file="C:/Users/LENOVO/Downloads/ZZHR62FL/ZZHR62FL.SAV")
data_book$import_data(data_tables=list(ZZHR62FL=ZZHR62FL))

Benchmark times (minutes):

Trial   R-Instat   RStudio
1       7.480      5.876
2       7.416      5.181
3       7.449      5.125
4       7.456      5.540
5       7.393      5.820
6       7.449      5.546

As you can see from the above, it only took an average of around 2 minutes more for the same functions to import the data into the R-Instat R object. In RStudio, the command ZZHR62FL <- rio::import(file="C:/Users/LENOVO/Downloads/ZZHR62FL/ZZHR62FL.SAV") took an average of 31.918 secs, and the command data_book$import_data(data_tables=list(ZZHR62FL=ZZHR62FL)) took an average of 4.741 mins. I ran both commands 5 times in RStudio.
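
A minimal sketch of how the two steps could be timed separately in RStudio, assuming R-Instat's R code has been sourced so that data_book exists (the file path is the one used above; adjust as needed):

# Time the file import (rio) and the data book import separately.
sav_path <- "C:/Users/LENOVO/Downloads/ZZHR62FL/ZZHR62FL.SAV"

import_time <- system.time(
  ZZHR62FL <- rio::import(file = sav_path)
)
data_book_time <- system.time(
  data_book$import_data(data_tables = list(ZZHR62FL = ZZHR62FL))
)

import_time["elapsed"]     # seconds spent in rio::import
data_book_time["elapsed"]  # seconds spent in data_book$import_data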

From the above, it was clear to me that some of the optimisations had to be done at the R function level, especially in the data_book$import_data(data_tables=list(ZZHR62FL=ZZHR62FL)) function.

Getting the .NET data frame as a symbolic expression showed no significant time difference. It took an average of 0.002 secs.

I went further and looked at the command that gets the variables metadata.

Get variables metadata command: .temp_value <- data_book$get_variables_metadata(convert_to_character=TRUE, data_name="ZZHR62FL")

It took an average of 7.188 secs in RStudio and 7.489 secs in R-Instat, which means it is important that the R command be executed only when necessary. The 7.489 secs could easily become minutes if the command is executed several times in R-Instat.
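
For illustration only, one way to avoid re-running the command is to cache the result on the R side. This is a sketch assuming the memoise package, not existing R-Instat code, and the cache would need clearing whenever the data frame's columns change:

# Wrap the expensive call so repeated requests for the same data frame's
# metadata return a cached copy instead of re-running the R command.
get_metadata_cached <- memoise::memoise(function(data_name) {
  data_book$get_variables_metadata(convert_to_character = TRUE,
                                   data_name = data_name)
})

system.time(get_metadata_cached("ZZHR62FL"))  # first call pays the full cost
system.time(get_metadata_cached("ZZHR62FL"))  # later calls are near-instant

# Clear the cache after columns are added, removed or renamed.
memoise::forget(get_metadata_cached)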

rdstern commented 2 years ago

@Patowhiz I wonder how much further we can go with this? In R it is quite easy to get (very) wide files, and I think the limit on the number of variables is just the same as the limit on the number of rows. For example, if I take the diamonds data - which is 10 variables and about 53,000 rows - I can transpose the data to be 10 rows and 53,000 variables. I assume that isn't a problem in RStudio but may be a mess in R-Instat? I wonder what rio does with that shape - I assume it may be ok-ish? This is to understand what limits we would like and where.
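
For reference, a quick way to create such a wide test data frame, assuming ggplot2 is installed for the diamonds data:

# Transposing diamonds (53,940 rows x 10 columns) gives a 10-row,
# 53,940-column data frame. t() goes via a matrix, so every column becomes
# character, but that is fine for stress-testing width.
wide_diamonds <- as.data.frame(t(ggplot2::diamonds))
dim(wide_diamonds)  # 10 53940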

Patowhiz commented 2 years ago

@rdstern thanks for asking.

Yes, it is easy in R, as I stated in my comment above. Our biggest drawback is in the data_book$import_data R function when it comes to wide datasets (it adds a whopping 4 mins on average!). Remember we use this function to import all types of data; it is the entry point to R-Instat. I have not looked into the R function in detail to understand why it takes so long to execute for wide datasets.

Once the data is loaded, from what I have noticed currently, getting any data to and from R and R-Instat is pretty fast if the row limit is 1000 (remember the paste functionality?).

Once we load the data into R-Instat, with dynamic loading we can handle any length or size of data that the machine can handle. This could be done in 2 ways:

  1. We could make the controls support continuous dynamic loading (think of how you scroll conversations in your chat apps: you would think all 1000 chats are there, but the view control is only holding around 20 chats at any given time; the rest are dynamically loaded and used to replace the ones being displayed). In our case, for instance, the selector could be designed so that the user views the variables in batches of <=1000 rows, searches are also displayed in batches of <=1000, and so on. This design requires careful enhancement of the necessary controls that link to and depend on each other. A rough sketch of the batching idea, at the R level, is given after this list.
  2. An alternative would be to lazy load an entire copy of the necessary metadata from R into .NET (as some sort of cache), then make the controls access that data at the VB.NET level. This is easy to implement and fits with the current R-Instat architecture right away, and we can start here. My main concern, of course, is that it is an inefficient use of memory from a design perspective.
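
For illustration only, the batching idea in option 1 might look like the hypothetical helper below at the R level (get_metadata_page is not an existing R-Instat function): fetch the variables metadata once, then hand it to the .NET side one page of at most 1000 rows at a time.

# Hypothetical paging helper: return one page of the variables metadata.
get_metadata_page <- function(metadata, page, page_size = 1000) {
  first <- (page - 1) * page_size + 1
  if (first > nrow(metadata)) return(metadata[0, , drop = FALSE])
  last <- min(page * page_size, nrow(metadata))
  metadata[first:last, , drop = FALSE]
}

# e.g. the selector asks for page 1 (variables 1-1000), then page 2, ...
meta <- data_book$get_variables_metadata(convert_to_character = TRUE,
                                         data_name = "ZZHR62FL")
page_1 <- get_metadata_page(meta, 1)
page_8 <- get_metadata_page(meta, 8)  # variables 7001 onwards for this file
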
rdstern commented 2 years ago

@Patowhiz sounds good. I am happy with your option 2 above. I am not especially worried about memory, particularly as I think memory use by metadata will be dwarfed by our (perhaps soon to come) undo feature's use of memory. And I really don't see memory limitations as a common problem for R-Instat users. I am really happy that we can already manage 3 million rows by 50 columns - almost (but not quite) on a 32-bit implementation, and trivially on 64-bit. And I feel we will rarely need more than 1 million.

The wide data sets are partly an obvious test to apply, and I would like to pass it! But the widest I have seen is 7000 variables, and even that is pretty ridiculous for statistical analyses.

But there is a bit of a principle here. The biggest argument against a GUI is the way it limits you, compared to learning the language. That will always be true, but I am very keen to be able to show we have worked to minimise the limitations.

Patowhiz commented 2 years ago

This issue broadly relates to issues #6936 and #7161. I will solve it in a series of pull requests. Thanks.

rdstern commented 2 years ago

@Patowhiz excellent. Exciting too!

Patowhiz commented 2 years ago

@lilyclements you can have a look at my comment https://github.com/africanmathsinitiative/R-Instat/issues/7161#issuecomment-1075126561 above. It would be very useful to get your general response with regard to the R functions responsible for import.

lilyclements commented 2 years ago

Looking into this, it seems the slowness is in the "add_defaults_variables_metadata" function, where there is a for loop over each column. In this function, we call append_to_variables_metadata, where the metadata values get appended to the current metadata. As the metadata grows, each append takes longer.

I created some dummy data and looked at the timings: increasing from 500 to 1000 columns took an additional 1.09 seconds, 1000 to 1500 took an additional 2.88 seconds, and so on, until 3500 to 4000 columns took an additional 8.45 seconds.

I suggest we consider using lists, since then the time taken would not grow so sharply as the data size increases.

For example,

# create two dummy data frames with 10 columns and 5000 rows
df <- as.data.frame(matrix(0, ncol = 10, nrow = 5000))
df1 <- as.data.frame(matrix(0, ncol = 10, nrow = 5000))

# first function appends using `rbind`, so by binding df1 to df 200 times to create a data frame of size 201*5000 length
function_1 <- function(df = df1){
  for (i in 1:200){
    df <- rbind(df, df1)
    if (i %% 50 == 0) print(paste(i, date()))
  }
  return(df)
}

# second function uses lists to create a data frame of size 201*5000 length
function_2 <- function(df = df1){
  new_dat <- NULL
  for (i in 1:201){
    new_dat[[i]] <- df
  }
  return(plyr::ldply(new_dat))
}

# look at the system time of the two functions
system.time(function_1(df))
system.time(function_2(df))

# function_1 - the function that does not use lists took 6.64 seconds to run:
#   user  system elapsed 
#   5.57    1.05    6.64 
# function_2 - the function that used lists, took 0.11 seconds to run:
#   user  system elapsed 
#  0.07    0.01    0.11 

# In addition, in function_1, the time per 50 iterations kept growing:
##[1] "50 Tue Apr 26 10:51:43 2022"
##[1] "100 Tue Apr 26 10:51:44 2022"   # 50 to 100 iterations took about 1 second
##[1] "150 Tue Apr 26 10:51:46 2022"
##[1] "200 Tue Apr 26 10:51:49 2022"   # 150 to 200 iterations took about 3 seconds

So, I suggest we consider using a list, since the time taken does not grow so sharply as the size increases. However, I'm not sure how deep this problem runs, and so would want to confirm with @dannyparsons before making any changes. It is a big function, and there is a lot going on that I do not fully understand - there may be a reason why this approach was not initially used, and/or it may affect other areas of the code that I am not aware of!

Patowhiz commented 2 years ago

@N-thony I think this can now be added as a blocker and @lilyclements can now be assigned to it. Thanks

rdstern commented 2 years ago

@Patowhiz or @N-thony or @lilyclements I wonder where we are now with the wide data files?

I tried another simple example as follows - this isn't about reading, it is about manipulating.

wide_data.csv

These data are 10 numeric variables and 10,000 rows. (It is easy to make, and I seem to remember that you and @ChrisMarsh82 were of the view that we don't need to set a particular limit on the number of variables in R-Instat?)

Then I use Prepare > Data Reshape > Transpose. This makes it into 10,000 variables and 10 rows. If that works well, then we can do the same with 100,000 rows, and hence 100,000 columns when we transpose. You guys said no limit!

It takes quite a long time with the 10,000. Now that may be a one-off, and once we have produced it, then all is fine again?

a) I then used the right-click at the bottom of the data frame and did a copy, to get another 10 by 10,000. That took perhaps 2 minutes.
b) I used Describe > One Variable > Summarise with quite a lot of variables - and that was very quick.
c) I used the Prepare > Data Frame > Row Numbers/Names dialogue. That takes a long time, even though it is just putting a variable at the start. I wonder if the code behind the Position function is inefficient?
d) I used Prepare > Column: Numeric > Enter to enter another similar variable at the end of the data. That also took a long time.

I ask now, because (as you know) the same sort of problem with many levels of a factor seems ok now. I would like to write about this limit as well - in the help.

So, is this aspect:

a) Remaining to be done?
b) Something that will always take a long time the first time, but then using the data is fine? (That doesn't seem to be the case - it is very command dependent.)
c) One where more can be done, but something has been done already?
d) One where we should have a limit? If so, then 10,000 already seems to be a bit high?

Patowhiz commented 1 year ago

@volloholic @lilyclements and I found that the function that takes time to execute for wide data sets is the one below.

DataSheet$set("public", "append_to_changes", function(value) {

  if(missing(value)) {
    stop("value arguements must be specified.")
  }
  else {
    private$changes[[length(private$changes)+1]] <- value 
  }
}
)

This function is called by append_to_variables_metadata repeatedly when importing the data. This makes the import slow, as the list has to be copied and grown every time a new change is added (note the line private$changes[[length(private$changes)+1]] <- value).

Replacing the line private$changes[[length(private$changes)+1]] <- value with private$changes <- list(private$changes, value) immensely reduced the execution time.
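
For illustration, a small sketch in plain R (outside the DataSheet class) of what the two lines build: the original line keeps a flat list but has to copy and grow it on every call, while the replacement wraps the previous list inside a new two-element list, which is cheap per call but changes the structure of private$changes, so any code that reads the changes needs to unwrap the nesting.

changes_flat <- list()
changes_nested <- list()
for (value in c("a", "b", "c")) {
  changes_flat[[length(changes_flat) + 1]] <- value  # original approach
  changes_nested <- list(changes_nested, value)      # replacement approach
}
str(changes_flat)    # flat list of 3 elements: "a", "b", "c"
str(changes_nested)  # nested: list(list(list(list(), "a"), "b"), "c")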

We opted to use this solution for now to fix this issue.

lilyclements commented 1 year ago

@Patowhiz have you got a PR with the changes, or do you want me to do this?

Another side note - I spotted a small typo when we went through this. When we implement this, can we also change all instances of "arguements" to say "arguments"? (It happens a few times in the file.)

lilyclements commented 1 year ago

As a side note, we should check these changes fix the following PRs:

rdstern commented 1 year ago

@lilyclements and @Patowhiz I hope you can make some progress during the June sprint. I suggest that once the basics are reasonably efficient, we may later be able to "get clever" with selects to get really cool.
To me "the basics" could be as fdollows - with the first data from the Extremes package described in #8223 . And it is really nice that these are data that anyone might stumble upon by accident, and then just be stuck, unless the software can cope with the structure.

I suggest there are 3 steps that you initially need to resolve (and one to check):

a) Can you read in these data - with 12,000 variables?
b) In the Prepare > Data Reshape > Transpose dialog (as an example dialog), Patrick has already arranged that the data selector fills reasonably quickly - so the dialog opens reasonably soon? This is the one to check.
c) Now Select All and copy to the receiver. That currently takes ages, and I assume it needs the same sort of changes that Patrick made to filling the selector?
d) Then the transpose works fine and produces a new data frame of 4 variables and 12,000 rows. Now use the same dialog again to transpose back. The dialog works fine, but it now produces a new data frame with the 12,000 variables again. That will need the new efficiency that you make in step a) above. (A plain-R sketch of this round trip is given below.)
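
For reference, the round trip in step d), sketched in plain R with stand-in data (this is not the code the Transpose dialog itself generates):

# Stand-in for the wide data: 4 rows by 12,000 numeric columns.
wide <- as.data.frame(matrix(rnorm(4 * 12000), nrow = 4))

long <- as.data.frame(t(wide))        # 12,000 rows x 4 columns
wide_again <- as.data.frame(t(long))  # back to 4 rows x 12,000 columns
dim(wide_again)  # 4 12000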

lilyclements commented 1 year ago

@Patowhiz I thought we found the solution (at least partially) to this when you were in the UK?

DataSheet$set("public", "append_to_changes", function(value) {
  if(missing(value)) {
    stop("value arguements must be specified.")
  }
  else {
    private$changes<-list(private$changes, value)
  }
}
)

Can this be implemented now?

Patowhiz commented 1 year ago

@lilyclements @rdstern I'm well aware of this, thanks. @lilyclements the code change we found optimises at the R level. If I were to make a PR with only that R change, @rdstern wouldn't notice a significant performance change.

To optimise at the .NET level, we need to make the column metadata window 'behave' like the data viewer. This will make the column metadata window load the columns in batches of 1,000.

As regards the selectors and receivers, we need to refactor them so that they don't have to repeatedly load and remove variables from each other, and also enhance them to work well with more than 1,000 variables.

A quick fix for the column metadata issue would be to only load the variables when the window is visible (this will effectively improve the importing experience in terms of performance), but there will be a noticeable delay when the user opens the window later.

Refactoring the selector and receiver needs to be done carefully to reduce the risk of regression in the whole software. I did look into it and couldn't see an obvious quick fix. I intend to fix this in a way that complements the search feature in the selector as well.

rdstern commented 1 year ago

@Patowhiz I could easily live with a quick fix on the column metadata, so we make some quick progress on the other 3 problems. For the simple example, namely CarcassoneHeat from the extRemes package in the library, which has 12,000 variables and 4 rows, I would live with the solution (for now) that if there are more than x variables, you can only see the first 1000 variables (the first page) in the metadata, or even no column metadata at all - at least for now.

Then there is work on the 3 problems listed by @lilyclements above, namely

a) your reading in of the data initially
b) filling a multiple receiver
c) producing a new wide sheet (e.g. transpose) - and I hope c) will be very similar to a).

I don't expect this work to finish during the sprint, but could it get to the stage that it can be done by the next update, or even handed on to someone else in the team, supervised by Lily or Patrick? So, could there be a plan by the end of the sprint?

rdstern commented 1 year ago

@Patowhiz and @lilyclements currently the widest "ordinary" files I have seen are just over 7000 variables. I define "ordinary" as a data frame where the different variables are of different "types", so some may be character, others factor and many numeric. These are annoying, but also very rare. So, it would be great to be able to cope reasonably with up to (say) 8000 variables; but if those took a long time, yet anything up to 2000 variables - perhaps ideally 4000 - was ok as long as you were patient, then I would be happy with that.

These are wide data frames where we would also like to be able to look at the column metadata.

Then there are "others" and a climatic example of an "other" is the first example in the ectRemes package, that is mentioned above and is in the library. This has over 12,000 variables. However, they are all the same (all numeric) except the names, and we don't need access to the metadata to see and change the names. So we don't need the column metadata.

If we want to cope with a slightly more general situation then we could show (say) the first 5 or 10 columns in the metadata.

Ideally, the last row in the column metadata would then apply to all the remaining variables. This will eventually be a "select", but I don't want to get ambitious too early, because that will just delay doing anything.