IDEMSInternational / R-Instat

A statistics software package powered by R
http://r-instat.org/
GNU General Public License v3.0
38 stars 102 forks source link

Improve the commands generated when using the calculator. #8872

Open rdstern opened 4 months ago

rdstern commented 4 months ago

@Patowhiz I think you may be able to explain the following quickly, as I assume the code is yours. Then could either you, or Antoine make the corrections?

@N-thony there are some problems running scripts, when the script is generated from the Prepare > Column Calculator. They work, but can be very slow. They also generate (extensive results in the output window that are not generated when using the calculator from the dialog.

Here is the script file:

liouville.zip

It is to be used in a practice and video. I give the first draft here, in case it is of interest to see the context But I give a shorter code below that can be used directly.

Are you a natural mathematician.docx

Here is the start of the script:

# Dialog: New Data Frame

liouville <- data.frame(x1=as.numeric(rep(seq(1,100000), each=1, length.out=100000)))
data_book$import_data(data_tables=list(liouville=liouville))

rm(liouville)
# Dialog: Calculations

liouville <- data_book$get_data_frame(data_name="liouville", use_current_filter=FALSE)
attach(what=liouville)
factors <- conf.design::factorize(x1 )
data_book$add_columns_to_data(data_name="liouville", col_name="factors", col_data=factors, before=FALSE)
data_book$get_columns_from_data(data_name="liouville", col_names="factors")
liouville <- data_book$get_data_frame(data_name="liouville", use_current_filter=FALSE)
detach(name=liouville, unload=TRUE)
data_book$append_to_variables_metadata(data_name="liouville", col_names="factors", property="labels", new_val="")
rm(list=c("factors", "liouville"))
# Dialog: Calculations

liouville <- data_book$get_data_frame(data_name="liouville", use_current_filter=FALSE)
attach(what=liouville)
length <- sapply(  factors ,length)
data_book$add_columns_to_data(data_name="liouville", col_name="length", col_data=length, before=FALSE)
data_book$get_columns_from_data(data_name="liouville", col_names="length")
liouville <- data_book$get_data_frame(data_name="liouville", use_current_filter=FALSE)
detach(name=liouville, unload=TRUE)
data_book$append_to_variables_metadata(data_name="liouville", col_names="length", property="labels", new_val="")
rm(list=c("length", "liouville"))

I notice the calculator produces 4 R-Instat statements after the calculation. Does it always need to produce so many? There is a prolem also with the data_book$get_columns_from_data statement. When run from a script it produces results in the output window? This sometimes slows R-Instat dramatically, and sometimes produces an output per row of data - and we have 100,000 rows!

I wonder if this is maybe left over from testing the "system"?

I note also that I can get the same calculations done in the Transform dialog (by a small trick). This gives much simpler code, whyuxg seems to work fine?

# Code run from Script Window (selected text)
# Dialog: Transform
x1 <- data_book$get_columns_from_data(data_name="liouville", col_names="x1", use_current_filter=FALSE)
factors <- conf.design::factorize(x1 )
data_book$add_columns_to_data(data_name="liouville", col_name="factors", col_data=factors, before=FALSE, adjacent_column="x1")
rm(list=c("x1", "factors"))

I wonder why the calculator needs so much more???

Patowhiz commented 4 months ago

@rdstern note data_book$get_columns_from_data(data_name="liouville", col_names="factors") is not assigned to anything yet it returns a data frame when executed. The returned data frame is "captured" and sent to the output window. To avoid the capture, assign to something. In this case, R-Instat is correctly doing what it is supposed to do, capture the returned data frame and display it in the output window.

rdstern commented 4 months ago

@Patowhiz many thanks for the explanation. My concern is that the same code runs so differently from the dialog and the script window. @lloyddewit maybe this ball is now in your court? The difference is shown quite neatly here:

No it isn't! I have to capture the screen instead of copying:

image

a) When run as a dialog there are effectively 5 statements - as measured by the check-boxes. b) But when run as a script there are essentially 9 statements.

Now usually running the extra statements makes no difference. In the above the line

attach(what=liouville) is an example where it doesn't matter. But running

factors <- conf.design::factorize(x1 )
data_book$add_columns_to_data(data_name="liouville", col_name="factors", col_data=factors, before=FALSE)
data_book$get_columns_from_data(data_name="liouville", col_names="factors")

as 3 separate statements matters a lot! In this example it slows R-Instat to a crawl, and later in the script it prints lots to the output window.

@Patowhiz as well as maybe changing the way the script is run I would like to understand whether there is an option to simplify the R-Instat data-book statements?

As mentioned above, I can (alternatively and a bit painfully) run the same command as above from the Prepare > Column: Numerice > Transform dialog:

image

This has sort of the same problem, but not so noticable, because there is just one R-Instat databook command, which doesn't give much output. I am not quite clear why 4 commands are needed by the calculator and only 1 by the Transform?

@Patowhiz would it do any harm if the change was for R-Instat to save the results of each of the extra commands? Then there would be no extra output from running the script, even though they were treated as separate statements in the scripts, and (seemingly) not in the dialog?

Patowhiz commented 4 months ago

@rdstern dialogs send their entire scripts directly to the R. I think when you run all in the script window, you should get the same effect. When you run line by line, the script is output in the output window after the execution. That's why you are getting multiple script outputs after every execution.

In regards to changing the script output implementation. That would require broader team discussions. Note that the output window is meant to use the same logging system that the script log window uses. The 2 are meant to be tied together. So for every output there is an underlying script log.

In regards to dialogs producing different scripts, that depends on the dialog implementation. Where possible, it's important we enforce consistency during our PR reviews. Whenever I test as a user, I tend to check on the script produced. You could do the same as well.

rdstern commented 4 months ago

@Patowhiz you said: "dialogs send their entire scripts directly to the R. I think when you run all in the script window, you should get the same effect. When you run line by line, the script is output in the output window after the execution. That's why you are getting multiple script outputs after every execution."

a) I was using Run, rather than Run All. But I mark the same block of code that corresponds to a dialog. So, I wish it was executing in the same way in the 2 cases.

b) Could you check on the code generated when doing a calculation? Here - after various iterations, I have a version that now works as fast from the script, and from the dialog.

# Dialog: New Data Frame

liouville <- data.frame(x1=as.numeric(rep(seq(1,100000), each=1, length.out=100000)))
data_book$import_data(data_tables=list(liouville=liouville))

rm(liouville)
# Dialog: Calculations

liouville <- data_book$get_data_frame(data_name="liouville", use_current_filter=FALSE)
attach(what=liouville)
factors <- conf.design::factorize(x1 )
a  <- data_book$add_columns_to_data(data_name="liouville", col_name="factors", col_data=factors, before=FALSE)
b <- data_book$get_columns_from_data(data_name="liouville", col_names="factors")
liouville <- data_book$get_data_frame(data_name="liouville", use_current_filter=FALSE)
detach(name=liouville, unload=TRUE)
data_book$append_to_variables_metadata(data_name="liouville", col_names="factors", property="labels", new_val="")
rm(list=c("factors", "liouville"))  

So adding the variable - then not used - to save the output, both helped with the output window and greatly with the speed. (3 minutes, down to 14 seconds.)

c) Then I got bolder and found I could comment out 2 of the lines with no change in the result. (Of course maybe there would be a change with other calculations? So this works fine for this example:

# Dialog: New Data Frame

liouville <- data.frame(x1=as.numeric(rep(seq(1,100000), each=1, length.out=100000)))
data_book$import_data(data_tables=list(liouville=liouville))

rm(liouville)
# Dialog: Calculations

liouville <- data_book$get_data_frame(data_name="liouville", use_current_filter=FALSE)
attach(what=liouville)
factors <- conf.design::factorize(x1 )
a  <- data_book$add_columns_to_data(data_name="liouville", col_name="factors", col_data=factors, before=FALSE)
#b <- data_book$get_columns_from_data(data_name="liouville", col_names="factors")
liouville <- data_book$get_data_frame(data_name="liouville", use_current_filter=FALSE)
detach(name=liouville, unload=TRUE)
#data_book$append_to_variables_metadata(data_name="liouville", col_names="factors", property="labels", new_val="")
rm(list=c("factors", "liouville"))

So now I have edited the whole script in this way, so saving one of the results in a, then commenting out 2 statements, including the one that was causing the problem. It seems to work fine:

# Dialog: New Data Frame

liouville <- data.frame(x1=as.numeric(rep(seq(1,100000), each=1, length.out=100000)))
data_book$import_data(data_tables=list(liouville=liouville))

rm(liouville)
# Dialog: Calculations

liouville <- data_book$get_data_frame(data_name="liouville", use_current_filter=FALSE)
attach(what=liouville)
factors <- conf.design::factorize(x1 )
a <- data_book$add_columns_to_data(data_name="liouville", col_name="factors", col_data=factors, before=FALSE)
#data_book$get_columns_from_data(data_name="liouville", col_names="factors")
liouville <- data_book$get_data_frame(data_name="liouville", use_current_filter=FALSE)
detach(name=liouville, unload=TRUE)
#data_book$append_to_variables_metadata(data_name="liouville", col_names="factors", property="labels", new_val="")
rm(list=c("factors", "liouville"))
# Dialog: Calculations

liouville <- data_book$get_data_frame(data_name="liouville", use_current_filter=FALSE)
attach(what=liouville)
length <- sapply(  factors ,length)
a <- data_book$add_columns_to_data(data_name="liouville", col_name="length", col_data=length, before=FALSE)
#data_book$get_columns_from_data(data_name="liouville", col_names="length")
liouville <- data_book$get_data_frame(data_name="liouville", use_current_filter=FALSE)
detach(name=liouville, unload=TRUE)
#data_book$append_to_variables_metadata(data_name="liouville", col_names="length", property="labels", new_val="")
rm(list=c("length", "liouville"))
# Dialog: Calculations

liouville <- data_book$get_data_frame(data_name="liouville", use_current_filter=FALSE)
attach(what=liouville)
even <- ifelse(length %% 2, - 1,1)
a <- data_book$add_columns_to_data(data_name="liouville", col_name="even", col_data=even, before=FALSE)
# data_book$get_columns_from_data(data_name="liouville", col_names="even")
liouville <- data_book$get_data_frame(data_name="liouville", use_current_filter=FALSE)
detach(name=liouville, unload=TRUE)
#data_book$append_to_variables_metadata(data_name="liouville", col_names="even", property="labels", new_val="")
rm(list=c("even", "liouville"))
# Dialog: Calculations

liouville <- data_book$get_data_frame(data_name="liouville", use_current_filter=FALSE)
attach(what=liouville)
liouville <- cumsum(even)
a <- data_book$add_columns_to_data(data_name="liouville", col_name="liouville", col_data=liouville, before=FALSE)
#data_book$get_columns_from_data(data_name="liouville", col_names="liouville")
liouville <- data_book$get_data_frame(data_name="liouville", use_current_filter=FALSE)
detach(name=liouville, unload=TRUE)
#data_book$append_to_variables_metadata(data_name="liouville", col_names="liouville", property="labels", new_val="")
rm(liouville)
Patowhiz commented 4 months ago

@rdstern if a dialog is producing unnecessary scripts, then that is a dialog issue. It means it is not well implemented and that should be fixed. You script example above shows that some script commands produced by the dialog are not used at all. Which means the dialog is not implemented well. The increase in performance is because the unnecessary scripts are no longer executed.

I'll look into running the scripts by block. I think it should behave as running from the dialog.

rdstern commented 3 months ago

@Patowhiz I have changed the title, because I have found another (small?) problem with the commands generated by the calculator. I explain it, though we may decide not to solve it. I suggest it is likely to be solved automatically by the changes from the discussion above.

You can use it as a simple (number) calculator without even having a data set opened. Just enter 7 * 5 and use the Try button. But what about putting the result in the output window? Well it works, but then issues some extra commands that give an error. So, maybe when we think of the commands we give round the calculator, we could eliminate this error?

I am now wondering if this feature is worth taking further. Could we discuss here how R-Instat works with - and without a data book? So, I tried using the mean function, and also sort - without data. And they work - sort of. Now the Try button sometimes gives an error, (but perhaps not), and it gives the results in the output window, see here:

image

Now I have taken that a step further, by turning off the echoing of the commands and also copying the commands manually into the Comment window:

image

Encouraging, and needs you @Patowhiz to check we are not creating problems, but not to make the initial changes. I need further thoughts, but like what might be possible here!

Patowhiz commented 2 months ago

@rdstern yes the dialog needs to be looked at. When more features are added to an existing dialog, there is always a regression risk. This is due to the complexity of the components interactions. It's always good to test previously existing features before we merge.

rdstern commented 2 months ago

@Patowhiz I have now stumbled over a really serious bug with the commands we produce with the calculator - at least - and perhaps other dialogs too.
It is simple to show with the data I used - by chance - namely from the library with the package hydroGOF.

image

The "problem" is that one of the variables has the same name as the data frame! When using the calculator the first line(s) of the R commands make all variables in that data frame available. But the calculator command now uses the data frame, rather than the variable! in its calculation. So with simple calculations it applies them to each variable. Therefore it generates multiple columns. Wrong and very confusing.

Knowing the problem I can easily generate it with other data frames. For example with the survey data, just rename the yield to survey (the name of the sheet!) and that gives crazy results (or an error, if the calculation cannot be done on some variables.)

Patowhiz commented 2 months ago

@rdstern when the same variable name is assigned to different items, the last assigned item overwrites the previous values of the variable. This is what is happening behind the scenes when using the calculator dialog.

To address this, we could implement a warning for when a column name is similar to the data sheet name. This warning could be triggered during the data import process.

rdstern commented 2 months ago

@N-thony an initial change to the R code generated with the calculator, is that it can be used, when there is no data book open.

image

But it gives 2 errors for the 2 extra commands attach and detach, that are for the data frames. Could they only be produced when there is a data frame. So otherwise it just sends the actual R command - both for Ok and To Script?

I hope this is easy?

On the dialog the Save Result (soon to become Store Result,) should also be disabled.