Fred-Wu opened 2 years ago
Calculating object size is quite expensive if the object is recursive or contains character vectors, which is why r.workspaceViewer.showObjectSize is disabled by default. We should mention this in the setting description.
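For illustration (an editor's sketch, not from the thread): object.size() must walk every element of a character vector, so its cost grows with the vector's length, whereas a metadata-only query such as length() is constant-time.

# Illustrative sketch, base R only: sizing a large character vector is slow
x <- format(runif(5e6))             # ~5 million distinct strings
system.time(print(object.size(x)))  # cost scales with the number of strings
system.time(length(x))              # effectively instant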
@Fred-Wu, @renkun-ken
Just to say that I am also experiencing significant lag (~4 seconds) when submitting commands after reading in 2 GB of parquet files, and I don't have showObjectSize enabled.
How many columns do those parquet files have? Is it possible to create a minimal reproducible example?
@renkun-ken, OK, I have a reproducible example. If you can give me an email address I can share the data with you. In summary, there are 2 datasets:
1) 17,099,223 × 40
2) 62,981,876 × 18
though I only read into memory:
1) 9,168,371 × 7
2) 30,304,830 × 8
The data is partitioned across something like 500 files, and I use the arrow library to read them all in parallel. A simplified version of the code I am using, which reproduces the issue, is:
library(dplyr)
library(arrow)

BOARDS <- c(3, 4, 13)
MAPS <- 9:174
CIVS <- 1:43

match_files <- list.files(
  path = "./data/source/matches",
  pattern = "\\.parquet$",
  full.names = TRUE
)
player_files <- list.files(
  path = "./data/source/players",
  pattern = "\\.parquet$",
  full.names = TRUE
)

# Lazy scan over the partitioned parquet files; nothing is read yet
matches_slim_all <- open_dataset(match_files) |>
  filter(leaderboard_id %in% local(BOARDS)) |>
  select(match_id, started, match_uuid, version, leaderboard_id, finished, map_type)

# First collect(): materialises ~9.2 million rows into memory
matches_slim <- matches_slim_all |>
  filter(map_type %in% local(MAPS)) |>
  filter(!(is.na(started) | is.na(finished))) |>
  filter(finished - started < (180 * 60)) |>
  arrange(started) |>
  collect()

# Lazy query whose filter references the full match_id vector
players_slim_all <- open_dataset(player_files) |>
  select(match_id, rating, civ, won, slot, profile_id, team, color) |>
  filter(match_id %in% local(matches_slim$match_id))

x <- 1
Performance gets a little worse after each block, but then falls off a cliff after this line:

filter(match_id %in% local(matches_slim$match_id))

For reference, running the same code by hand in a terminal R session shows no performance degradation at all.
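A hypothetical diagnostic, not from the original report: a plausible explanation is that the last filter() captures the full matches_slim$match_id vector inside the lazy query object, so anything that inspects that object after each command pays for it. One quick check:

# Hypothetical check: how heavy is the vector the lazy query now carries?
length(matches_slim$match_id)              # roughly 9.2 million ids
print(object.size(matches_slim$match_id))  # size of the vector captured by the filter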
Thanks for the details, @gowerc. Please email the data to renkun@outlook.com if possible. I'll take a closer look at this.
This issue is stale because it has been open for 365 days with no activity.
not-stale
I believe I'm also experiencing this issue (or a related one). I'm working with parquet tables in arrow. If I set r.session.watchGlobalEnvironment to false, the issue goes away, but I would prefer not to do that!
I should note, in my case, I have object size disabled but still get the issue. Relevant settings:
"r.session.emulateRStudioAPI": false,
"r.removeLeadingComments": true,
"r.helpPanel.cacheIndexFiles": "Workspace",
"r.sessionWatcher": true,
"r.session.watchGlobalEnvironment": true,
"r.session.objectTimeout": 5,
"r.useRenvLibPath": true,
"r.workspaceViewer.showObjectSize": false,
And even with showObjectSize disabled, the number of rows and columns of these arrow queries is still shown in the workspace viewer. It appears the showObjectSize setting is ignored for these types of queries.
Are there additional things I can do to help debug?
One more thought after a bit of digging: I wonder if adding some of these "known unsafe" conditions from RStudio's environment functions to the inspect_env function for this extension might fix this issue. I'm not sure how to go about doing that, but maybe it means adding checks beyond just active-binding and is-promise?
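A hedged sketch of what such a guard might look like; the helper name and the class list below are assumptions, not RStudio's or this extension's actual code:

# Hypothetical extra check for inspect_env(), alongside the existing
# active-binding and is-promise guards: skip lazy/remote objects whose
# introspection can trigger expensive computation.
is_unsafe_to_inspect <- function(obj) {
  inherits(obj, c("ArrowObject", "arrow_dplyr_query", "tbl_lazy"))
}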
Apologies for the multiple notifications. I intended to stop digging into this, but kept going.
I now believe the source of the slowness is the multiple calls to dim() at this line. Calling dim() on arrow objects can be very slow. A quick improvement would be to call it only once, like so:
obj_dim <- dim(obj)
if (!is.null(obj_dim)) {
  info$dim <- obj_dim
}
A better fix might be to give users the option, via a setting, to turn off the dim() calculation, or to provide some logic to skip calls to dim() for certain types of objects.
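A minimal sketch of the skip-by-class idea; the wrapper name and default class list are assumptions that would need testing:

# Hypothetical wrapper: return NULL instead of calling dim() on classes
# where dim() is known to be expensive (e.g. lazy arrow queries).
safe_dim <- function(obj, skip = c("arrow_dplyr_query", "ArrowTabular")) {
  if (inherits(obj, skip)) return(NULL)
  dim(obj)
}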
I can see that in RStudio, the "dimensions" displayed for arrow objects are often just the number of columns. This aligns with what is done for dbplyr lazy tbls. However, I can't see how or where RStudio chooses to show only the column count instead of both for arrow objects.
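For comparison, an editor's illustration (assuming the DBI, RSQLite, and dbplyr packages are installed): dim() on a dbplyr lazy tbl returns NA for the row count without touching the database, which is why showing only the column count is cheap there.

library(dplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "mtcars", mtcars)
lazy_tbl <- tbl(con, "mtcars") |> filter(cyl > 4)
dim(lazy_tbl)  # c(NA, 11): rows unknown until collect(), columns known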
I am on a MacBook M1 with the latest R, VS Code, and vscode-R. There is a 40-second lag after submitting code to radian when importing feather and parquet files, but no lag when importing csv files. After playing around with the settings, the issue is with Workspace Viewer: Show Object Size. If this is disabled, there is no lag at all.