REditorSupport / vscode-R

R Extension for Visual Studio Code
https://marketplace.visualstudio.com/items?itemName=REditorSupport.r
MIT License

Substantial lag after submitting to radian when reading feather or parquet files #1127

Open · Fred-Wu opened this issue 2 years ago

Fred-Wu commented 2 years ago

I am on an M1 MacBook with the latest R, VS Code, and vscode-R.

There is a lag of about 40 seconds after submitting code to radian when importing feather and parquet files, but no lag when importing CSV files.

After playing around with the settings, the issue appears to be Workspace Viewer: Show Object Size. If this is disabled, there is no lag at all.

renkun-ken commented 2 years ago

Calculating object size is quite expensive if the object is recursive or contains character vectors. That is why r.workspaceViewer.showObjectSize is disabled by default. It looks like we should mention this in the setting description.
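For anyone who wants to see the cost directly, here is a minimal sketch (not the extension's internal code) that times a size calculation on a recursive object full of character vectors:

# Sketch only: time a size calculation on a large recursive object
# containing many character vectors.
big <- replicate(1e5, sample(letters, 20, replace = TRUE), simplify = FALSE)
system.time(object.size(big))
# Even a modest cost here is paid again after every top-level evaluation,
# because the workspace viewer refreshes after each one.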

gowerc commented 2 years ago

@Fred-Wu, @renkun-ken

Just to say that I am also experiencing significant lag (~4 seconds) when submitting commands after reading in about 2 GB of parquet files, and I don't have showObjectSize enabled.

renkun-ken commented 2 years ago

How many columns do those parquet files have? Is it possible to create a minimal reproducible example?

gowerc commented 2 years ago

@renkun-ken, OK, I have a reproducible example. If you can give me an email address, I can share the data with you. In summary, there are 2 datasets:

1) 17,099,223 × 40
2) 62,981,876 × 18

Though I only read the following subsets into memory:

1) 9,168,371 × 7
2) 30,304,830 × 8

The data is partitioned across something like 500 files, and I use the arrow library to read them all in parallel. A simplified version of the code I am using, which reproduces the issue, is:

library(dplyr)
library(arrow)

BOARDS <- c(3, 4, 13)
MAPS <- 9:174
CIVS <- 1:43

match_files <- list.files(
    path = "./data/source/matches",
    pattern = "*.parquet",
    full.names = TRUE
)

player_files <- list.files(
    path = "./data/source/players",
    pattern = "*.parquet",
    full.names = TRUE
)

matches_slim_all <- open_dataset(match_files) |>
    filter(leaderboard_id %in% local(BOARDS)) |>
    select(match_id, started, match_uuid, version, leaderboard_id, finished, map_type)

matches_slim <- matches_slim_all |>
    filter(map_type %in% local(MAPS)) |>
    filter(!(is.na(started) | is.na(finished))) |>
    filter(finished - started < (180 * 60)) |>
    arrange(started) |>
    collect()

players_slim_all <- open_dataset(player_files) |>
    select(match_id, rating, civ, won, slot, profile_id, team, color) |>
    filter(match_id %in% local(matches_slim$match_id))

x <- 1

Performance gets a little worse after each block but then falls off a cliff after this line:

 filter(match_id %in% local(matches_slim$match_id))

For reference, running this code by hand in a plain terminal shows no performance degradation.
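One detail that may matter (an assumption on my part, not something I have verified inside the extension): that last filter embeds every match_id from matches_slim, roughly 9.2 million values, into the lazy query object, so anything that inspects the resulting object has far more to walk through. A quick sketch to check the size of that embedded vector:

# Sketch: how much data the `%in% local(...)` filter embeds in the query.
length(matches_slim$match_id)                         # roughly 9.2 million IDs
print(object.size(matches_slim$match_id), units = "MB")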

renkun-ken commented 2 years ago

Thanks for the details, @gowerc. Please email the data to renkun@outlook.com if possible. I'll take a closer look at this.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 365 days with no activity.

gowerc commented 1 year ago

not-stale

thomascwells commented 1 year ago

I believe I'm also experiencing this issue (or a related one). I'm working with parquet tables in arrow. If I set r.session.watchGlobalEnvironment to false, the issue goes away, but I would prefer not to do that!

thomascwells commented 1 year ago

I should note, in my case, I have object size disabled but still get the issue. Relevant settings:

        "r.session.emulateRStudioAPI": false,
        "r.removeLeadingComments": true,
        "r.helpPanel.cacheIndexFiles": "Workspace",
        "r.sessionWatcher": true,
        "r.session.watchGlobalEnvironment": true,
        "r.session.objectTimeout": 5,
        "r.useRenvLibPath": true,
        "r.workspaceViewer.showObjectSize": false,

And even with showObjectSize disabled, the number of rows and columns of these arrow queries is still shown in the workspace viewer. It appears the showObjectSize setting is ignored for these types of queries.

(screenshot: the workspace viewer showing row and column counts for the arrow query objects)

Are there additional things I can do to help debug?

thomascwells commented 1 year ago

One more thought after a bit of digging: I wonder if adding some of these "known unsafe" conditions from RStudio's environment functions to this extension's inspect_env function might fix this issue.

I'm not sure how to go about doing that, but maybe it means adding checks beyond just the active-binding and is-promise ones?
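Purely as an illustration of what I mean (the function name and class list below are my own guesses, not anything that exists in vscode-R or RStudio), such a check might look like:

# Hypothetical sketch of a class-based guard, for illustration only.
is_unsafe_to_inspect <- function(obj) {
  inherits(obj, c(
    "ArrowObject",        # arrow's R6 wrappers around external pointers
    "arrow_dplyr_query",  # lazy arrow queries built with dplyr verbs
    "tbl_lazy"            # dbplyr lazy tables
  ))
}
# inspect_env could then fall back to a cheap summary (class and names only)
# whenever this returns TRUE, instead of calling dim() or computing sizes.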

thomascwells commented 1 year ago

Apologies for the multiple notifications. I intended to stop digging into this, but kept going.

I now believe the source of the slowness is the multiple calls to dim() at this line.

Calling dim() on arrow objects can be very slow. A quick improvement would be to call it only once, like so:

obj_dim <- dim(obj)
if (!is.null(obj_dim)) {
  info$dim <- obj_dim
}
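To put a number on it, one could time dim() directly on a lazy arrow query. This is only a sketch, with a placeholder path and column taken from the example earlier in this thread:

library(arrow)
library(dplyr)

# Placeholder path; any large partitioned parquet dataset should show the effect.
query <- open_dataset("./data/source/players") |>
  filter(!is.na(profile_id))

# As reported above, getting the dimensions of a lazy arrow query like this
# can take seconds, and the workspace viewer triggers it after every
# top-level evaluation.
system.time(dim(query))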

A better fix might be to give users the option via a setting to turn off the dim() calculation or provide some logic to skip calls to dim() for certain types of objects.

I can see that in RStudio, the "dimensions" displayed for arrow objects are often just the number of columns. This aligns with what is done for dbplyr lazy tbls. However, I can't see how/where RStudio chooses to show only the column count instead of both for arrow objects.
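For comparison, dbplyr lazy tables report NA for the row count from dim() without touching the database, which is presumably why only the column count can be shown there. A minimal sketch, assuming dbplyr and RSQLite are installed:

library(dplyr)

# In-memory SQLite table, used only for illustration.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "mtcars", mtcars)

lazy_tbl <- tbl(con, "mtcars")
dim(lazy_tbl)   # NA 11 -- the row count is unknown until the query is collected

DBI::dbDisconnect(con)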