Crunch-io / scrunch

Pythonic scripting library for cleaning data in Crunch
GNU Lesser General Public License v3.0
5 stars 7 forks source link

feature to use panda dataframes to update Crunch datasets #384

Closed SimonaCiumag closed 3 years ago

SimonaCiumag commented 4 years ago

Hi,

In R we have the possibility of downloading locally variables in dataframes, processing the data, and synchronizing back the Crunch dataset, with the corrected data. In scrunch we can't use panda dataframes with Crunch datasets to do bulk of corrections for the variables, therefore we rely on replace_values function, and sometimes (most of the times) this will have an impact on the server.

We have a project (in fact 3 similar projects) in Operations, where we have to use replace_values in order to correct the data in some grids. What we do: we add a new category and we fill in that category for each subvariable of the grid, with data based on some conditions.:

This is how it is done using scrunch: _myvar = 'buzz', for , subvar in left_ds[my_var]:, left_ds[my_var][subvar['alias']].replacevalues(value=4, filter='(missing({}) or not valid({})) and aided{} not in [32767]'.format(subvar['alias'], subvar['alias'], int(subvar['alias'].replace('{}_'.format(my_var), ''))))

This is how I did it using dataframes in R: _list_questions <- c('buzz','imp','qual','value','rep','sat','rec') for (q in list_questions){ categories(ds[[q]]) <- c(categories(ds[[q]]), Category(name='Rebase', id=4, numericvalue=NULL, missing=FALSE)) }

############set to Rebase variable buzz df <- data.frame(buzz=as.vector(ds$buzz), aided=as.vector(ds$aided)) list_subvars <- aliases(subvariables(ds$buzz)) listsubvars <- gsub("buzz","",list_subvars,fixed = TRUE)

for (q in listsubvars){ df[[paste0("buzz.buzz",q)]][df[[paste0("aided.aided",q)]] %in% c("selected","not selected") & is.na(df[[paste0("buzz.buzz",q)]])] <- "Rebase" }

update the variable in ds

for (q in listsubvars){ ds$buzz[[paste0("buzz",q)]][1:nrow(ds)] <- df[[paste0("buzz.buzz",q)]] }

These projects created issues on the server at the begging of the month, because everything was slow down making a negative impact on the server. I have attached the R script I ran as a test with dataframe...it took aprox 20 minutes to run.

Can we investigate if we can add this feature in scrunch as well?

As a mention, this is my first ticket in Github, I hope it's ok :) Also tagging @xbito , @jamesrkg

Test update vars for BA using df.docx

jamesrkg commented 4 years ago

From previous conversations @SimonaCiumag and I have had on this topic, I believe what we're hoping to achieve here is the performance and flexibility offered by R. In essense what R lets us do is to take a whole column as an R vector, then do whatever we need to that vector, and then put the entire vector back into the same column in Crunch, overwriting its values. Importantly, R lets a user do this whether they are only changing some small part of the vector or all of it. This means that the manipulation of the vector in its entirety is handled locally by R and the entire vector is sent back to replace the column in Crunch. This is not possible with Scrunch where the only similar process available to this is with create_variable - the only time an entire columns-worth of values is allowed to be handed over to Crunch.

We already know Crunch is column-wise and the way R does this is similarly column-wise. replace_values, however, with its reliance on a filter and a single value per-call, is effectively row-wise.

jamesrkg commented 4 years ago

Just to add further, one question here might be "why don't you delete the variable and re-create it with create_variable?", however this denies the numerous situations where deleting a variable is not an option due to asset dependencies.

jamesrkg commented 4 years ago

@jjdelc would appreciate your eyes on this one as well if possible!

rdrivers commented 4 years ago

Say more about what you are trying to do. Exporting to python doesn’t allow auto updating, so that’s not a direction we want to go

Sent from my iPhone

On Jul 20, 2020, at 7:40 PM, James Griffiths notifications@github.com wrote:



@jjdelchttps://github.com/jjdelc would appreciate your eyes on this one as well if possible!

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/Crunch-io/scrunch/issues/384#issuecomment-661584060, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AAPHFSTH7B73ZRUMOLIDI3TR4T5Y7ANCNFSM4OS3UW2Q.

jamesrkg commented 4 years ago

We're after a better way to update all of the values in a column in one step. This allows users to make any/all of the edits they need across the vector without having to send each subset of those edits as a discrete call to the API (via replace_values).

The use case of needing to edit values in-place in a dataset is crucial, even when all effort is made to use derivations where possible.

rdrivers commented 4 years ago

I don’t understand the use case. Could you give a concrete example?

Sent from my iPhone

On Jul 20, 2020, at 8:57 PM, James Griffiths notifications@github.com wrote:



We're after a better way to update all of the values in a column in one step. This allows users to make any/all of the edits they need across the vector without having to send each subset of those edits as a discrete call to the API (via replace_values).

The use case of needing to edit values in-place in a dataset is crucial, even when all effort is made to use derivations where possible.

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://github.com/Crunch-io/scrunch/issues/384#issuecomment-661617291, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AAPHFSXGVZP5P7CVYJZ45FLR4UG2JANCNFSM4OS3UW2Q.

jamesrkg commented 4 years ago

@rdrivers we can share some examples with you offline.

jjdelc commented 4 years ago

I talked with mike about this. Seemed that the use case was to make a categorical taking values from other categoricals. something we recently implemented on the backend, the fill function. I made a PR to add support on Scrunch for it: https://github.com/Crunch-io/scrunch/pull/386

But this creates a new derived variable, it does not overwrite the existing data.

About the replace_values limitations, they are done in that way to avoid having to upload large amounts of column data back to the server. I'll check what abstraction is R using, but it must be doing some abstraction over similar API calls.

The idea is to abstract the manipulations into specific transformations or operations so those commands can be done to the server and avoid having to receive arbitrary data overwrites.

rdrivers commented 4 years ago

Yes, intentionally not allowing over-writing. If it’s really important to resude the same name, the recommended workflow would be to rename the variable and then derive the transformed variable with the name of the old variable.

We still have some other use cases to deal with aligning tracking surveys, but the goal should be to support the needed actions, not to allow ad hoc over-writing, which will break dataset updates, is unsafe, and error-prone.

On Jul 22, 2020, at 10:43 AM, Jj notifications@github.com<mailto:notifications@github.com> wrote:

I talked with mike about this. Seemed that the use case was to make a categorical taking values from other categoricals. something we recently implemented on the backend, the fill function. I made a PR to add support on Scrunch for it:

386https://github.com/Crunch-io/scrunch/pull/386

But this creates a new derived variable, it does not overwrite the existing data.

About the replace_values limitations, they are done in that way to avoid having to upload large amounts of column data back to the server. I'll check what abstraction is R using, but it must be doing some abstraction over similar API calls.

The idea is to abstract the manipulations into specific transformations or operations so those commands can be done to the server and avoid having to receive arbitrary data overwrites.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/Crunch-io/scrunch/issues/384#issuecomment-662592062, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AAPHFSXJCPGIQBVR2N63XJ3R44QMJANCNFSM4OS3UW2Q.

jamesrkg commented 3 years ago

This has been addressed via new backfill.