bedatadriven / activityinfo-R

ActivityInfo R Language Client
https://www.activityinfo.org/support/docs/R/
17 stars 12 forks source link

Tidy-compatible record fetching #48

Closed Ryo-N7 closed 1 year ago

Ryo-N7 commented 1 year ago

I get requests to grab table from multiple tables and/or multiple project databases at once. Basically stuff that would be really tedious to do manually (dragging-dropping all the specific columns in the UI, switching between tables and projects, grabbing the correct reference fields, etc.).

So I have my colleagues send me a CSV/Excel of what projects, tables, and fields are needed as columns. See example below (I left-joined together the database ID and form IDs with R separately):

this

So basically I want to programmatically iterate over each row so that I'm grabbing the specific fields for a specific table in a specific project, then the next, then the next, etc.

So I created a function to grab the columns I need, transform it into the 'named list' structure needed for the columns argument in queryTable():

farmjobs-query-fields

but I get this back:

farmjobs-query-res

Only the Survey.Date field is returned with proper data because that's the only field that is native to the form I'm querying.

So looking at this, it means that I have to deliberately go search inside each form-schema and find the correct field ID which adds another layer of complexity and makes something like this where I'm simply trying to grab data from a given set of "field names" and "table IDs" a lot harder than it should be.

What I'm given: whatimgiven

What I actually need, which would require even more steps instead of being able to simply match by field name.

whatiactuallyneed

The User Interface 'export via R' option gives me the proper field codes as shown above but I can't extract that via R.

In reality, I can't even find the above "cb1ee1b8.c288fa67" in this form's schema as it only shows me "cb1ee1b8" for the fieldId. Which means I'll have to again, jump another level to search the parent form and look up the field using only the field label and grab the field ID of the referenced field.

howamisupposetofindthis

I shouldn't need any more than this the first screenshot in this email to grab the fields I need from multiple tables in multiple projects but as I explained above its not enough and that's a huge bottleneck for me. I can't ask my colleague to provide me with the reference form IDs and reference field ID because at that point they might as well just manually download all of the forms themselves using the user interface.

So what should be a simple loop over queryTable()​ per 'table_id' selecting columns in 'Field.Names' turns into a complicated process of jumping through 2 different form schemas (the child form schema and then the parent form schema, matching the field by label) then selecting using that (or so I imagine the process would go).

Is there no way to make this process easier?

Or, as I imagine this won't be a simple fix, if we could instead fix the issue with large forms with many reference fields getting status 500 errors when trying to extract them from the API with queryTable(), then I can just grab everything (queryTable() with no 'columns' argument) and dplyr::select() afterwards instead.

akbertram commented 1 year ago

queryTable() should have the option to provide the same projection of the form as the user interface.

akbertram commented 1 year ago

Another option would be to provide a DSL in R to make it easier to write these kinds of queries for custom projections. For example:


referenceWithKeyFields <- function(fieldLabel) {
  function(schema) {
    i <- which(sapply(schema$elements,function(e) e$label) == fieldLabel)
    field <- schema$elements[[i]]
    referencedFormId <- field$typeParameters$range
    referencedSchema <- getFormSchema(referencedFormId)
    keys <- which(sapply(schema$elements,function(e) e$label) == fieldLabel)
    // TODO
  }
}

queryTable("cffcflkl1opdbtk2", columns = byLabel("Annual income"), referenceWithKeyFields("Territory"), referenceWithAllFields("Municipality"), referenceWithId("Territory"))
akbertram commented 1 year ago

Short term fix: See #58 Discussion to be continued.

akbertram commented 1 year ago

Notes from our discussion: @Ryo-N7 @nickdickinson

getRecords()

Which columns?

Everything, with knobs/options

How to handle references?

referencedIds = TRUE/FALSE referencedKeys = TRUE/FALSE

Naming of columns

columnNames = c("labels", # same as in table view "codes" # use field codes, fall back to labels if code is missing? "ids") # use field ids

checkNames = TRUE/FALSE # should the result of above be mangled to be correct R syntax enforceUnique = TRUE/FALSE

Same as UI:

allFields(referencedIds = FALSE, referencedKeys = TRUE, naming = "labels")

ui() # Maybe everything from the UI + record ids

Only fields/columns I ask for...

columns = c(a = "_id", b = "province", c = "province.name")

GeoPoints

1- make interaction with sf or other packages straightforward 2- easy to write the data frame to csv with lat/lng

another knob that can control the format

attachments: later? only as tibble/vector?

Which records?

Filter argument in queryTable(), but difficult to write activityinfo formula mixed with R code

for(province in c("Nord Kivu", "Sud Kivu")) {

Ability to specify formulas using R syntax and (some) R functions.

Need a way to distinguish between symbols that refer to ActivityInfo form

fields and symbols in the current environment.

getRecords(formId, filter = province == !province) }

quosure

select(df, Province %in% p)

Ryo-N7 commented 1 year ago

quick thought:

wondering if we should rename getRecord() to getSingleRecord() as the new version of queryTable() becoming getRecords() makes them very similar and auto-complete will be annoying?

nickdickinson commented 1 year ago

That makes sense but if we also implement an attachment vector for the attachment columns, will we need the getRecord() call?

I guess you would have to deprecate getRecord and favor getSingleRecord so that it doesn't immediately break existing scripts.

Ryo-N7 commented 1 year ago

i mean people could potentially be using getRecord() for reasons beside the attachment use-case

--- but yeah deprecation makes sense

nickdickinson commented 1 year ago

@Ryo-N7 First version is now ready if you want to try it. Recompile thought from v4.33:

Image

For your purposes, to get the UI columns and record IDs, etc, you would do the following: fmTree <- setStyle(fmTree, allReferenceFields = TRUE, columnNames = "ui", referencedId = TRUE, referencedKey = TRUE, recordId = TRUE, lastEditedTime = TRUE)

I will make a function to set a default style (already possible as an option). Then it looks like this:

Image

getRecords(fmTree) %>% select(starts_with("CSO")) %>% head(2) %>% collect()

Ryo-N7 commented 1 year ago

i tried the latest v4.33, can't seem to select the proper columns like you've shown above with the dplyr verbs

getrecords-test

i can select the Active column with the first one contains("Activ") but i'm having trouble trying to grab a number of different columns in this table with ActivityInfo in them despite using contains("Activity")

projsch <- getFormTree(formId = "c8301e4022c")

# getRecords(form = "c8301e4022c") # form ID or form schema or form tree >> form ID NOT work right now

projtab <- getRecords(form = projsch) %>% select(contains("Activ")) %>% collect()
projtab <- getRecords(form = projsch) %>% select(contains("Activity")) %>% collect()
Ryo-N7 commented 1 year ago

on the other rhand, the set UI col names things appears to be working, which is nice!

grab-by-ui-cols

nickdickinson commented 1 year ago

Implement filter() and arrange() and send to @Ryo-N7 for testing later this week.

nickdickinson commented 1 year ago

Also add tests

Ryo-N7 commented 1 year ago

@nickdickinson any updates on the new version for me to test yet? i won't be able to test before our monday meeting at this rate

nickdickinson commented 1 year ago

@Ryo-N7 Apologies for the delay. I've uploaded the new version now.

Please see the following code from the tests where I show much of the functionality (and some expected limitation). I will still need to add documentation. Additionally, I'm not really sure how to get the sorting function to work with more than a single column. There is also a problem that doing select() or rename() after sorting and for select() after filtering will cause problems if the column used in previous steps is not retained. A warning is thrown but since I have no way (for now) of knowing which columns those are, the user will need to be careful not to do so.

The verbs actually implemented are:

filter() select() rename() rename_with() slice_head() slice_tail()

Verbs that will collect and then warn are:

group_by() summarise() mutate() slice_*() arrange()

We could implement arrange() but I don't know how to get the API to sort more than one column at the time and it will take a little effort. It would need to collect() almost right away if anyone wants to sort more than one.

For now, you can use addSort() as shown below as a workaround for now but it is not great because of the weird list format and the limitation of one column.

  testData <- tibble(`Identifier number` = as.character(1:500), "A single select column" = rep(factor(paste0(1:5, "_stuff")), 100), "A logical column" = ((1:500)%%7==(1:500)%%3), "A date column" = rep(seq(as.Date("2021-07-06"),as.Date("2021-07-25"),by = 1),25))

  schemaPackage <- createFormSchemaFromData(testData, database$databaseId, label = "getRecords() test form", keyColumns = "Identifier number", requiredColumns = "Identifier number")

  schemaView <- as.data.frame(schemaPackage$schema)

  uploadedForm <- addForm(schemaPackage$schema)

  importTable(formId = schemaPackage$schema$id, data = testData)

  rcrds <- getRecords(uploadedForm$id, style = prettyColumnStyle())

  dfA <- rcrds %>% 
    addFilter('[A logical column] == "True"') %>% 
    addSort(list(list(dir = "ASC", field = "_id"))) %>%
    slice_head(n = 10) %>% collect()

  dfB <- rcrds %>% 
    filter(`A logical column` == "True") %>% 
    addSort(list(list(dir = "ASC", field = "_id"))) %>%
    slice_head(n = 10) %>% collect()

  attr(dfA, "remoteRecords") <- NULL
  attr(dfB, "remoteRecords") <- NULL

  testthat::expect_identical(dfA,dfB)

  testthat::expect_warning({
    dfC <- rcrds %>% 
      filter(`A logical column` == "True") %>% 
      arrange(`_id`) %>%
      slice_head(n = 10) %>% collect()
  })

  attr(dfC, "remoteRecords") <- NULL

  attr(dfB, "rows") <- NULL
  attr(dfC, "rows") <- NULL

  testthat::expect_identical(dfB,dfC)

  # removing columns required for a filter will result in an error
  # expect warning using select after filter or sort
  testthat::expect_error({
    testthat::expect_warning({
      recordIds <- rcrds %>% 
        addFilter('[A logical column] == "True"') %>% 
        addSort(list(list(dir = "DESC", field = "A date column"))) %>%
        head(n = 10) %>%
        select(`_id`) %>%
        collect() %>%
        pull("_id")
    })
  })

  testthat::expect_no_warning({
    rcrds %>% 
      select(id = `_id`, date = `A date column`, logical = `A logical column`) %>%
      filter(logical == "True") %>% 
      rename(logical2 = logical) %>%
      collect()
  })

  # renaming columns required for a sort will result in an error
  testthat::expect_error({
    testthat::expect_warning({
      rcrds %>% 
        select(id = `_id`, date = `A date column`, logical = `A logical column`) %>%
        filter(logical == "True") %>% rename(date2 = date) %>% 
        addSort(list(list(dir = "DESC", field = "date"))) %>% 
        head(10) %>% 
        collect()
    })
  })

  rcrds %>% select(id = `Identifier number`) %>% copySchema(databaseId = "dbid", label = "new form")

  # no form schema elements to provide - expect warning
  testthat::expect_warning({
    rcrds %>% select(starts_with("_")) %>% copySchema(databaseId = "dbid", label = "new form")
  })

  # Ok to rename columns after only a filter step
  recordIds <- rcrds %>% 
    select(id = `_id`, date = `A date column`, logical = `A logical column`) %>%
    filter(logical == "True") %>% rename(logical2 = logical) %>% 
    addSort(list(list(dir = "DESC", field = "date"))) %>% 
    head(10) %>% 
    collect() %>%
    pull("id")

  testthat::expect_equal(length(recordIds), 10)

  rcrds %>% rename_with(function(x) {paste0(x," haha")}, starts_with("_")) %>% collect()

To create a form schema from a remote records object do the following (select relevant columns and then copy schema):

rcrds %>% select(id = `Identifier number`) %>% copySchema(databaseId = "dbid", label = "new form")
nickdickinson commented 1 year ago

@akbertram Can you help me to get the sorting to work correctly?

  1. Is it possible to sort across multiple columns? I tried to add to the list of sorts but it seems to ignore additional sorts...
  2. When column names are pretty columns like the UI and have spaces and special symbol, the sort object seems to work like this:
    rcrds %>% addSort(list(list(dir = "DESC", field = "Identifier number")))

    But in the web interface, I see it appending "sort$" to the column name (which happens to be the form field id in the UI. But I cannot use the format "sort$Identifier number" presumably because it has a space in it. I don't think the "sort$[Identifier number]" works either.

It also does not seem to work with the form field ID if I don't explicitly name the column with the form field ids.

Is there a reason to add "sort$"?

Ryo-N7 commented 1 year ago

hey @nickdickinson

getting some regressions when using the new columnStyle() functions to grab the columns exactly as what's in the UI labels:

colnames-UI-act4 33

image

compared with the successful implementation from last week/2 weeks ago, check below comment hyperlinked:

https://github.com/bedatadriven/activityinfo-R/issues/48#issuecomment-1463939171


EDIT:

or hmm,,, am I supposed to use prettyColumnStyle() now??

EDIT EDIT : i put a more detailed comment with screenshots over in https://github.com/bedatadriven/activityinfo-R/issues/58 as that's more relevant for this particular problem

nickdickinson commented 1 year ago

Yes, prettyColumnStyle():

getRecords(uploadedForm$id, style = prettyColumnStyle())

Alternatively, you should be able to do allColumnStyle(columnNames = "pretty").

I will work on the documentation some more examples and see if I can reproduce the problems in #58

Ryo-N7 commented 1 year ago

the output object from getRecords() (as in running that without collect()) gives me a list object that gives me weird errors and is also an object that can't be View()-ed nor inspected as a usual R object.

I can use str() to look at it but it doesn't give me helpful info. So if I want to first explore what columns and rows there are I have to use collect() to download the entire thing, for a large form that is a bit of wasted effort/RAM space.

Is there anyway we could use something like glimpse() on a getRecords() object to look at the schema or get some summary of the data set somehow?

Otherwise I'd have to collect() all to inspect or have the form in ActivityInfo open in another window or something

Not quite familiar with the dbplyr object to know how feasible it is or not

rec-obj

view-rcrds

str-rcrds

EDIT: hmm nevermind, i was able to glimpse() only a getRecords() object without collect() and it worked today when it didn't before, odd.

sorry!!

Ryo-N7 commented 1 year ago

yeah, it's fine >> glimpse-rcrds

still not sure what the Error in names[[i]] : subscript out of bounds error is though

Ryo-N7 commented 1 year ago

not nearly as familiar with dbplyr but i assume that one would still be able to check out the object following some tidy-steps without running collect()?

below i ran some of the test code, without the collect() part and got some grouping error:

rcrd-tidy-fail

perhaps it's just my lack of knowledge with dbplyr syntax so to check my work//the various steps i've done i would have to use collect() every single time?