Technomics / readflexfile

An R package to read the Flex File data format.
GNU General Public License v3.0

Update how users handle a flexfile or list of flexfiles #15

Closed · ahjames11 closed 3 years ago

ahjames11 commented 3 years ago

Currently the whole system is structured around a list of FlexFiles. In the single-file case, this is annoying. I don't think you should have to do the following even for simple use cases:

read_ff(file) %>%
  add_id_col(var = "doc_id")

I have started to add an S3 flexfile class. Then we have the functions is_flexfile() and is_flexfile_list(). I suggest looking into using these and then doing the respective legwork. So, for example:

Current

# single file
read_ff(file) %>%
  add_id_col(var = "doc_id") %>%
  flatten_ff()

# multiple files
read_folder(files, read_ff) %>%
  listindex_to_col(var = "doc_id") %>%
  stack_ff() %>%
  flatten_ff()

Proposed

# single file
read_ff(file) %>%
  flatten_ff()

# multiple files
# can use lapply() instead of purrr::map(), of course
read_folder(files, read_ff) %>%
  map(flatten_ff) %>%
  bind_rows(.id = "doc_id")

# or with allocations (explicit here; this should be handled by a default argument in flatten_ff)
read_folder(files, read_ff) %>%
  map(allocate_ff) %>%
  map(flatten_ff) %>%
  bind_rows(.id = "doc_id")

But really you could use the S3 classes to handle it for the user, so the single-file code would just work.

flatten_ff.flexfile <- function(x) {
  # ... flatten code ...
}

flatten_ff.default <- function(x) {
  if (is_flexfile_list(x)) lapply(x, flatten_ff) else x
}
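
For completeness, here is a minimal sketch of the tagging and predicates I have in mind (new_flexfile() is just a placeholder constructor name):

new_flexfile <- function(x) structure(x, class = c("flexfile", class(x)))  # placeholder name
is_flexfile <- function(x) inherits(x, "flexfile")
is_flexfile_list <- function(x) is.list(x) && all(vapply(x, is_flexfile, logical(1)))

# with that dispatch in place, both shapes use the same call:
read_ff(file) %>% flatten_ff()                # single flexfile
read_folder(files, read_ff) %>% flatten_ff()  # list of flexfiles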

Does this make sense? How would this impact current users? Would we have to provide some type of warning message in the code?

With this, you could basically get rid of stack_ff() (which is just a wrapper around costmisc::unnest_df() anyway).

daniel-r-germony commented 3 years ago

I do like your proposed revisions, particularly with this minor edit we talked about before:

# or with allocations (explicitly, should be handled by default argument in flatten_ff)
read_folder(files, read_ff) %>%
  map(flatten_ff, allocate = TRUE) %>%  # 'allocate = TRUE' could be dropped if TRUE were the default
  bind_rows(.id = "doc_id")

However, if you are going to open that up, conceptually a read_ff() that returns only one submission found at the provided path is, to me, the same as one returning 100 files found at that path. I think they should always get the "doc_id" since it is good SOP. Put another way, I always forget the add_id_col(var = "doc_id") and wish it were done automatically (or at the very least suppressed only when add_doc_id = FALSE is passed into read_ff()).

Why not something like:

# single file
# the new 'path' argument could be either a file or a folder of files
read_ff(path, allocate = TRUE, add_doc_id = TRUE) %>% 
  flatten_ff()

# multiple files (note: the exact same call)
read_ff(path, allocate = TRUE, add_doc_id = TRUE) %>% 
  flatten_ff()
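
Roughly, the dispatch could look like this sketch (read_single_ff() is a hypothetical stand-in for the current single-file logic):

read_ff <- function(path, allocate = TRUE, add_doc_id = TRUE) {
  # a folder expands to the files inside it; a single file is read as-is
  files <- if (dir.exists(path)) list.files(path, full.names = TRUE) else path
  out <- lapply(files, read_single_ff)
  if (allocate) out <- lapply(out, allocate_ff)
  if (add_doc_id) names(out) <- seq_along(files)  # picked up later by bind_rows(.id = "doc_id")
  out
}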

I think the S3 class is a no-brainer and lays the groundwork for lots of other improvements in the future. I think the use of a map() function actually introduces another concept that novice users won't understand, so abstracting it out of the stack completely is preferred, at least if the goal is to serve the lowest common denominator.

An initial implementation of a FlexFile S3 class can be found here. I'm sure I did something wrong and/or did not do something I should have, so if nothing else... it might save some copy/pasting or typing out of the FFS/DEI.

ahjames11 commented 3 years ago

I am not sure I agree with the doc_id being put in there, for a couple of reasons.

  1. It does not feel like idiomatic R to me. It is natural in a SQL database to need a primary key, but in R, the list structure is what groups the files together. I just think it would be inconsistent with what an experienced R user would expect.
  2. What do you do when two files are read in separately? They would both get doc_id = 1 (see the sketch after this list). I see this as a very likely use case. What if they are both named flexfile.json? You can't even use the file name as the key then. Under your suggestion, you would flip add_doc_id = FALSE and then add the id later, but I actually think the default behavior should be reversed. I think it is more than likely that people will have two files in separate folders, read them in separately, and then want to combine them.
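
To make the collision concrete (the paths here are hypothetical):

# with an automatic id, each separate read numbers from 1:
a <- read_ff("proj_a/flexfile.json")  # doc_id = 1
b <- read_ff("proj_b/flexfile.json")  # doc_id = 1 again
bind_rows(flatten_ff(a), flatten_ff(b))  # two different submissions now share doc_id = 1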

> However, if you are going to open that up, conceptually a read_ff() that returns only one submission found at the provided path is, to me, the same as one returning 100 files found at that path.

On that, I think that is backwards from R thinking. One data.frame is very different from 100 data.frames. One data.frame exists as its own object, while 100 data.frames exist as either 100 objects or as 100 elements of a single list. I think that is what I was going for. So I want to be able to use mtcars as-is, but then I have list(mtcars, iris). I don't want to have to force list(mtcars) in the singular case. Happy to discuss the pros and cons, though.

I see the link on S3, and I think we might have duplicated a bit! We have the file spec and the enumeration tables actually stored and accessible in this package. They are stored in Excel files and saved as R objects in the data folder. Basically, you set up the prototype for the list structure.

Check out readflexfile::flexfile_enum, readflexfile::flexfile_spec, and readflexfile::quantity_spec. That structure is common in our costverse. We can use check_spec to check a list of data frames against one of those specs and, optionally, report what is missing.
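
For example, something along these lines (the exact check_spec() argument shape here is a sketch, not the documented interface):

ff_tables <- read_ff(file)
# compare the list of data frames against the stored spec; report anything missing
check_spec(ff_tables, readflexfile::flexfile_spec)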

With that, here is the start of my flexfile class. Basically it would just tag the structure on read.
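
Tagging on read could be as simple as this sketch (reusing the new_flexfile() placeholder from above; the parsing line is a stand-in for the real logic):

read_ff <- function(file) {
  tables <- jsonlite::read_json(file)  # placeholder for the real parsing
  new_flexfile(tables)                 # tag the list structure on read
}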

Finally, I am back and forth on whether read_ff() should be able to read a folder path. I guess it would avoid needing read_folder()? It would be easy enough to do, but it seems like inconsistent behavior to me, and something we would have to keep mirroring with other data sources. None of the other read_* functions that we commonly use do this, so it again feels unnatural. There is danger in trying to make it too smart.
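
For contrast, the current composition stays uniform across sources (read_other() is a hypothetical stand-in for any other reader):

read_folder(path, read_ff)     # FlexFiles
read_folder(path, read_other)  # any other data source, same shape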