Added custom functions: a) read_wb; b) get_breaks; c) to_fff.

bleonard314 commented 1 year ago

Please check: a) Documentation is properly in place; b) any new packages that I am using have the correct namespace references. Thanks!

jzadra commented 1 year ago

Thanks Ben. I've done a quick once over and added prefixes to some of the functions that didn't have them.

Can you send some example files this would be use for so that Katherine and I can use/test the functions?

Does get_breaks() need to be exported, or is it only meant to be used internally by to_fff()?

You reference "see column_header_rows" and "see column_group_count" and the like a few times in the help for to_fff(). Where would one see these?

bleonard314 commented 1 year ago

Jon,

I dumped "table_3.xlsx" in your transfer folder which is from a recent data extraction request. I just copy and pasted values to remove formatting, but that shouldn't actually impact the extraction.

Q:\For_Jonathan\From_Ben\soil_flatfile

To read, simply run:

dat <- read_wb(fpath = "Q:/For_Jonathan/From_Ben/soil_flatfile", fname = "table_3.xlsx", sheet = 1)

To extract, you could run the following code verbosely:

df <- to_fff(dat, my_vals = c(analyte = 1, units = 2), my_cols = c('result', 'qualifiers'), hrows = c(location_id = 2, sample_depth = 3, sample_date = 4, lab_id = 5), drows = get_breaks(dat, col_num = 2, rm_grps = 1), dcols = 3:198, gname = 'analyte_group', hdate = 'sample_date')

Or this, and the function will guess at the other params, but will probably get some stuff wrong:

df <- to_fff(dat, my_vals = 1:2, my_cols = 2, hrows = 2:5)

I think we should be exporting get_breaks(). It is used to guess where data groupings are using the first column by default if drows is not specified. However, in the verbose example above I provide the output of this function as an input to to_fff by specifying that I want to look for breaks in the second column (not the first). Since this definition is not always obvious and is going to be dependent on the workbook, I think the user should typically run each of the three steps separately.

kheal commented 1 year ago

@bleonard314 and @jzadra I'll test these with a recent data reformatting request. First glance they look good and documentation looks good too. Not sure I would call these "excel_custom_formatting" as their .R name suggests :), but definitely useful. I'd suggest changing the name of to_fff() to be something more descriptive and I would also suggest to add a quick vignette of their use.

bleonard314 commented 1 year ago

@bleonard314 and @jzadra I'll test these with a recent data reformatting request. First glance they look good and documentation looks good too. Not sure I would call these "excel_custom_formatting" as their .R name suggests :), but definitely useful. I'd suggest changing the name of to_fff() to be something more descriptive and I would also suggest to add a quick vignette of their use.

Excellent, Jon had started this branch for me so I could incorporate an Integral style table formatting function that is a bit more complex than this. I mainly wanted to push these functions to test my ability to contribute functions to the Integral package and figured that they might fit the bill of "excel_custom_formatting" since we are unformatting custom excel tables...However, I agree that it is confusing so feel free to rename the branch if you know how to do that kind of thing. I have no clue and probably don't have appropriate permissions.

For that task from this morning, here is a quick mock up of the params based on the email screenshot:


to_fff(
  dat,
  my_vals = c(method = 1, analyte = 2, units = 3),
  my_cols = 'result',
  hrows = c(lab_id = 9, sample_id = 10, sample_date = 11, date_received = 12, matrix = 13),
  drows = 14:22, # this might change from sheet to sheet, consider using get_breaks(dat, col_num = 1, rm_grps = 2)
  dcols = 4:5, # this also might change so you could try leaving it NULL to guess the end column
  hdate = 'sample_date' # looks like there are two dates in the header field, not sure if the function will take a character vector here
)

kheal commented 1 year ago

The branch name is no matter. I was talking more about function organization - these are in a .R file called "excel_custom_formatting.R" but that might be confusing for future folks/us if we need to debug these later.

kheal commented 1 year ago

Feedback on these functions.

1) read_wb: This is a great fxn that I have used in similar fashion so many times, it'll be great to have it as part of the package. Can we change name to read_all_wbs to indicate that you need to feed it a directory, not a single workbook? Alternatively, can we make it flexible to work with a single workbook path OR a directory of workbooks? Lastly, can we make it work on future_map() instead of map()? That way, if we need extra oof we can run it in parallel by just setting our plan before running the function (this is something I often need to do with this type of function).

2) read_wb: when we read in excels without file names, we get this ugly input: New names: • -> `...1` • -> ...2 • -> `...3` • -> ...4

adding .name_repair = "unique_quiet" to the read_excel line will quiet this message. NOTE THAT YOU WILL NEED TO REQUIRE AN UP TO DATE VERSION OF readxl and vctrs in the namespace to get this to work for all users since it's a new function. See this github issue: https://github.com/tidyverse/readxl/issues/580 .

3) to_fff(): Needs more descriptive name and a vignette to accompany it. I would call it un_xtab or xtab_to_tidy(). This will be hella helpful, thank you.

4) to_fff(): again with the new names ugliness as in 2) above. It's being derived from the plyr::bind_cols() call. Add .name_repair = "unique_quiet" and you should be good.

5) to_fff(): I don't think we should try to reformat dates here, I see this more as a restructuring to long format. Separate date parsing helpers might be nice, but those should occur after the restructuring. Alternatively we need to be more careful about what to do if the parsing of dates fails - right now you get NAs, but returning a character vector of the original is better than losing the data altogether.

6) to_fff(): specify joins by (final join at end) to avoid the message: Joining, by = "name"

See attached .R file for a slightly modified version of the read_wb() and to_fff() that I recently used - it incorporates some of these changes. Some reason I can't attach an .R file, so it's saved as a .txt. helper_fxns.txt

kheal commented 1 year ago

And resurfacing Jon's earlier comment:

You reference "see column_header_rows" and "see column_group_count" and the like a few times in the help for to_fff(). Where would one see these?

jzadra commented 1 year ago

I agree with everything Katherine has proposed, and had similar thoughts.

How versatile do you think this is? Are there lots of cases where the table3.xlsx format you shared with me are created by IDB output? As far as flexibility, I would consider 1) will this work as is with most things; 2) If there are going to be slightly different inputs that want the same end product, what changes/additions might be necessary to these functions, and will making them break code that uses the current function? Obviously this doesn't need to account for everything right away as this is in 'experimental' status, but it's worth thinking about future-proofing as much as possible.

Thoughts on individual functions:

read_wb

Agree with Katherine on either making it flexible for single wb or directory.
Is there a use case for reading a whole directory of excel files? What are the odds that all the excel files in the directory will be ones you want to apply this function to? And would they all have the same arguments when passed to get_breaks() and to_fff()?
Would it make more sense to write the function to handle a single workbook, and leave the user to use map() or a loop to do it with multiple workbooks?

get_breaks

Change params to be more descriptive. col_num could be something like column_to_break.
rem_grps isn't entirely clear to my what it is for and why you would want to remove indices. Add documentation.

Thoughts on to_fff():

It has a lot of arguments. It would be good to include in the help documentation what the default values are for the params that have them, and consider whether other sensible defaults can be added or there is any way to reduce the args. Also explain what happens when default values are left as is if it is non-obvious.
The params could use more explanation in the help, but as Katherine said a vignette would be key. I'd also suggest trying to name the params a bit more descriptively, even if it makes them more verbose - with autocomplete, it isn't a big deal. For example, my_vals could perhaps be something like cols_to_keep or preserved_data_cols; hdate could be date_column, etc.
Your param documentation that says "see column_headers_rows", etc, are referring to what? I don't see those functions

jzadra commented 1 year ago

Also, have you tried using unpivotr for this functionality and found it lacking?

IntegralEnvision / integral

Added custom functions: a) read_wb; b) get_breaks; c) to_fff. #72