Colleague review tracking issue.

dblodgett-usgs commented 1 year ago

@mikejohnson51 agreed to provide a domain and code review. Will use this issue as a parent to track the review. Thanks for being willing to do it!!

https://www.usgs.gov/products/software/software-management/types-software-review for suggested review.

Note that test files in tests/testthat largely mimic file naming in R and test coverage attempts to fully capture desired behavior of algorithms.

mikejohnson51 commented 1 year ago

Package: hydroloom Reviewer: Mike Johnson Date: August 30, 2023

(@dblodgett-usgs added check boxes)

Description

[x] - RANN::nn2 should be imported?
[x] - Larger point is that your imports are a little all over, with some using ::, some using @importFrom in the 00_hydroloom.R file, and some using @importFrom in the function roxygen. My personal preference would be to see all imports in the 00_hydroloom.R file using the @importFrom syntax.
[x] - Another example, fastmap should have importFrom for fastqueue and faststack?

README:

[x] - It would be nice to see the test coverage here (since you already did the hard part!)
[x] - Kick the definitions into a table (kable or DT?). It is hard to read with the base output
[x] - Key Terms is empty (right above Terminology)
[x] - Would be nice to outline how hydroloom will support the continued functionality of nhdplusTools… it reads kinda like the latter will be superseded.
[x] - Should you add divide to Terminology?

R

00_hydroloom.R

[x] - This file is a little hard to get through, would be helpful to have imports, declarations and functions separated.
[x] - Have a rogue stats:: call on line 268. stats is in the DESCRIPTION but not the importFrom of this file?
[x] - The importFrom here (Lines 195-202) seem partial and oddly selective.
[x] - Message or Warning in is.hy? (Lines 282-303)

Accumulate_downstream.R

[x] - Line 16 don’t need paste in stop() calls… it will concatenate strings for you
[x] - Line 23. Does a data.frame method automatically apply to tibble? (design notes indicate a preference for tibble). I didn’t think it did, but it seems it does based on this package.
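For the Line 16 point, a minimal sketch of how stop() joins its arguments without paste. The function `check_positive` is hypothetical and only illustrates the behaviour; it is not hydroloom code.

```r
# Hypothetical validator, for illustration only (not from hydroloom).
check_positive <- function(x) {
  if (any(x <= 0)) {
    # stop() pastes its arguments together with no separator,
    # so wrapping them in paste0() is redundant.
    stop("all values must be positive, found ", sum(x <= 0), " that are not")
  }
  invisible(x)
}

# Capture the error message rather than halting the script.
msg <- tryCatch(
  check_positive(c(1, -2, 0)),
  error = function(e) conditionMessage(e)
)
print(msg)
```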

Add_divergence.R

[x] - Line 39… smaller id? Or hydrosequence? Smaller ID seems subjective?
[x] - Line 136 and 137 are identical.
[x] - Why some :: imports? Specifically, for tidyr and pbapply
[x] - Styling – some function explicitly return() some just leave an output… maybe worth aligning?

Add_levelpaths.R

[x] - Might be worth expanding on the weight input.
[x] - In the examples, explicitly naming the inputs would help the illustrative nature. For example, line 31 --> add_levelpaths(x = test_flowline, name_attribute = "GNIS_ID", weight_attribute = "ArbolateSu")
[x] - Line 126, 138 – I thought dplyr deprecated the .data$ ... calls?

add_pathlength.R

[x] - Line 43, 47: nrow(x) instead of length of column?

add_pfafstetter.R

[x] - Line 1 and 82, would be nice to just add topo_sort if not already there…
[x] - Line 87, bind_rows over do.call?
[x] - Lines 177, 178: inconsistent imports are back
[x] - Lines 190, 191: package declarations and .data call

add_streamorder-level.R

[x] - Line 93: do you need quotes around divergence in the if statement?

add_toids.R

[x] - Line 43: is this needed?
[x] - Line 99: should this be st_as_sf()?

check_hy_graph.R

[x] - Line 40: Is the data.table merge speed worth the conversion?
[x] - Line 50: Based on line 46, I assume toid.y and toid.x exist… should one be dropped and one renamed?
[x] - Line 72, it would be nice to know what “hydroloom convention” is in the details.
[x] - Line 109,122, 188 – Need to add utils to DESCRIPTION file

get_hydro_location.R

[x] - Line 38: would an sapply or apply do the trick here instead of do.call(c, …)?
[x] - Can interp_meas, add_len and add_index all be in the same place?
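For the Line 38 question, a small sketch comparing `do.call(c, lapply(...))` with `vapply()` on toy data (the input list is made up). Whether the swap is safe at Line 38 depends on what is being combined: for plain atomic results the two agree, but for classed objects such as geometries, `c()` method dispatch may be the reason `do.call` was used, since `sapply()`/`vapply()` simplify to atomic vectors.

```r
# Toy input; not data from the package.
parts <- list(1:3, 1:5, 1:2)

# Pattern in question: build a list, then combine with c().
via_do_call <- do.call(c, lapply(parts, length))

# Alternative: vapply() returns an atomic vector directly and
# checks that each result is a single integer.
via_vapply <- vapply(parts, length, integer(1))

print(via_do_call)
print(identical(via_do_call, via_vapply))
```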

index_points_to_lines.R

[x] - Line 20: is the rm needed?
[x] - Line 30-33: more .data calls and explicit function signatures
[x] - Line 52: Add “to units of CRS.” Instead of “units of points.”
[x] - Line 81: I suspect dropping m and z prior to casting would be faster in some cases (same with lines 232-234)
[x] - Line 202:204: :: calls to units
[x] - General note: a lot of this geometry stuff is handled in hyRefactor/hydrofab more generally
[x] - Line 323: should this adopt precision if supplied? (digits = 4)
[x] - Line 350: describe what happens if flines is left NULL

make_attribute_topology.R

[x] - Line 47:48: hydroloom:: ? should get_node be exported?
[x] - Line 76: Is it worth noting that the future backend is exposed in the details?
[x] - Line 64: Move to 00_hydroloom.R? Or within the needed function?

make_fromids.R

[x] - Delete Lines 25:32, you have already required data.table in other places.

make_index_ids.R

[x] - Line 118: Why ‘g’? … just curious

make_node_topology.R

[x] - This is awesome. We use the divergence flag in some of the cross section work and have trouble maintaining it through aggregations.
[x] - Line 75: another hydroloom:: syntax.
[x] - Line 101: I have seen this in a few places… my understanding is the join by column should be “…created with join_by(), or a character vector of variables to join by.” (?left_join) I am actually surprised it's working unquoted.
[x] - In contrast, on Line 113, select arguments tend to be unquoted. I am guessing these both work given your tests, but it is “unconventional.”
[x] - Lines 172-176 are more “conventional.”

navigate_connected_paths.R

[x] - Line 123: see comment about quoted selection variables
[x] - Line 188: put utility function in utils.R
[x] - Lines 190:199: avoid .data calls (I think)

navigate_network_dfs.R

[x] - Line 52: “compatible with hydroloom” is a little unclear
[x] - This is a tricky one. Nice work!

navigation_network.R

[ ] - I much prefer the function name navigate_network over navigate_hydro_network (matching the file name)
[x] - Line 242: bind_rows?
[x] - Lines 276:316 – lots of .data$ calls

Sort_network.R

[x] - Line 177: .data$ calls, and quoted selects

Utils.R

[x] - Line 4: .data call
[x] - Line 26: replace with tidyr::replace_na haha?
[x] - Line 33: replace with tidyr::unnest?
[x] - Line 107: Why is this function needed? E.g. data.frame(x = 1:2, y = 2:3) %>% sf::st_drop_geometry() works

Vignettes:

Topological Sort Based Network Attributes

[x] - Oh no! You have gone full Fred. HY Features is not a data model :P Maybe add the ER report?
[x] - Should reference our last paper (seems more pertinent than mainstems)? A lot of this content seems 1:1 with that.

Tests:

Seem adequate to me! I would like to see a coverage badge in the Readme

anguswg-ucsb commented 1 year ago

Hi @dblodgett-usgs,

Here is a really small edit I saw that could be made to the documentation in check_hy_graph.R

check_hy_graph.R

Line 9: In the description it says, 'is often referred to as "recursive depth first search".', but it looks like you are implementing an "iterative depth first search" in your code.

dblodgett-usgs commented 1 year ago

hah -- but it iterates through what is typically implemented as a recursive algorithm because R is not good with recursion. I'll clarify that it is the recursive depth first search algorithm but implemented with iteration.
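A minimal sketch of that idea, assuming a toy network stored as a named list: the depth first search usually written recursively is driven here by an explicit stack and a while loop. The names `net` and `dfs_iterative` are illustrative and this is not the check_hy_graph implementation.

```r
# Toy network: each id maps to the ids it flows to (illustrative data).
net <- list(a = c("b", "c"), b = "d", c = "d", d = character(0))

dfs_iterative <- function(graph, start) {
  stack <- start               # explicit stack stands in for the call stack
  visited <- character(0)
  while (length(stack) > 0) {
    # Pop the top of the stack.
    node <- stack[length(stack)]
    stack <- stack[-length(stack)]
    if (node %in% visited) next
    visited <- c(visited, node)
    # Push neighbours in reverse so the first neighbour is visited first,
    # matching the order a recursive version would follow.
    stack <- c(stack, rev(graph[[node]]))
  }
  visited
}

visit_order <- dfs_iterative(net, "a")
print(visit_order)
```

The visit order is the same pre-order a recursive version would produce, but the depth of the search is limited by memory for the stack vector rather than by R's call stack.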

dblodgett-usgs commented 1 year ago

Initial structural work complete according to review. I've cleaned up use of :: and consolidated all imports. There's still a bit of tech debt to take care of, which will be addressed in future commits for the other comments.

dblodgett-usgs commented 1 year ago

To address a number of issues above, my implementation of tidy selection and masking needs to be clarified. I'll add something about this to a CONTRIBUTING.md as well.

.data was only deprecated for tidy selection. For functions like filter(), mutate(), or arrange() you have to use data masking. The details are described here. https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html I banged my head on this pretty hard when the change was rolled out. See: https://github.com/DOI-USGS/nhdplusTools/issues/306

I think I have been consistent in my use of hydroloom package attributes behind .data${var} or .data[[{var}]].

My tidyselect leverages a set of standardized package attributes declared at the top of 00_hydroloom.R. It may look odd because I am passing package variables that are length one character vectors instead of "strings". This felt like a reasonable way to accomplish what I was trying to do with hydroloom names and has worked pretty well so far.
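A rough sketch of the distinction, assuming dplyr and tibble are installed. The variables `id_name` and `toid_name` are hypothetical stand-ins for the package-level name variables; they are deliberately named differently from the columns so that nothing inside a data-masked expression shadows them.

```r
library(dplyr)

# Hypothetical stand-ins for length-one character package variables.
id_name <- "id"
toid_name <- "toid"

x <- tibble::tibble(id = c(1, 2, 3), toid = c(2, 3, 0), other = c("a", "b", "c"))

# Data masking (filter, mutate, arrange): look the column up through .data.
outlets <- filter(x, .data[[toid_name]] == 0)

# Tidy selection (select, rename): .data is deprecated here, so pass the
# character vector through all_of() instead.
topo <- select(x, all_of(c(id_name, toid_name)))

# Joins accept a character vector for `by`, so the variable is passed as-is.
joined <- left_join(topo, select(x, all_of(id_name), other), by = id_name)

print(outlets)
print(names(joined))
```

This is why a join `by` argument can look "unquoted" in code that passes a variable holding the column name: the variable evaluates to a character vector.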

dblodgett-usgs commented 1 year ago

I reorganized 00_hydroloom according to review. I want is.hy to message instead of warn --- calling functions can issue warnings if they get FALSE.
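The pattern could look roughly like the sketch below. Both functions are hypothetical and only loosely modelled on the behaviour described; this is not the real is.hy().

```r
# Hypothetical checker: explains the problem with message() and returns FALSE.
has_required_names <- function(x, required = c("id", "toid")) {
  missing_names <- setdiff(required, names(x))
  if (length(missing_names) > 0) {
    message("missing names: ", paste(missing_names, collapse = ", "))
    return(FALSE)
  }
  TRUE
}

# Hypothetical caller: decides for itself whether FALSE deserves a warning.
summarise_network <- function(x) {
  if (!has_required_names(x)) {
    warning("input is not a valid network table; returning NULL")
    return(NULL)
  }
  nrow(x)
}

ok <- summarise_network(data.frame(id = 1:3, toid = c(2, 3, 0)))
bad <- suppressWarnings(suppressMessages(summarise_network(data.frame(id = 1:3))))
print(ok)
print(is.null(bad))
```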

dblodgett-usgs commented 1 year ago

As far as I know, a tibble will always inherit from data.frame, so a data.frame method will capture both tibbles and data.frames. Internally, everything is coerced to tibble when converting to hy.
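A quick way to check this, assuming the tibble package is installed. The generic `count_rows` is hypothetical and exists only to show S3 dispatch falling through to the data.frame method.

```r
# Hypothetical generic with only a data.frame method.
count_rows <- function(x, ...) UseMethod("count_rows")
count_rows.data.frame <- function(x, ...) nrow(x)

tb <- tibble::tibble(id = 1:3)

# A tibble carries "data.frame" at the end of its class vector,
# so S3 dispatch reaches the data.frame method.
print(class(tb))
print(inherits(tb, "data.frame"))

n <- count_rows(tb)
print(n)
```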

dblodgett-usgs commented 1 year ago

In add_divergence, I've clarified some documentation regarding selection of the smallest ID -- this is a time when no other attributes are available to make a distinction and something has to be used to select one or the other.

There is a double unnest that looks strange because it's two of the same call back to back. It is intended.

I've searched the package for all instances of return() and verified that they are needed rather than leaving off a final value in the function.

dblodgett-usgs commented 1 year ago

I'd rather not import another package for table formatting. I put a little time into the print method for the name definitions.

mikejohnson51 commented 1 year ago

At worst I was thinking adding it to Suggests. I also think (not 100% sure) that if you explicitly declare the function in the readme (with ::) it can pass devtools check without being in the description.
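A sketch of what that could look like in a README chunk, assuming knitr is available when the README is rendered. The table contents are abbreviated stand-ins, and whether this passes a package check without a Suggests entry is not verified here.

```r
# Abbreviated, made-up stand-in for the name definitions.
defs <- data.frame(
  name = c("id", "toid"),
  definition = c("shared network identifier", "downstream id")
)

# Call kable() with an explicit namespace so nothing is attached,
# and skip the table if knitr is not installed.
if (requireNamespace("knitr", quietly = TRUE)) {
  print(knitr::kable(defs, format = "markdown"))
}
```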

dblodgett-usgs commented 1 year ago

Oh I see -- yeah, I can do that. There's a print method for that list in the package as well.

e.g. it now does

> hydroloom::hydroloom_name_definitions
1 "id": 
     shared network identifier for catchment divide and flowpath or flowline
2 "toid": 
     indicates to the downstream id. May or may not be dendritic
3 "fromnode": 
     indicates the node representing the nexus upstream of a catchment
4 "tonode": 
     indicates the node represneting the nexus downstream of a catchment
...

dblodgett-usgs commented 1 year ago

I went ahead and added the add_topo_sort function but realized that in the pfaf function, topo_sort and levelpath have to be in sync -- so I actually don't use it there. I did add that caveat to the documentation.

dblodgett-usgs commented 1 year ago

Note that adding .datatable.aware was kind of an out of caution thing so I can use data.table syntax safely. I'll move it to where I declare all my imports.

dblodgett-usgs commented 1 year ago

With the way I handle data.frame classes, I have to use st_sf() to add the sf class and minor attributes back to a data.frame when they get stripped from time to time. st_as_sf() is a bit more heavy handed and doesn't do quite what I want.
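A minimal sketch of the situation described, assuming the sf package is installed. It simulates a step that leaves a plain data.frame with the geometry column still attached, then adds the sf class back with st_sf(); the toy points are made up and this is not the package's internal code.

```r
library(sf)

# Toy points, for illustration only.
pts <- st_sfc(st_point(c(0, 0)), st_point(c(1, 1)), crs = 4326)
x <- st_sf(id = 1:2, geometry = pts)

# Simulate an operation that drops the sf class but keeps the sfc column.
stripped <- as.data.frame(x)
print(class(stripped))

# st_sf() finds the geometry column and restores the sf class.
restored <- st_sf(stripped)
print(class(restored))
```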

dblodgett-usgs commented 1 year ago

data.table usage is 100% worth the overhead to convert to/from tibble. Some very large joins (hi res NHD) hang when using dplyr. For most tables, the time to convert to data.table is so small that it has no effect. I could see having a switch for very large tables but don't feel that it's necessary at this point.
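The round trip could be sketched as below, assuming data.table and tibble are installed. The tables are toy data and no timing claim is made here; the point is only the convert, merge, convert-back shape.

```r
library(data.table)

# Toy tables, for illustration only.
flowlines <- tibble::tibble(id = 1:4, toid = c(2L, 3L, 4L, 0L))
attrs <- tibble::tibble(id = c(1L, 2L, 4L), name = c("a", "b", "d"))

# Convert to data.table, do a left join with merge(), convert back.
joined <- merge(
  as.data.table(flowlines),
  as.data.table(attrs),
  by = "id",
  all.x = TRUE
)
joined <- tibble::as_tibble(joined)

print(joined)
```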

dblodgett-usgs commented 1 year ago

With regards to hyRefactor / hydrofab -- let's try and move general functionality over to hydroloom over the long run?

dblodgett-usgs commented 1 year ago

Rounding in index_points_to_lines is on the linear reference. It rounds to the nearest hundredth of a percent. Not sure a control on that is really needed.

dblodgett-usgs commented 1 year ago

In general, I like to keep little utility functions used in map/apply patterns as close to where I call them as reasonable. Once a function gets used in more places, I would consider moving it to a general utility function space.

dblodgett-usgs commented 1 year ago

I like leaving the alternate implementation as a comment as in make_fromids. Nice for people who don't know both syntaxes (like me).

dblodgett-usgs commented 1 year ago

hahah g is used in some other packages for graph data. I kind of like it but it may also be confusing. I want to try and use g for the index id format of the network and x for the traditional hy form of the network.

dblodgett-usgs commented 1 year ago

navigate_hydro_network was to avoid a function signature collision with nhdplusTools. I kind of need to change it to something else.

dblodgett-usgs commented 1 year ago

tidyr utility functions and st_drop_geometry didn't use to do what I wanted... I've cleaned up those utility functions.

dblodgett-usgs commented 11 months ago

We are on CRAN. https://cran.r-project.org/package=hydroloom
