DOI-USGS / hydroloom

hydrologic geospatial fabric creation tools. See official repository here: https://code.usgs.gov/water/hydroloom
https://doi-usgs.github.io/hydroloom/
Creative Commons Zero v1.0 Universal
25 stars 2 forks source link

Colleague review tracking issue. #12

Closed dblodgett-usgs closed 11 months ago

dblodgett-usgs commented 1 year ago

@mikejohnson51 agreed to provide a domain and code review. Will use this issue as a parent to track the review. Thanks for being willing to do it!!

https://www.usgs.gov/products/software/software-management/types-software-review for suggested review.

Note that test files in tests/testthat largely mimic file naming in R and test coverage attempts to fully capture desired behavior of algorithms.

mikejohnson51 commented 1 year ago

Package: hydroloom Reviewer: Mike Johnson Date: August 30, 2023

(@dblodgett-usgs added check boxes)

Description

README:

R

00_hydrooom.R

Accumulate_downstream.R

Add_divergence.R

Add_levelpaths.R

add_pathlength.R

add_pfafstetter.R

add_streamorder-level.R

add_toids.R

check_hy_graph.R

get_hydro_location.R

index_points_to_lines.R

make_attribute_topology.R

make_fromids.R

make_index_ids.R

make_node_topology.R

navigate_connected_paths.R

navigate_network_dfs.R

navigation_network.R

Sort_network.R

Utils.R

Vignettes:

Topological Sort Based Network Attributes

Tests:

anguswg-ucsb commented 1 year ago

Hi @dblodgett-usgs,

Here is a really small edit I saw that could be made to the documentation in check_hy_graph.R

check_hy_graph.R

dblodgett-usgs commented 1 year ago

hah -- but it iterates through what is typically implemented as a recursive algorithm because R is not good with recursion. I'll clarify that it is the recursive depth first search algorithm but implemented with iteration.

dblodgett-usgs commented 1 year ago

Initial structural work complete according to review. I've cleaned up use of :: and consolidated all imports. There's still a bit of tech debt to take care that will be addressed with future commits for other comments.

dblodgett-usgs commented 1 year ago

To address a number of issues above, my implementation of tidy selection and masking needs to be clarified. I'll add something about this to a CONTRIBUTING.md as well.

.data was only deprecated for tidy selection. For functions like filter(), mutate(), or arrange() you have to use data masking. The details are described here. https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html I banged my head on this pretty hard when the change was rolled out. See: https://github.com/DOI-USGS/nhdplusTools/issues/306

I think I have been consistent in my use of hydroloom package attributes behind .data${var} or .data[[{var}.

My tidyselect leverages a set of standardized package attributes declared at the top of 00_hydroloom.R. It may look odd because I am passing package variables that are length one character vectors instead of "strings". This felt like a reasonable way to accomplish what I was trying to do with hydroloom names and has worked pretty well so far.

dblodgett-usgs commented 1 year ago

I reorganized 00_hydroloom according to review. I want is.hy to message instead of warn --- calling functions can issue warnings if they get FALSE.

dblodgett-usgs commented 1 year ago

As far as I know, a tibble will always have inherit from data.frame so a data.frame method will capture both tibbles and data.frames. Internally, everything is coerced to tibble when converting to hy.

dblodgett-usgs commented 1 year ago

In add_divergence, I've clarified some documentation regarding selection of the smallest ID -- this is a time when no other attributes are available to make a distinction and something has to be used to select one or the other.

There is a double unnest that looks strange because it's two of the same call back to back. It is intended.

I've searched the package for all instances of return() and verified that they are needed rather than leaving off a final value in the function.

dblodgett-usgs commented 1 year ago

I'd rather not import another package for table formatting. I put a little time into the print method for the name definitions.

mikejohnson51 commented 1 year ago

At worst I was thinking adding it to Suggests. I also think (not 100% sure) that if you explicitly declare the function in the readme (with ::) it can pass devtools check without being in the description.

dblodgett-usgs commented 1 year ago

Oh I see -- yeah, I can do that. There's a print method for that list in the package as well.

e.g. it now does

> hydroloom::hydroloom_name_definitions
1 "id": 
     shared network identifier for catchment divide and flowpath or flowline
2 "toid": 
     indicates to the downstream id. May or may not be dendritic
3 "fromnode": 
     indicates the node representing the nexus upstream of a catchment
4 "tonode": 
     indicates the node represneting the nexus downstream of a catchment
...
dblodgett-usgs commented 1 year ago

I went ahead and added the add_topo_sort function but realized that in the pfaf function, topo_sort and levelpath have to be in sync -- so I actually don't use it there. I did add that caveat to the documentation.

dblodgett-usgs commented 1 year ago

note that adding .data.table.aware was kind of an out of caution thing so I can use data.table syntax safely. I'll move it to where declare all my imports.

dblodgett-usgs commented 1 year ago

With the way I handle data.frame classes, I have to use st_sf() to add the sf class and minor attributes back to a data.frame when they get stripped from time to time. st_as_sf() is a bit more heavy handed and doesn't do quite what I want.

dblodgett-usgs commented 1 year ago

data.table usage is 100% worth the overhead to convert to/from tibble. Some very large joins (hi res NHD) hang when using dplyr. For most tables, the time to convert to data.table is so small that it has no affect. I could see having a switch for very large tables but don't feel that its necessary at this point.

dblodgett-usgs commented 1 year ago

With regards to hyRefactor / hydrofab -- let's try and move general functionality over to hydroloom over the long run?

dblodgett-usgs commented 1 year ago

Rounding in index_points_to_lines is on the linear reference. It rounds to the nearest hundredth of a percent. Not sure a control on that is really needed.

dblodgett-usgs commented 1 year ago

In general, I like to keep little utility function used in map/apply patterns as close to where I call them as reasonable. Once a function gets used in more places, I would consider moving to a general utility function space.

dblodgett-usgs commented 1 year ago

I like leaving the alternate implementation as a comment as in make_fromids. Nice for people who don't know both syntax (like me).

dblodgett-usgs commented 1 year ago

hahah g is used in some other packages for graph data. I kind of like it but it may also be confusing. I want to try and use g for the index id format of the network and x for the traditional hy form of the network.

dblodgett-usgs commented 1 year ago

navigate_hydro_network was to avoid a function signature collision with nhdplusTools. I kind of need to change it to something else.

dblodgett-usgs commented 1 year ago

tidyr utility functions and st_drop_geometry didn't used to do what I wanted... I've cleaned up those utility functions.

dblodgett-usgs commented 11 months ago

We are on CRAN. https://cran.r-project.org/package=hydroloom