DOI-USGS / hyRefactor

https://code.usgs.gov/wma/nhgf/reference-fabric/hyrefactor
Creative Commons Zero v1.0 Universal

Refactor ID assignment. #36

Closed dblodgett-usgs closed 2 years ago

dblodgett-usgs commented 2 years ago

Use the integer mainstem ID as a string, with an integer sequence appended in toposort order.

dblodgett-usgs commented 2 years ago

Needs to happen here: https://github.com/dblodgett-usgs/hyRefactor/blob/main/R/reconcile.R#L69

dblodgett-usgs commented 2 years ago

Will group by levelpathi, arrange by hydrosequence, then assign an ID as described above.

For example, a mainstem ID of 1234 with sorted catchment IDs 1, 2, 3, 4 (where 1 is the outlet and 4 is the headwater) will result in IDs like 12341, 12342, 12343, 12344, and so on.
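A minimal sketch of the group-and-sort step described above, written in Python rather than the package's R so it can be run standalone. The function name `assign_ids` and the input shape (tuples of levelpathi and hydrosequence) are hypothetical, and it assumes that a smaller hydrosequence value means further downstream, so the outlet sorts first.

```python
from collections import defaultdict


def assign_ids(flowlines):
    """Map each (levelpathi, hydroseq) pair to a concatenated ID string.

    Assumption: within one levelpath, the smallest hydroseq is the outlet.
    """
    # Group hydrosequence values by their mainstem (levelpathi).
    groups = defaultdict(list)
    for levelpath, hydroseq in flowlines:
        groups[levelpath].append(hydroseq)

    ids = {}
    for levelpath, seqs in groups.items():
        # Number catchments 1..n from outlet to headwater.
        for position, hydroseq in enumerate(sorted(seqs), start=1):
            ids[(levelpath, hydroseq)] = f"{levelpath}{position}"
    return ids


example = [(1234, 40), (1234, 10), (1234, 30), (1234, 20), (77, 5)]
ids = assign_ids(example)
print(ids[(1234, 10)])  # outlet of mainstem 1234
print(ids[(1234, 40)])  # headwater of mainstem 1234
```

One caveat of plain concatenation: mainstem 1234 with catchment 1 and mainstem 123 with catchment 41 would both produce 12341, so the scheme needs either a fixed width or a separator to stay unambiguous.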

dblodgett-usgs commented 2 years ago

This is probably a terrible idea... but could do it like this too: paste(as.integer(charToRaw("123-4")), collapse = "")
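For illustration, here is a rough Python equivalent of that R one-liner (the helper name `bytes_to_digits` is hypothetical). It shows what the encoding produces and one reason it may be a poor fit: every character becomes two digits, so even a short ID exceeds the 32-bit integer range.

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647


def bytes_to_digits(identifier):
    """Concatenate the decimal ASCII code of each character."""
    return "".join(str(byte) for byte in identifier.encode("ascii"))


encoded = bytes_to_digits("123-4")
print(encoded)                   # 4950514552
print(int(encoded) > INT32_MAX)  # True
```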

dblodgett-usgs commented 2 years ago

Latest thought of how this could work would be:

Mainstems will be numbered from 1 up to the single-digit millions. If we pad mainstem IDs so that they start above the greatest possible mainstem ID, then 1 becomes 10,000,001 (commas for clarity only) and 1,234,567 becomes 11,234,567.

Each catchment along each mainstem within this would be numbered 1:n.

So the first catchment of mainstem 1 would be (10,000,001)(1), i.e. 100,000,011, and catchment n would be (10,000,001)(n).

A 32-bit integer can go up to 2,147,483,647, so under this scheme, if n stays below 1000 (e.g. 1,000,001,999) we are OK. For n > 1000, the ID no longer fits in a 32-bit integer and won't have the performance benefit of an integer ID.

Checking the Mississippi: even in 1:100k-scale data, it has over 1500 catchments. Even an ID like (1,234,567)(1,500), which is what a base mainstem set over 1 million with an allowance for over 1000 catchments per mainstem would require, causes issues.
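The overflow concern can be checked directly. The sketch below is in Python for convenience and assumes plain decimal concatenation of the mainstem number and the catchment number; `concat_id` and `fits_int32` are hypothetical helper names, not hyRefactor functions.

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647


def concat_id(mainstem, catchment):
    """Build an integer ID by writing the catchment number after the mainstem."""
    return int(f"{mainstem}{catchment}")


def fits_int32(value):
    return value <= INT32_MAX


cases = [
    (10_000_001, 1),    # first catchment of padded mainstem 1
    (1_000_001, 999),   # the 1,000,001,999 example
    (10_000_001, 999),  # padded mainstem with a three-digit catchment
    (1_234_567, 1500),  # Mississippi-sized mainstem
]
results = {}
for mainstem, catchment in cases:
    value = concat_id(mainstem, catchment)
    results[(mainstem, catchment)] = fits_int32(value)
    print(f"{value:,} fits in int32: {results[(mainstem, catchment)]}")
```

Under these assumptions, an eight-digit padded mainstem followed by a three-digit catchment number is already 11 digits long, so the limit may be reached even sooner than n = 1000.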

This leads me to think we should use a character ID, possibly with just a dash between the mainstem ID and the flowline number. One complication, though: we run some mainstems in multiple regions.

More to think about. @mikejohnson51 -- have you come up with any other ideas?

mikejohnson51 commented 2 years ago

After our discussion last week, I think this should work and meet all needs:

  1. While creating VPU-level refactors, use the existing integer ID AND create a character ID based on a concatenated mainstemID-integer (e.g. 1234-1, 1234-2). This provides a computational ID and an "interpretable" ID.

Replacing the random ID generation with a "grouped topo sort" would be nice to increase reproducibility.

  2. When combining VPUs into a CONUS layer, the character ID remains unique but the integers are not. So, remove the integer ID, leaving just the characterID.

  3. Then, new unique integer IDs can be created (as.integer(as.factor(characterID))). If you really really wanted to get clever, you could sort the factors by mainstem and sub-mainstem ID, providing a national "topo" sort (I think).

This would provide a simple way to get CONUS-wide unique integer and character IDs?
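A rough sketch of the last two steps, in Python rather than R so it runs standalone. All names here (`factor_codes`, `topo_codes`, the sample VPU lists) are hypothetical. `factor_codes` mimics the idea of as.integer(as.factor(characterID)) using plain string ordering, which may differ from R's locale-dependent level order; `topo_codes` is one possible reading of the "sort by mainstem and sub-mainstem ID" variant. Both assume character IDs are already unique across VPUs.

```python
def parse_char_id(char_id):
    """Split 'mainstem-sequence' into two integers, e.g. '1234-2' -> (1234, 2)."""
    mainstem, sequence = char_id.split("-")
    return int(mainstem), int(sequence)


def factor_codes(char_ids):
    """1-based integer codes from the sorted unique strings."""
    levels = sorted(set(char_ids))
    return {cid: code for code, cid in enumerate(levels, start=1)}


def topo_codes(char_ids):
    """1-based integer codes ordered numerically by mainstem, then sequence."""
    levels = sorted(set(char_ids), key=parse_char_id)
    return {cid: code for code, cid in enumerate(levels, start=1)}


# Character IDs from two made-up VPUs; the per-VPU integer IDs are dropped.
vpu_a = ["1234-1", "1234-2", "1234-10"]
vpu_b = ["99-1", "99-2"]
conus = vpu_a + vpu_b

by_factor = factor_codes(conus)
by_topo = topo_codes(conus)
print(by_factor)
print(by_topo)
```

With string ordering, "1234-10" sorts before "1234-2", so the integer codes are unique but do not follow catchment order. The numeric sort keeps catchments in sequence within each mainstem.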

dblodgett-usgs commented 2 years ago

Yeah - I like this. We can leave ID smarts outside hyRefactor this way as well. I really don't want to touch smart IDs if I can help it at all.