feat: when loading HIFLD transmission lines, join disconnected line segments

danielolsen commented 2 years ago

Purpose

The HIFLD transmission line paths are defined by either: a list of lat/lon points (which I'll call a 'segment' from hereon out), or a list of such lists. The Jul 2020 dataset values seem to always be a list of segments, but the February 2020 dataset is either a single segment (not wrapped in a list) or a list of segments.

Previously, when we had a list of segments, we always grabbed the first one, discarding all the rest. This PR changes the logic so that if we get a list of segments of length greater than one, we make an assumption about how to reconnect them. The logic is a little complex because the segments aren't 'drawn' in any particular order, don't necessary connect, and there are some small noisy segments that need to be discarded.

What the code is doing

Within helpers.py, we copy over a couple of helper functions for calculating distances between coordinate pairs.

Within load.py, we add a new function _join_line_segments to take a list of segments and parse them.

There is a private function _generate_linear_spanning_tree which:
- creates a graph with each segment's endpoints as nodes and an edge between the start and end of each segment
- creates edges between each endpoint of each segment and each endpoint of each other segment, weighted by distance
- find the minimum spanning tree of the graph
- raises an error if the minimum spanning tree isn't a chain (not sure if this is the right graph theory terminology or not)
- returns the minimum spanning tree if not
The list of segments is passed to _generate_linear_spanning_tree, and if the tree isn't a chain, the smallest segment is dropped and the process is repeated until we have a chain tree.
The final list of segments is processed to put all of the coordinate pairs into the proper order, by traversing the tree and flipping the original segments if they go from 'end' to 'start' rather than 'start' to 'end'.

Testing

Tested manually. Calling get_hifld_electric_power_transmission_lines works, as well as create_grid.

Time estimate

30-60 minutes. The code is not too complex, but the logic is, since it needs to handle many edge cases found in the source data.

BainanXia commented 2 years ago

This is interesting! I'm wondering the performance of this algorithm. What is the maximum number of segments for a line in the dataset? We are creating a pretty dense graph by adding edges between all end-points of segments for a line. The overall time complexity is bounded by O(N*V^2logV) where N is the number of lines, V is the number of maximum segments. Not sure how long it takes to go through the current transmission dataset.

Just another random thought (might be a bad idea;)): assuming all the line segments roughly form a straight line rather than circles, would it be sufficient to project one of the end points of each segment onto an axis in a 3D cartesian coordinates system (latlong_to_xyz) to get some sort of partial ordering? For example, consider three segments with start points being (x1, y1, z1), (x2, y2, z2), (x3, y3, z3), sort the points based on the projections of the axis with maximum span (suppose max(x1, x2, x3) - min(x1, x2, x3) > max(y1, y2, y3) -min(y1, y2, y3) > max(z1, z2, z3) - min(z1, z2, z3), then x axis is chosen). In this way, a partial ordering of segments is determined. It won't tell us the start/end info and not able to fill gaps between segments though. Also, not sure how it performs over the noisy source data.

danielolsen commented 2 years ago

For the July 2020 dataset, we have 89,744 total lines, of which 80 need joining. On my laptop, this step takes about three seconds to get through all the lines (including the single-segment lines where we're just identifying that the list has length one and returning the first element). The vast majority of these lines have either 2 or 3 segments, although there's one line with 5, one line with 10, and one line with 33. I believe that when I was looking at the February 2022 dataset, I didn't see a 33-segment line, but I did see a 14-segment line (maybe in its place?).

BainanXia commented 2 years ago

@danielolsen Thanks for looking into the statistics, <0.1% is super light. Then I presume this is minor. Probably it isn't worth the time to explore other possibilities. Given what we have works for now, let's stick with it.

Breakthrough-Energy / PreREISE