Open ns-rse opened 1 month ago
From @MaxGamill-Sheffield
Essentially convPrune was a way to use convolutions to make the pruning more intuitive and smaller code-wise compared to topostats prune - a side-by-side refactor if you will but I can’t remember if I ever finished it, or if it’s useful / upto date any more
In light of this I will undertake the following...
convPrune
class, it likely isn't used and as noted above possibly has an error in the _prune_by_length()
method where the length of the molecule is calculated because the size of the whole 2D binary skeleton array is taken (in contrast to topostatsPrune._prune_by_length()
which uses len(coordinates)
of the skeleton).pruneSkeleton
class and simplify to a simple set of functions implementing a "factory pattern" with a single factory method which calls topostatsPrune()
, this leaves it in a state where extension will be possible if alternative pruning methods are to be implemented.convPrune
(under tests/tracing/test_pruning.py::TestConvPrune
).@MaxGamill-Sheffield Its not just renaming of variables that is causing the current test suite to fail, after doing some digging and scratching my head a lot, working out that we now need to pass all of the new options to dnaTrace()
class in and doing so from the default_config()
I found that the pruning step undertaken by get_disordered_trace()
doesn't return the same pruned objects as previously.
I've therefore marked a bunch of tests in topostats/tracing/test_tracing_single_grain.py
to be skipped until we can work out what parameter values from the existing code base (where most are hard coded) need to be set in the current configuration. There are probably other tests that fail but I'm taking one step at a time.
Its compounded by the hassle of having to take a list of co-ordinates and revert them to a 2D array to plot (I've left a note in 119b0d0 with some hacky code of how to do this as much for my own reference when I return to this later next week).
If you have any time/capacity to start investigating this it would be really useful. I was under the mistaken impression that you might have been using the existing test suite as you were refactoring and ensured that these passed or were updated as required (that is one of the major benefits of tests) but that isn't the case so we'll have to work through this now.
I've been looking at why many of the current set of tests in tests/tracing/test_tracing_single_grain.py
fail and its because the call to get_disordered_trace()
does return the same shape skeleton.
We've already identified and corrected a problem with getSkeleton()
which was mistakenly being passed the original image and instead of the mask the image after it had been passed through Gaussian filter (see 84d0896). The gaussian_filter() method is applied to the heights (self.image line 295) and this smoothed image self.smoothed_grain was being passed into getSkeleton() along with the original self.image (call starts on line 465).
I've found the use of grain
img
and mask
to be somewhat inconsistent and a possible source of confusion and as well as correcting the above problem so that an img
and mask
are passed have sought to use mask
wherever a binary array is expected and img
wherever heights are expected and remove grain
from the equation.
The get_disordered_trace()
still doesn't return the pruned skeletons we expect based on the existing tests in tests/tracing/test_tracing_single_grain.py::test_get_disordered_trace
. This only asserts that the size of the arrays are as expected and that the start and end co-ordinates are the same. Tests fail on checking the length as longer skeletons are now returned by the get_disordered_trace()
method.
This function in the current refactored code calls in order...
def get_disordered_trace(self):
"""
Derive the disordered trace coordinates from the binary mask and image via skeletonisation and pruning.
"""
self.skeleton = getSkeleton(
self.smoothed_grain,
self.mask,
method=self.skeletonisation_params["method"],
height_bias=self.skeletonisation_params["height_bias"],
).get_skeleton()
self.pruned_skeleton = prune_skeleton(self.smoothed_grain, self.skeleton, **self.pruning_params.copy())
self.pruned_skeleton = self.remove_touching_edge(self.pruned_skeleton)
self.disordered_trace = np.argwhere(self.pruned_skeleton == 1)
There are eight parameterised tests as there are four skeletonisation methods (topostats
, zhang
, lee
and thin
)
and two grains to be tested (circular
and linear
).
To investigate I've plotted the returned skeleton after pruning from the existing code on main
and under this branch
(maxgamill-sheffield/800-better-tracing
/ ns-rse/818-tests-pruning-topostats-conv
).
NB - In light of the artifacts in the images on this branch not being clearly binary (yellow) plots I've double checked that the skeletons returned on this branch are in fact binary arrays and they are.
Method | Molecule | Branch | Image |
---|---|---|---|
topostats |
linear |
main |
|
topostats |
linear |
maxgamill-sheffield/800-better-tracing |
|
topostats |
circular |
main |
|
topostats |
circular |
maxgamill-sheffield/800-better-tracing |
|
zhang |
linear |
main |
|
zhang |
linear |
maxgamill-sheffield/800-better-tracing |
|
zhang |
circular |
main |
|
zhang |
circular |
maxgamill-sheffield/800-better-tracing |
|
lee |
linear |
main |
|
lee |
linear |
maxgamill-sheffield/800-better-tracing |
|
lee |
circular |
main |
|
lee |
circular |
maxgamill-sheffield/800-better-tracing |
|
thin |
linear |
main |
|
thin |
linear |
maxgamill-sheffield/800-better-tracing |
|
thin |
circular |
main |
|
thin |
circular |
maxgamill-sheffield/800-better-tracing |
The newly introduced pruning_params
are passed in to the prune_skeleton()
call and I've double checked that in my refactoring to remove the excessive complexity that the **kwargs
are correctly passed through the factory method to the topostatsPrune
class (and they are).
I suspect some of these parameters though are different from the hard-coded method that is used on main
.
The branches on the skeletons from main
when using zhang
/ lee
/ thin
methods are to be expected since pruning was not applied (only the topostats
method prunes as it is called from within the getSkeleton
class. This is performed by pruneSkeleton
which does so by setting max_branch_length
to be 15% of the length_of_trace
(hard coded). Refactoring makes this configurable but default value of -1
used in configuration triggers this default to be set.
Looking at the getSkeleton()
class on main
the self.doSkeletonising()
method is called.
This method (see here) repeatedly calls the pruneSkeleton()
method until while self.pruning:
which may be a source of differences under the topostats
method.
In the current refactored code this repeated pruning doesn't appear to occur. Skeletonisation and pruning have been separated out and within get_disordered_trace()
(post correction mentioned above so that self.mask
is passed to getSkeleton
) the steps are...
getSkeleton(...).get_skeleton()
prune_skeleton()
remove_touching_edges()
np.argwhere()
.Perhaps the reason we're not seeing the same skeletons is because only a single round of pruning is being undertaken? :thinking:
Looked at this more this morning and have found that in the current refactored when get_disordered_trace()
the repeated pruning of branches is still in place within the pruning.pruneSkeleton._prune_by_length()
method (see line 251 where the while pruning:
loop starts
However, within this loop the max_branch_length
is recalculated on each iteration (see line 259). This means that as branches are pruned the max_branch_length
itself is reduced.
In contrast the main
branch in the topostats.tracing.getSkeleton.pruneSkeleton()
method max_branch_length
is calculated once before and outside of the (see line 780 and remains constant throughout the subsequent loops as its not recalculated.
I think this might be the reason some of the branches are not pruned when running the test suite on the refactored code (branch maxgamill-sheffield/800-better-tracing
from which ns-rse/818-tests-pruning-topostats-conv
was made).
I shall be investigating further today.
Currently on maxgamill-sheffield/800-better-tracing
and this branch ns-rse/818-tests-pruning-topostats-conv
the max_branch_length
is recalculated on each round of pruning.
In tests/tracing/test_dnatracing_single_grain.py::test_get_disordered_test
this means that on the first set of parameters the maximum allowed branch length (in terms of number of coordinates) reduces on each iteration and a total of three iterations are run...
INFO topostats:pruning.py:282 [pruning] : Pruning by length.
INFO topostats:pruning.py:341 [pruning] : Maximum branch length : 44
INFO topostats:pruning.py:341 [pruning] : Maximum branch length : 29
INFO topostats:pruning.py:341 [pruning] : Maximum branch length : 26
This isn't the behaviour in the current main
branch (see above comment) and if we therefore calculate the max_branch_length
once outside of the while pruning:
loop we don't observe any reduction in the branch length but we still only see three iterations...
INFO topostats:pruning.py:341 [pruning] : Maximum branch length : 45
INFO topostats:pruning.py:341 [pruning] : Maximum branch length : 45
INFO topostats:pruning.py:341 [pruning] : Maximum branch length : 45
Not sure why the length is one coordinate longer to start with :face_with_diagonal_mouth:
I can understand why the length might be reassessed after pruning so am unsure if this change is deliberate @MaxGamill-Sheffield
However, we still have not pruned all of the branches that the test expected to be pruned (perhaps as we only have three iterations still). The logic for controlling
Looking deeper into this I've found that the initial skeletonisation using the TopoStats method is perhaps the underlying cause of problems we are seeing with pruning.
NB is a branch off of maxgamill-sheffield/800-better-tracing |
Molecule | Pruning Method | main |
ns-rse/818-tests-pruning-topostats-conv |
---|---|---|---|---|
linear |
topostats |
|||
circular |
topostats |
|||
linear |
zhang |
|||
circular |
zhang |
|||
linear |
lee |
|||
circular |
lee |
|||
linear |
thin |
|||
circular |
thin |
Okay this is strange, at the very least I would expect the Zhang/Lee/Thin methods to return the exact same initial skeletons, perhaps the masks that are being passed in are different.
Molecule | Method | main |
ns-rse/818-tests-pruning-topostats-conv |
---|---|---|---|
linear |
topostats |
||
circular |
topostats |
||
linear |
zhang |
||
circular |
zhang |
||
linear |
lee |
||
circular |
lee |
||
linear |
thin |
||
circular |
thin |
This shows that the masks that the refactored code are passing in to each of the skeletonisation methods are different and would go some way to explaining why we are seeing different disordered traces which are the result of Skeletonisation and Pruning^[1].
Now it is worth noting the discovery above that found that a Gaussian smoothed image of heights was being passed into getSkeleton()
as the second positional argument which should be a mask (see line 465).
Currently this has been switched to self.mask
but in light of the above and comparing to the main
branch its clear that the Gaussian filtered image (of heights self.smoothed_grain
) should be converted to a smoothed mask (self.smoothed_mask
?) for passing to getSkeleton
.
On the main
branch the step that does this can be found on line 272 where...
smoothed_grain = ndimage.binary_dilation(self.grain, iterations=1).astype(self.grain.dtype)
...and smoothed_grain
is passed as the second argument to getSkeleton()
on line 282 (note the very_smoothed_grain
that is generated does not appear to be used anywhere).
Thus we need to apply ndimage.binary_dilation(self.mask, iterations=1).astype(self.mask.dtype)
to the mask that is passed into getSkeleton()
in the refactored branch.
Somewhat confusingly there is the dnaTrace.smooth_grains()
method which is new and not currently tested in the existing test suite (it is used in the trace_dna()
method to get the smoothed image but it assigns a binary mask to the self.smoothed_image
which is conflating images and masks again. The existing tests that are under investigation do not use this method, rather they use the dnaTrace.gaussian_filter()
method...
dnatrace.skeletonisation_params["method"] = skeletonisation_method
dnatrace.gaussian_filter()
dnatrace.get_disordered_trace()
This method is applied to the self.image
i.e. heights, but again confusingly this smoothed image is saved to self.smoothed_grain
which is ambiguous so I have renamed it to self.smoothed_image
and will introduce a self.smoothed_mask
which will hopefully remove the ambiguity around what a grain
is (as it looks like its sometimes image heights and sometimes binary masks).
Another and perhaps simpler approach might be to have getSkeleton()
take only a single image, that of heights, and have it always convert this to a binary mask prior to skeletonisation. This would be simpler and reduce the complexity of passing lots of things around that need to be kept in sync.
^[1] : On the main
branch when the method is topostats
skeletonisation and pruning are combined since pruning is called directly after skeletonisation. On the main
branch the zhang
/ lee
/ thin
methods do not call pruning methods. In contrast the refactored branches pruning has been separated out of the getSkeleton
class and as such skeletonisation is under taken and pruning is then performed on the resulting skeleton, regardless of method (i.e. topostats
/ zhang
/ lee
/ thin
skeletonisation are all followed by pruning).
Having now tested the above and ensured that ndimage.binary_dilation()
is applied to the mask and that this dilated mask is passed in the refactored code now passes the same mask into the skeletonisation methods.
main
ns-rse/818-tests-pruning-topostats-conv
main
ns-rse/818-tests-pruning-topostats-conv
This is a step closer, the tests fail as there are differences in the number of points in the disordered trace but for the zhang
/ lee
/ thin
mtehods this is to be expected because pruning is now applied to the skeletons returned by each of these methods and as we can crudely see in the outuput below based on the failed tests the number of points has reduced for these methods. What we need to investigate now is why the topostats
method also reduces the number of points.
====================================================== short test summary info ======================================================
FAILED tests/tracing/test_dnatracing_single_grain.py::test_get_disordered_trace[linear molecule, skeletonise topostats] - assert 125 == 120
FAILED tests/tracing/test_dnatracing_single_grain.py::test_get_disordered_trace[circular molecule, skeletonise topostats] - assert 143 == 150
FAILED tests/tracing/test_dnatracing_single_grain.py::test_get_disordered_trace[linear molecule, skeletonise zhang] - assert 122 == 170
FAILED tests/tracing/test_dnatracing_single_grain.py::test_get_disordered_trace[circular molecule, skeletonise zhang] - assert 149 == 184
FAILED tests/tracing/test_dnatracing_single_grain.py::test_get_disordered_trace[circular molecule, skeletonise lee] - assert 151 == 177
FAILED tests/tracing/test_dnatracing_single_grain.py::test_get_disordered_trace[linear molecule, skeletonise thin] - assert 118 == 187
FAILED tests/tracing/test_dnatracing_single_grain.py::test_get_disordered_trace[circular molecule, skeletonise thin] - assert 175 == 190
============================================== 7 failed, 1 passed in 224.82s (0:03:44) ==============================================
Looking in detail at the final skeletons that are produced by the refactored code now that the skeletonisation method is passed a binary dilated mask and comparing these to the skeletons returned on the main
branch.
The diff
column shows the differences between the two branches (apologies for these not being to scale, its a quick hack to look at differences)
Molecules | Method | main |
ns-rse/818-tests-pruning-topostats-conv |
diff |
---|---|---|---|---|
linear |
topostats |
|||
circular |
topostats |
|||
linear |
zhang |
|||
circular |
zhang |
|||
linear |
lee |
|||
circular |
lee |
|||
linear |
thin |
|||
circular |
thin |
We can see that for the zhang
/ lee
/ thin
methods there is as suggested above an improvement as the branches which were not pruned on main
since pruning was a method of the getSkeleton()
class are now removed on branch ns-rse/818-tests-pruning-toposats-conv
as @MaxGamill-Sheffield has refactored pruning into its own class and it is applied after skeletonisation.
This is an improvement and I would be happy to update the tests in light of these changes as from the diff
plots show the underlying skeletons have simply been pruned.
For the topostats
method though there are subtle changes in the final skeleton. The linear
one has a slightly larger loop in the bottom right hand corner and a kink on the long arm near the top left. Differences in the circular
molecule are harder to detect.
Would be useful to hear your thoughts on what might underpin these changes @MaxGamill-Sheffield and whether what is currently returned is acceptable before I dig deeper. I do appreciate you undertook this work some time ago.
I'm going to go through my current branch and remove reference to grain
and ensure that the terms image
and mask
are instead consistently used to remove any and all ambiguity about what is being passed around/used at different steps.
Sorry for the late reply here are my comments:
main
was replaced with the smooth_grains()
func which takes a grain (binary), dilates or blurs it and returns a binary mask.smoothed_grains
is in fact a mask, not an image. Or at least it should be. None of the variables used in the smooth_grains()
func are images.gaussian_sigma
value found using the hardcoded max(grain.shape) / 256
which is expected to be > 1 but on reflection this might not be the case causing the image not to be smoothed, and skeletonisation failing.Thanks for having a look @MaxGamill-Sheffield
The existing test suite which is where a bunch of tests all fail and I'm currently investigating break down the processing steps and test individually and sequentially (so we can identify at what point breakages occur when undertaking refactoring)...
dnatrace.gaussian_filter()
dnatrace.get_disordered_trace()
<optional additional steps if required>
...as that was how the workflow ran when the tests were written and because gaussian_filter()
smooths an image and not a binary mask its why the tests weren't working.
It's fine to have introduced dnatrace.smooth_grains()
method as an alternative but the tests need updating when such refactoring is undertaken so that they a) actually use the alternative method (i.e. smooth_grains()
instead of gaussian_filter()
; b) account for the changes that these may have on the test results (its ok for test results to change when we know we have modified the underlying methods).
- The variable
smoothed_grains
is in fact a mask, not an image. Or at least it should be. None of the variables used in the smooth_grains() func are images.
That might be true in your refactoring of dnaTrace.trace_dna()
(lines 170-190 of 84d08966) but in that same commit there was, as we discussed, a renaming of self.gauss
> self.smoothed_grain
within the gaussian_filter()
function when a Gaussian blur was applied to self.image
(i.e. the original height images and not a mask).
The unit-tests are failing at tests/tracing/test_dnatracing_single_grain.py::test_get_disordered_trace
and this is because dnatrace.gaussian_filter()
is still being used which blurred the original image and didn't dilate the mask as you say smooth_grain()
now does. In turn the call to dnatrace.get_disordered_trace()
receives image = self.image
and mask = self.smoothed_grain
where self.smoothed_grain
is image heights and not a binary mask. Thus even though in dnaTrace.trace_dna()
the call to .get_disordered_trace()
is preceeded by smooth_grain()
rather than gaussian_filter()
the tests failed as they hadn't been adapted to account for this change.
- The reason for the non-smoothing of the grains might be the terribly low sigma value. Previously, the grains were not passed in 1-at-a-time so the gaussian_sigma value found using the hardcoded max(grain.shape) / 256 which is expected to be > 1 but on reflection this might not be the case causing the image not to be smoothed, and skeletonisation failing.
I've not got to checking the tests for this yet but looking at the above mentioned PR the gaussian_sigma
value is initially hard coded during class instantiation as...
self.sigma = 0.7 / (self.pixel_to_nm_scaling * 1e9)
...(both on main
and maxgamill-sheffield/800-better-tracing
) and it looks, as far as I can tell, that this value would have been used on main
. In maxgamill-sheffield/800-better-tracing
though this value isn't passed as an argument to the call to self.smooth_mask()
(as its coming from the **kwargs
of the parameters defined in topostats/default_config.yaml
) and it is therefore set to max(mask.shape) / 256
or the user supplied value (default null
and so default is to use the maximum mask dimension.
We can likely therefore do away with the line...
self.sigma = 0.7 / (self.pixel_to_nm_scaling * 1e9)
...as it is only used in the gaussian_filter()
method which by the sounds of it is redundant as it has been replaced by smooth_grains()
.
I'll work on tidying this up and removing redundant code and getting the tests back on track.
X-post from slack:
Hey @ns-rse , these functions (gaussian_filter()
and smooth_grains()
) aren’t the same thing, they work on different inputs and also solve different problems:
• Gaussian blurring of the image
in main was also hugely beneficial when we had the “fitted trace” step which attempted to shift the unordered coordinates onto the molecule backbone, a step which is now done using the height-biased skeletonisation method.
• Smoothing of the masks
in 800-better-tracing incorporates the grain smoothing that is done in main’s get_disordered_trace() to combat the production of spurious branches when skeletonising. We now do both a dilation and blur and see which one has affected the least amount of pixels and choose that as the smoothed mask. The “very_smoothed_grain” in main which blurs the smoothed grain is not actually used.
My guess for why the new smooth_mask() function no longer operates as expected and gives different results is because the hardcoded sigma value (! not deffine at init but in smooth_grains()!) is too small due to the new input of a grain with a shape smaller than the older input in the cats branch which was an image, and thus nothing is smoothed.
Hi @MaxGamill-Sheffield
Commits reverted so gaussian_filter()
method and get_ordered_trace()
are reinstated (although tests fail).
skimage.morphology.skeletonize()
does have a T-shaped junction which it has been suggested never arise (more on this below).Tests for
TopoStatsPrune
andconvPrune
attempt to demonstrate the effect of varying parameters has on pruning to which end the following parameters are varied...max_length
height_threshold
method_values
method_outliers
There are some issues identified so far...
convPrune
As noted in the
reason
for the first paramterised test failing which arises whenmax_height = None
(sinceint
can not be compared toNone
) I think there is a problem with the default height being used here.Line 454 tries to set the default
max_branch_length
to be 15% of the total number of elements in the skeleton array.self.skeleton.size
counts the number of elements in an array (and because we are working with 2D images this will always be the product of the number of rows and columns). But the length of this will always be1
since the total number of elements in the skeleton will always be an integer. Thus ifmax_length == -1
then themax_branch_length
will always be 0 since we would always only ever get the number of elements in the array. That said in this dummy example we don't even get that, instead we get an error raisedIts also unclear why the
convPrune._prune_by_length()
method takes the argumentmax_length
when its an attribute of the class which could be used instead.This doesn't happen in the
topostatsPrune
class because the length of the molecule is based on the co-ordinates of the skeleton rather than the size of the array.Excessive Complexity/Code duplication
The splitting of functionality here into multiple classes means there is some code duplication. Both
topostatsPrune
andconvPrune
have methods to_prune_by_length()
conditionally based on the value ofself.max_length
not beingNone
but they do so in different manners. Both methods then go on to calledheightPruning()
in the same manner.The two classes could be combined into one with an option of whether to use the original length based pruning or the convolutional approach and call the appropriate method (renaming the method from
convPrune
toprune_length_by_conv()
and the one fromtopostatsPrune
toprune_length_by_neighbours()
. In turn this would mean that there is no need to have thepruneSkeleton()
factory class (and having that as a class in itself seems like overkill when a series of functions would suffice and address the complaints from Pylint ontoo-few-public-methods
too).Handling multiple grains
The refactoring done previously in #600 removed the loops from every method/function in the tracing module so that it only handled a single grain. The looping over of multiple grains is handled by code in
process_scan.py
, this means theprune_all_skeletons()
methods can be removed too, further simplifying the code base.These will be simple to remove in due course once we have robust tests in place.
T-shaped junctions
Previous work in #835 originally tested whether
pruning.rm_nibs()
"Remove single pixel branches (nibs) not identified by nearest neighbour algorithsm as there may be >2 neighbours" and the current tests show that these are still left by the pruning methods being tested, even though the last step is to re-run the plain skeletonisation method fromskimage
(Zhang's).The following is a reproducible example that shows such a nib remains after pruning
The paramterised test with
id="Length pruning enabled (10), removes two ends leaving nibs on both."
includes the resulting array after pruning and there is a two-pronged fork of nibs when usingconvPrune()
.I think we need to ensure that
remove_nibs()
can handle such T-junctions as well as this dummy example demonstrates they are not correctly removed/pruned.