BlueBrain / morph-tool

https://morph-tool.readthedocs.org/
GNU Lesser General Public License v3.0
8 stars 7 forks source link

iter_morphology output is unsorted #59

Closed arnaudon closed 2 years ago

arnaudon commented 3 years ago

This function is sometimes non-reproducible in the ordering of morphs:

https://github.com/BlueBrain/morph-tool/blob/master/morph_tool/utils.py#L16

I think because of pathlib, maybe we can add an sorted() somewhere to ensure it is?

@asanin-epfl @adrien-berchet , what do you think?

adrien-berchet commented 3 years ago

Yes, sorting is probably the best.

asanin-epfl commented 3 years ago

As an alternative, why it should guarantee any order? If user needs order, he can sort on his side. What do you think?

wizmer commented 3 years ago

I agree with this, yes.

arnaudon commented 3 years ago

It was breaking a test randomly which was hardcoding the expected order of morphs in neuror here https://github.com/BlueBrain/NeuroR/pull/62. That was the origin of this, and I suspect it may happen again. So you think it's better to sort all tests using this function, or just sort it's output?

asanin-epfl commented 3 years ago

I'd say sort it's output where it is needed (in those failing tests).

arnaudon commented 3 years ago

What is the good reason for going through the pain of updating all tests to make sure they won't randomly fail in a future PR rather than a small change here?

wizmer commented 3 years ago

By sorting, you are complexifying the functionality and decreasing the perf. You also need to update the docstring to explicit the expected sorted ordering. It is better to keep functions simple with a minimal functionality. Moreover if a test is failing because it expects a specific ordering, the tests is not testing what it should. I don't know what is the issue in this case but you can probably fix it by change [] to {} somewhere in the expected result.

arnaudon commented 3 years ago

so the simplest is to not use that function in neuror test, and just use a custom sorted one

wizmer commented 3 years ago

what change in perf is sorted inducing?

The change is O(1) to O(N log N). It is massive. Like really massive. Try doing next(iter_morphology_files(folder)) and next(sorted(iter_morphology_files(folder)) on a folder containing a large amount of morphologies on GPFS. You could go from ~10ms to 10 sec easily.

this function is so minimal already it is near useless and seems to only breaks reproducibility

I must be using it in 20 locations lol, I wouldn't call it useless ! And I have no need of the sorting.

the test compares against {morphs[0]: result_morph0, morphs[1]: result_morph1}, where morphs is the output of this non-deterministic function, so the key/values are random and the assert fails random.

Have you tried using assert_dict_equal from nose ? That should be invariant of the ordering. Or dictdiffer ?

arnaudon commented 3 years ago

Ah ok, indeed, it's massive, I didn't realize. The issue is not comparing sorted dict, it is that the dict constructed to compare result has wrong pairs of key/values if the keys are shuffled randomly. I'll put an issue in neuror to fix all the tests then.

adrien-berchet commented 2 years ago

I guess we can close this issue?