BlueBrain / MorphIO

A python and C++ library for reading and writing neuronal morphologies
https://morphio.readthedocs.io
Apache License 2.0
26 stars 24 forks source link

Allow for sanitization bypass for legacy #235

Closed arnaudon closed 3 years ago

arnaudon commented 3 years ago

It may be useful to have a hidden option to not sanitize (at least unifurcation merge), for legacy mostly. There are two branches doing that curently: https://github.com/BlueBrain/MorphIO/tree/no-sanitize https://github.com/BlueBrain/MorphIO/tree/unifurcation

this option should be very hidden, so nobody can do it easily ;)

mgeplf commented 3 years ago

As evidenced by the branches that do it, it's something that's come up a few times.

I think we should decide what the behavior is going forward, because the auto-magic merging has bitten us a number of times.

My view is that when a morphology is loaded, it should represent what is on disk as faithfully as possible; so as to not confuse unnecessarily the consumers - for instance the changing of section_ids post-sanitization. A user could choose to sanitize post loading if they wanted to, but that could be a section function call or a loading option - not the default.

The benifit of this is that it is backwards compatible with how morphs have been written, compatible with the morphs that are out there, and doesn't put the onus on the consume to track the changes if they need to deal with legacy.

I would like to hear/get a consensus from other consumers: specifically the HPC team (@matz-e, and anyone you recommend) and the VIZ team (@NadirRoGue, and anyone you recommend)

On the write side, when a format is being written that cannot support certain features (ie: SWC can't handle unifurcations), I believe this should raise an error, so the user can choose what to do (ie: they have to run morph.sanitize() or eliminate_unifurcations(morph). With in the morphology-release process, this could be required - however, that only changes release going forward.

NadirRoGue commented 3 years ago

I agree with @mgeplf . An option to choose whether branch merge should be applied or not would be quite useful. For visualization, we should be able to show what the actual morphology looks like.

Also, as @chevtche can explain, branch merging would affect the results of other programs, such as LFP calculations on EMSim. (At the momment EMSim relies on Brion to read morphologies, which relies on the unifurcation version of MorphIO)

matz-e commented 3 years ago

Same here, I second/third @mgeplf. @alexsavulescu and @pramodk could also give their opinions?

alexsavulescu commented 3 years ago

All in favour on my side. Older discussion here: https://github.com/BlueBrain/MorphIO/issues/175 Please note that https://github.com/BlueBrain/MorphIO/tree/unifurcation can be considered as a dirty-fix, since I was having some issues with implementing the CMake option (don't remember which ones, I vaguely remember due to switching between mutable/immutable).

Additionally there is no way as of now to know if a circuit has unifurcations or not before simulation time. We could add a flag somewhere from now on and display a warning accordingly.

chevtche commented 3 years ago

Yes, if you merge some branches in the morphology you lose the one to one mapping with the report files. The result is a nice crash most of the time !

mgeplf commented 3 years ago

Ok, it seems that this feature is needed. Can we agree on how it should be implemented:

1) Default is to load w/ no sanitization: have a function to sanitize 2) Default is to load sanitized, have an option to load unsanitized

@arnaudon, @NadirRoGue, @matz-e, @alexsavulescu, @chevtche?

alexsavulescu commented 3 years ago

How about introducing a flag:

  1. Check for flag if sanitized or not and load accordingly. If flag not present, then 1.

Otherwise I'd go with 1, since that would be the most common case?

matz-e commented 3 years ago
  1. Default is to load w/ no sanitization (unless a dissenting option is given): have a function to sanitize

FTFY as @alexsavulescu said…

NadirRoGue commented 3 years ago
  1. Default is to load w/ no sanitization (unless a dissenting option is given): have a function to sanitize

I agree with this. And just to understand it better, this is a runtime option I guess? Or is it a CMake flag?

mgeplf commented 3 years ago

Check for flag if sanitized or not and load accordingly. If flag not present, then 1.

Yeah, makes sense.

this is a runtime option I guess? Or is it a CMake flag?

Definitely a runtime option.

arnaudon commented 3 years ago

From discussion with @wizmer , the fact that unifurcations can be considered as errors related to a specific file format (not present in swc for example), and that unifurcations do not make any sense from a mathematical point of view (neurons are rooted trees, which have degree 1 for termination, 3 for branches, and n for root node), I would be in favor of having sanitize by default, and a hidden flag to deactivate it, in the very specific cases of using legacy reports/asc morphologies done before sanitize in MorphIO. These legacy use cases should progressively disappear with time I guess. Maybe something hidden like that?

import morphio

morphio.keep_unifurcations = True
mgeplf commented 3 years ago

the fact that unifurcations can be considered as errors related to a specific file format (not present in swc for example),

That should be raised when trying to write an SWC file.

unifurcations do not make any sense from a mathematical point of view

It's my view that MorphIO is a low level library that should faithfully represent what is on disk - if someone wants to impose different rules, they can do so, but it shouldn't be the default behavior.

For example, getting everyone to agree on what 'sanitized' means would be difficult, if not impossible (unifurcations is only one potential sanitization), so it needs to be up to higher level consumer libraries to implement what they need. I compare it to a text editor: if you opened a txt file, you wouldn't want all the spelling errors corrected; you want to see what is on disk as much as is possible - and you can choose to fix errors or formatting.

morphio.keep_unifurcations = True

Global flags like this are tricky, so I'd prefer to avoid that - if you have a tool that uses a library, one would have to always check this flag to make sure it hasn't been set by another piece of code, which seems error prone.

chevtche commented 3 years ago

I strongly beleive that anything that is changing the data you are loading should be disabled by default. MophIO is used as a loader, it is not supposed to silently "repair/break" stuff. I think that will create tons of problem for people that are not aware of this "sanitization". We already lost a lot of hours understanding what was wrong when we were trying to load report files. In general, any morphology related files created using the data on disk will potentially fail by default if the data in memory is not the same as on disk by default.

arnaudon commented 3 years ago

I understand very well your points! It's not my decision anyway. Just a few things to consider:

mgeplf commented 3 years ago

Thanks for the input everyone, it's important to me that if we make a change in, everyone is on board.

I think that will create tons of problem for people that are not aware of this "sanitization". We already lost a lot of hours understanding what was wrong when we were trying to load report files.

This is what I'm trying to clear up and to avoid; this has resulted in extra effort and lost productivity.

It's not my decision anyway

I'm trying to have everyone understand the tradeoffs, and be involved in the decision; it's important to have representation from multiple users of MorphIO so we cover things properly.

the real sanitize is here

Which, to me, makes sense: this is a higher level library that can leverage MorphIO, and apply what is necessary.

I guess there was a good reason why you added this automatic 'sanitize' in the first place in MorphIO, but I wasn't part of that discussion, so I don't know the rationale behind it at that time

I think with the experience we have now, this thread can be considered an update to whatever discussion that was held in the past.

if MorphIO does something to the morphology upon loading, big warnings appear in the logs

But there is no way of recovering the underlying data; back to the text editor analogy: if your editor popped up a dialog saying 'we fixed all the spelling errors', it would be quite frustrating to use, no? I'd rather it loads my file, and allows me to run a spell checker, and save the file.

at least in BBP, all morphologies must be processed by us

That is the policy going forward, though, but as you noted MorphIO is used with legacy circuits, and morphologies outside the control of BBP, so I having it represent the unmodified morphology seems like the best bet.

If the morphologies have already been sanitized, they will load correctly (ie: already sanitized) unmodified, and if otherwise, they will load, and contain whatever warts they do on disk, but it's up to the user to perform what sanitization they see fit.

arnaudon commented 3 years ago

So would that be a valid solution?

matz-e commented 3 years ago

but produce warning with an explanation on how to fix

Please don't produce unnecessary output. If you read a large circuit with old morphologies with any MPI program, the program's output is going to be obliterated by warnings. Warnings most users don't care about and that will obstruct and bury other, more essential information. Users should pass a sanitize=true (or similar) flag, or have another attribute to query if the morphology is clean/sanitized.

alexsavulescu commented 3 years ago

Same type of warnings can be suppressed so as not to pollute logs.

matz-e commented 3 years ago

Still not a fan. If there's a warning and I want to preserve one, that's going to be one per process with MPI… which works out to be still several thousand warnings in some cases. I also don't expect libraries to print stuff by themselves.

arnaudon commented 3 years ago

Wait, is the current version of MorphIO causes you any trouble in this regard? Here is how one can deactivate all such warnings: https://github.com/BlueBrain/MorphIO#maximum-number-of-warnings

I'm not sure where MorphIO is used, but the usecase I had with Armando was not a problem for him, I think he runs simulations with neurodamus, which does not seem to use MorphIO, as his pre-sanitized morphologies still work with his circuit.

Also, these warnings are not really warnings to pollute logs, these are proper error messages one has to address. In very very rare legacy cases, one could be able to bypass them, but going forwards, there will not be any, as explained above.

matz-e commented 3 years ago

Also, these warnings are not really warnings to pollute logs, these are proper error messages one has to address.

Shouldn't the program fail, then? I've seen too many scenarios where everyone just thinks their library should print some errors as warnings, mixed in with a bunch of "informative" output. Things get lost easily…

wizmer commented 3 years ago

Having a flag to bypass sanitization is a recipe for disaster

That would mean that the same morphology file has multiple representation. Taking the code editor analogy, that would mean that the content of the file "I like apple" could be "I like apple" or "I like oranges" depending on the flag you used to open your editor. I'm pretty sure you would not like that.

In the BBP context, that would mean section IDs are useless if you don't know in which mode the file was opened. An example: @matz-e is running Touch Detector and noticing an unusual behavior on section ID 27. He asks the viz team and @chevtche to display him section ID 27 with Brayns. It turns out Brayns use the sanitization flag and Touch Detector does not so the section 27 in Touch Detector is not the same as the section 27 in Brayns. @chevtche and him will lose lots of hours trying to find out what's wrong until they discover they are not reading the morphology with the same flag.

I'm telling you. I will no longer be there to see it, but my prediction is that Jira will be filled with tickets related to section ID mismatches that are caused by inconsistencies in the use of the sanitization flag.

Sometimes multiple files have the same representation

Taking the analogy of the code editor again. Multiple utf-8 characters may end up being represented by the same glyph in your code editor. And no one complains, and that's perfectly fine.

I strongly believe that:

(0 0 0 1)
(0 0 1 1)
(0 0 2 1)
(0 0 3 1)

and:

(0 0 0 1)
(0 0 1 1)
(
  (0 0 2 1)
  (0 0 3 1)
)

represent the same entity. My guess is that the extra parentheses that are sometimes found are an artifact caused by NeuroLucida.

What is a section ?

The MorphIO definition of a section is:

Section: a section is the succession of points between two bifurcations.

Correct me if I'm wrong but I think the definition is pretty clear. MorphIO is not lying on the object it gives you. However MorphIO does not claim the sections ID will matches section IDs you generated or got from another library. Oh, no. MorphIO is certainly not responsible for this. And this has never been part of the requirements.

Not sanitizing means that the previous definition is no longer true. Instead, it would become:

Section: a succession of points

So yes, that's true. MorphIO now offers a lower level object. But that's not a section, let's call it a portion in the following. Designing an API is all about finding the correct balance between low level and high level.

And in the case of the new definition, I think it is too low level. Since you can no longer assume that a section starts and ends with a bifurcation, this new object is way less useful in the majority of the cases.

Examples:

I have been working on morphologies for 3 years and not a single time I have needed to access a portion. Never. Ever. The entity useful for science is the section, not the portion.

The real issue

It is reading section IDs not generated with MorphIO. MorphIO is self consistent, the issue only happens because we are trying use MorphIO with section ids generated with another tool.

Solutions

  1. A sanitization flag As already said, that would amplify the potential of section ID mismatch 10-fold.

  2. Re-index older circuit This is pretty costly, but this is a one time operation. And then, this issue should be gone forever.

  3. Implement Morphology::portion A portion would be an un-sanitized section. It seems different people want to use different objects. Let's let them do so. We could have Morphology::portion in addition of Morphology::section. It would return you a Section object as well but without any garanty that there is a bifurcation at the start and end point. There is still a potential of mismatch of course, but it can be reduced by using the proper semantic.

I think I like 2) a lot. I am scared by the consequences of 3), but maybe it's not so bad. The thought of 1) makes me shiver the most.

Bonus

Shouldn't the program fail, then? I've seen too many scenarios where everyone just thinks their library should print some errors as warnings, mixed in with a bunch of "informative" output. Things get lost easily…

I think we should discuss this somewhere else. This is a good question though.

arnaudon commented 3 years ago

+1 ! to me, the best option is to leave it as is, and maybe rebase one of the no-sanitize branch I mentioned above, in case one need to preserve old, un-reproducible circuit built. I would not go for a portion of sections, and as mentioned above, if option 1 above is chosen, please make it very very very hidden, so only very few people can use it and it will limit the problems benoit mentioned. The most hidden being in a separate branch I guess?

NadirRoGue commented 3 years ago

I dont agree with that proposal. MorphIO is an I/O library, it reads and writes to disk, its not its task to perform any operation on the morphologies (Why not rename it to MorphIOFixer otherwise). That should be left to other libraries/software whose purpose requires it. Being able to read whatever is on disk, and then apply any sanitization operation on demand is the way to go for me. And then fix the source of all this problems, and/or make sure this type of files are not produced again. On the other hand, if the library sanitizes the morphology without any option for the library client to modify that behavior, we lose the possibility to read what it is really on disk.

arnaudon commented 3 years ago

As is morphio now, could you list the issues you had with this automatic sanitize? Suppose we didn't start this thread, would you need any changes to MorphIO?

NadirRoGue commented 3 years ago
  1. We spend 4 whole days debugging to know why EMSim was failing to compute the LFP on a test circuit which always worked before. The reason was that Brion switched to read the morphology files using MorphIO, which was applying this sanitization and breaking the 1 to 1 mapping with the compartment report.
  2. For visualization we might want to see what is on the disk, which currently is impossible
  3. For me the most important reason: It is an IO library. IMO making an IO library do non IO stuff is the first step towards a bad and confusing design.
matz-e commented 3 years ago

@matz-e is running Touch Detector and noticing an unusual behavior on section ID 27. He asks the viz team and @chevtche to display him section ID 27 with Brayns.

Different take: I don't want to bother Viz/already have my some blender scripts sitting around. They read a vanilla HDF5 file, because the format is quite easy to understand. I render what I think is section 27 by reading the offsets of rows 28, 29 of the structure and extracting the point cloud. Now what did I really look at?

Or, using the editor analogy: I look at a file in MorphIO, and all I see is British English. I decide to do some quick manual investigation, cat the file, and suddenly am missing a lot of spurious us because the on-disk format is actually American English… (or worse, Esperanto).

As is morphio now, could you list the issues you had with this automatic sanitize?

I moved our custom HDF5 readers over to MorphIO. I made sure that they were doing exactly the same thing. Then a few months or a year later, this "sanitization" magically turned on and wrecked my CI. 0 warning. I would have to fix every single circuit file, adjust report ids… how do you track down every single file involved in legacy circuits? Not to speak of references in personal notes, examples, etc?

I dont agree with that proposal. MorphIO is an I/O library, it reads and writes to disk, its not its task to perform any operation on the morphologies (Why not rename it to MorphIOFixer otherwise). That should be left to other libraries/software whose purpose requires it.

I like this point. I was being told before that we could not have certain features in MorphIO because it was a pure I/O library. IMO, sanitization also falls into "not purely I/O".

IMO, we have legacy circuits, and those should be treated as before. Yes, it does not make sense that their sections are really just portions. But breaking compatibility this way, without a way to track and fix every single entity that uses old section IDs is not really OK in my book. We have the same "representation" that changed historically.

Going forward, we can ensure that every morphology we use has no unifurcations, and all the whole whizbang about being "proper". But that should be a separate step that, e.g., external files have to go through, and then we should use a different identifier to make sure whatever we use uniquely references a sanitized morphology.

alexsavulescu commented 3 years ago

Would the following cover most/all aspects of the matter at hand:

arnaudon commented 3 years ago

Thanks for all the details, indeed, it causes some issues.. Isn't this solution the same as https://github.com/BlueBrain/MorphIO/issues/235#issuecomment-785008940 ? We got diverted from it with warning logs ;)

alexsavulescu commented 3 years ago

Indeed we got sidetracked :) Just to stress that as long as sanitization is done at write time and on-demand, reading should become transparent, right? As for warning logs, we could add isSanitizable() and those interested in performing sanitization can call it offline.

NadirRoGue commented 3 years ago

Indeed we got sidetracked :) Just to stress that as long as sanitization is done at write time and on-demand, reading should become transparent, right? As for warning logs, we could add isSanitizable() and those interested in performing sanitization can call it offline.

+1

wizmer commented 3 years ago

MorphIO does not "sanitize" anything. It is not true. You can't say it "sanitizes" if you don't provide a specification first. MorphIO has an API that exposes Section objects. Where the definition of a Section being:

Section: a section is a succession of points between two bifurcations.

If the definition would be Section: a succession of points wrapped in an S-expand MorphIO would return what it currently returns, then you could say there is a sanitization. Because it would transform the object before returning it.

But here, it exactly returns you the object that matches the definition above. As per the above definition:

((Dendrite)
 (0 0 1 1)
 (0 0 2 1)
 (
   (0 0 2 1)
   (0 0 3 1)
 )
)

has a single section and sanitization has nothing to do with it. There is absolutely no ambiguity.

I know you guys, just wanna say "sanitization go brrr, all section ids are messed up". I understand this. This is very valid. And this is a very valid issue !

All I'm saying is that not "sanitizing" is the equivalent of changing the definition of a section. This has very detrimental implication to people actually working on morphologies. Most of the people in the thread are not actually working on morphologies but simply need to fetch section ids. You may not think changing the definition is a big difference, but for people that actually study morphologies (ie. compute morphometrics based on them) this is a massive change. All the unit tests are going to break actually. And I believe that the new definition is less usable for those people. MorphIO has been designed to abstract file formats from the scientists so that they can reason easily about the topological trees that are the neurite. Having a matching of the section ID with the older readers has never been part of the specification. To tell the truth, when I designed MorphIO I did not even know section ids were stored externally !! Having a one to one match of the section IDs provided by another tool (which one, I don't even know !), has never been in the scope of MorphIO. Section IDs are not universal, there is no absolute section IDs when reading a graph. The real issue in this ticket is not sanitization, it is that you are annoyed by the fact that MorphIO does not provide a one to one matching with the section IDs of the whichever other tool.

That's why I proposed to implement a new method Morphology::portion that will return the raw s-exps while keeping the Morphology::section object as it currently is. To me these two objects correspond to two different concepts, used by different people. A portion (the object you want to have) is of no use when studying morphometrics while a section is of no use if you want to deal with legacy circuits. To me, it totally makes sense to expose both.

TLDR: I just want to highlight that this thread is not about "sanitization" but about changing the API and touching the core definition of what is a Section.

I can write the new definition for you:

Section:
- A succession of points between two bifurcations if the file is in SWC format.
- A succession of points (not necessarily between two bifurcations) in an S-exp if the file is in ASC format.
- A succession of points (not necessarily between two bifurcations) corresponding to a block in an H5 format.

instead of:

Section: a succession of points between two bifurcations

Pinging @eleftherioszisis @lidakanari @wvangeit @mariarv @jacquemi-bbp

mgeplf commented 3 years ago

I think we've covered the main points, which should hopefully be helped w/ the meeting we've set up (please forward that to anyone you think has an interest).

A few points:

And we don't care that NeuroLucida added an extra set of parentheses when writing the file.

The fact that this can happen, and does happen, suggests that scientists want 'sections' with unifurcations. Since neurolucida is the industry standard, I think we need to handle it as well. It makes the definition of section less concrete, which is frustrating, but it should be up to consumer of a morphology to ask for this to be corrected.

Everywhere where someone uses MorphIO to iterate on sections (with the actual definition), they would need to add:

Yes, that is what a higher level library can do, or it can run a MorphIO::remove_unifurcations function. But it is the choice of the higher level library.

MorphIO is self consistent, the issue only happens because we are trying use MorphIO with section ids generated with another tool.

We want MorphIO to not just be self-consistent, though. For instance, there has been talk of using MorphIO in NEURON, and arbor. If NEURON did indeed start using it, and ASC files are being 'sanitized', people's output would change, as well.

Re-index older circuit: This is pretty costly, but this is a one time operation. And then, this issue should be gone forever.

It's not just the circuits, it's somehow fixing all the simulations that have ever been run, and every analysis that has ever been run on those simulations.

Most of the people in the thread are not actually working on morphologies but simply need to fetch section ids.

I don't think that's fair; MorphIO isn't just targetted at one set of users. It's supposed to be the bedrock of morphology loading, such that we don't keep writing the same parsers over and over; if we make it hard/impossible to use for a particular task, then when end up back in the world where people write their own parsers.

So, we're back to the original point, how do we handle this, so that everyone can get their work done.

The options are:

Hopefully we can answer that in the meeting; if there are any other options that are missing, please add them to the ticket.

wvangeit commented 3 years ago

I know we have a meeting about it tomorrow. But let me just give my point of view.. Imo the problem starts here: "that unifurcations do not make any sense from a mathematical point of view (neurons are rooted trees, which have degree 1 for termination, 3 for branches, and n for root node)". This is in my opinion a too narrow definition. Yes, for 'most' of our use cases this currently applies, but it's not true in general. Sections have a very specific use in Neuron, i.e. they have mostly the same ion channels etc. E.g. the axon hillock, initial segment, myelinated part etc, can all be consecutive 'without' any branch point, and still they are all separate entities. Especially if you're talking about the Neurodamus section-ids, this becomes super important. Because we might to cut after this sections, or record from these explicitly. I know SWC doesn't really support this, and I consider this one of the problem of SWC. If MorphIO would ever read HOC morphs, this kind of meaningful unifurcations could be everywhere.

This is just anecdotal, but it makes me support the people who say MorphIO should not impose a model on morphologies (or at least make it as broad as possible). So it should not sanitize unifurcation away, it should write exactly what it received, if it's not possible it crashes and asks people to specify a flag/option somewhere. Should we e.g. for SSCx all start from a set of morphs without unifurcations? Probably. But this unifurcation sanitization process should not be an automatic part of reading/writing morphs.

mgeplf commented 3 years ago

Thanks everyone for a productive meeting.

The decision was: 1) MorphIO will only load 'raw' morphologies; no unifurcation collapse will happen. It is only making what is available on disk accessible; this way there is no ambiguity on what the contents of a morphology are when different applications read one, as long as they are using MorphIO 2) Higher level libraries need to be aware that they have to deal with this, for instance I've made the following NeuroM ticket: https://github.com/BlueBrain/NeuroM/issues/872 3) Any curation that is required (including unifurcations) will be handled by other high level libraries, and they can choose how to write to disk; this is outside of the scope of MorphIO reading.

We'll make the source change, and release a version in the next short while.

tomdele commented 3 years ago

I think we fixed all the problems here and we released the : https://github.com/BlueBrain/MorphIO/releases/tag/v3.0.0

that includes the changes about sanitization.