Unique ID overflow - Githubissues

Sebastien-PILUSO commented 5 months ago

The new unique IDs generated by the atlas pipeline has its own pre-defined limit (https://github.com/BlueBrain/atlas-splitter/blob/main/atlas_splitter/utils.py#L12-L13). This results into 10-digits figures which can lead to an overflow for most current suftwares used in neuroscience in general. I am providing two examples of softwares: ITK-snap which is widely used by hospitals in neuroscience (coded as uint16 apparently) and BioExplorer developped here at BBP. Both softwares meet an overflow issue when trying to observe the atlas data. An illustration of the potential repercussions resulting from that overflow, even using in-house recently developped softwares such as BioExplorer from BBP: see the layers 2/3 plus the barrels which have abnormal zero neuron density value in the joined figure.

So I would definitely suggest that we put efforts for trying to make at least our IDs visible in most softwares. To do so, it seems my proposal 1 is interesting:

Identify a large range of unused figures: 7000-8000 for instance
Iterate on it: 7001, 7002, 7003, … In total, the Allen Institute has a bit more than 1000 ids in the hierarchy -which is upposed to be stable), on which we added about 100 new IDs. So I supppose this is feasible to identify such a range in the hierarchy. Selecting such a range has the benefit to make things clear for the scientific comunity: new IDs added in that range correspond to BBP adding.

Capture d’écran du 2024-05-21 18-06-14 b9cf6dd)

eleftherioszisis commented 5 months ago

The data type of the ids in the hierarchy should not be determined arbitrarily by the existing ids in the hierarchy.json but from the data type of the nrrd annotation volume, provided by Allen, which I have the impression is uint32. @Sebastien-PILUSO is that correct?

mgeplf commented 5 months ago

We can't work around ITK-snap, as mentioned above it's 16bit. I'm ambivalent about a change; I think that fixing BioExplorer shouldn't be too bad. I'm not sure why we would change to signed numbers, as we won't have any negative numbers, though.

Sebastien-PILUSO commented 5 months ago

@eleftherioszisis thanks for your message. Let me re-explain (see the second figure for illustration).

(A) Most of Allen's IDs (more than 90%) are between 1 and 32,767, ie most softwares are able to read them in int16 without any trouble. (B) Some Allen's IDs (less than 10%) are between 32,767 and 2,147,483,647 (int32). Most software, particularly visualisation software, encounter problems reading such high IDs. (C) DKE's unique IDs generated are between 1,005,899,076 and 3,999,683,881, so same than above, most software fail into reading such high numbers.

Hard rules are: (1) The hierarchy is supposed to be stable, ie the IDs are not supposed to change. (2) We cannot change Allen's IDs.

So my concern is here: given (1) and (2) and as we have the possibility to select the range in which those unique IDs are going to be generated, I highly recommend to define them within the first range of values (A) for more genericity, ie just changing the minimum and maximum values in the generator as pointed out. To me, this is independent of any data type in files such as the annotation volume. The idea is just to ensure that as many people as possible can read and properly visualize the IDs we produced.

@mgeplf thanks for your comment. Maybe you can't work around ITK-snap, but that's what scientists do. Fixing BioExplorer is not the solution either. There are so many other visualisation software packages that are not compatible with such high IDs, sometimes even developed in-house in laboratories. The data we produced are not supposed to be readable only in our platform; on the contrary, they are intended to be widely used externally among the scientific community. And our responsability is to make them as accessible and generic as possible, to encourage their use. That's why I believe that this small effort is worth it.

mgeplf commented 5 months ago

Like I said, I'm ambivalent. Since the ship has sailed on the IDs being compatible with ITK-snap, there's not much we can do. Not exacerbating the problem by staying with the range of unsigned int 32 seems reasonable.

jdcourcol commented 5 months ago

ITK-snap which is widely used by hospitals in neuroscience (coded as uint16 apparently)

What is the source for that ? I did not find anything with google.

eleftherioszisis commented 5 months ago

@Sebastien-PILUSO , yeah I understood all that and the issue at hand.

The reason behind my asking for the data type in the nrrd annotation volume was mainly to clarify if Allen has provided a specification of a range of reserved ids they would possibly use in the near or far future. The annotation I inspected, which is part of the previous atlas release, was indeed storing uint32 ids. However, I don't know if this is what was originally provided by Allen.

Arbitrarily picking ids inside that range could result in collisions in the future. However, if we create a static mapping of the ids, and add collision checks, then we could fit the new ids within the range that works with ITK remapping of the values into its int16 internal representation.

The internal representation of ITK snap is an implementation choice to save up memory and is not how images are necessarily stored. What it does is linearly remapping different types of values into its internal int16 representation. http://www.itksnap.org/pmwiki/pmwiki.php?n=Documentation.PrecisionWarning

jdcourcol commented 5 months ago

What it does is linearly remapping different types of values into its internal int16 representation.

so this means that we cannot have more than 65,535 values right ? So where is the overflow happening with itk-snap ?

eleftherioszisis commented 5 months ago

I tried loading the latest annotation from staging:

It emitted a warning about the precision loss due to the remapping in the int16 range.

Sebastien-PILUSO commented 5 months ago

@eleftherioszisis thanks for the precisions.

if Allen has provided a specification of a range of reserved ids they would possibly use in the near or far future.

The Allen Institute did not provide any specification of a range of reserved IDs they would possibly use in the near or far future because of rule (1) I mentioned in my previous comment.

I don't know if this is what was originally provided by Allen

I don't know about the previous version you inspected, but the version currently released is indeed storing uint32 values in it, sticking to the native Allen atlas version.

Arbitrarily picking ids inside that range could result in collisions in the future.

I don't see why one particular ID picked in data range (A) could result in collision more than any other picked in the current range defined by DKE. No rule has been given stating that IDs in data range (A) are more likely to be used by the Allen later.

However, if we create a static mapping of the ids, and add collision checks, then we could fit the new ids within the range that works with ITK remapping of the values into its int16 internal representation.

Yes this would be great!

The internal representation of ITK snap is an implementation choice to save up memory and is not how images are necessarily stored.

Exactly, and that's precisely the point I was making: most visualisation software will use the same technique to save up memory. All the more reason to make this change and avoid any bad consequences.

jdcourcol commented 5 months ago

It emitted a warning about the precision loss due to the remapping in the int16 range.

ok, but there is no overflow.

jdcourcol commented 5 months ago

I don't see why one particular ID picked in data range (A) could result in collision more than any other picked in the current range defined by DKE

currently, the id defined by DKE is a hash in that huge range . If we select a smaller range for the available value for that hash, the likelihood of a collision is higher

jdcourcol commented 5 months ago

Some Allen's IDs (less than 10%) are between 32,767 and 2,147,483,647 (int32). Most software, particularly visualisation software, encounter problems reading such high IDs.

What is the source of that ?

Sebastien-PILUSO commented 5 months ago

@jdcourcol I will try to do my best for well explaining the overflow issue on ITK-snap specifically. To complete the warning message illustration given by @eleftherioszisis, here is a concrete example. In the attached figure, I am pointing out a specific label which is Secondary motor area, layer 3. This particular label is concerned by the new ID generator from DKE given the fact we generated it in the atlas pipeline (layer2/3 split). In the hierarchy, the unique new ID attributed by the generator is 2511156654 for this region. However, when observing the data on ITK, you can clearly notice that this layer is labeled as being 13742, which does not exist in the hierarchy. ITK-snap does this because int16(2511156654) = 13742. When visualizing this new region we produced on ITK-snap, it is not possible to tell directly which region it refers to because of the overflow. In that particular view of the brain, I applied the Allen Institute color code. But as this particular region was not detected in the hierarchy, it was not possible to attribute its real color (which is green) because of the overflow. So a random color was given to this region.

Sebastien-PILUSO commented 5 months ago

currently, the id defined by DKE is a hash in that huge range . If we select a smaller range for the available value for that hash, the likelihood of a collision is higher

I don't see your point. The range has no relationship with the collision with potential new IDs. Basically, each number has an equal probability to be selected by the Allen Institute, from 1 to the infinite, as there is no given rule for choosing such figures. Choosing high numbers does not decrease the probability of collision.

jdcourcol commented 5 months ago

he range has no relationship with the collision with potential new IDs. Basically, each number has an equal probability to be selected by the Allen Institute, from 1 to the infinite, as there is no given rule for choosing such figures. Choosing high numbers does not decrease the probability of collision.

My point relates to the ids generated by DKE when we generate new regions. There is no collision check at the moment. Therefore the smallest the range, the highest the likeliness of collision between these ids.

Sebastien-PILUSO commented 5 months ago

What is the source of that ?

If you are talking about the Allen's ID range, it comes from the hierarchy file. The Allen Institute have not provided any further justification of how they chose their numbering system.

Sebastien-PILUSO commented 5 months ago

My point relates to the ids generated by DKE when we generate new regions. There is no collision check at the moment.

Ah ok I see your point now. The native hierarchy file from the Allen Institute gathers about 1000 ids. Storing those IDs in a list and add a simple collision check on this list would be very simple and appropriate. This won't be time consuming as the list is very short, and safer in all cases.

BlueBrain / atlas-splitter

Unique ID overflow #15