brainglobe / brainglobe-atlasapi

A lightweight python module to interact with atlases for systems neuroscience
https://brainglobe.info/documentation/brainglobe-atlasapi/index.html
BSD 3-Clause "New" or "Revised" License

BrainGlobe Atlas API Version 2 #141

adamltyson opened 1 year ago

adamltyson commented 1 year ago

This is essentially a reply to https://github.com/brainglobe/bg-atlasapi/issues/96, but I'm starting a new issue to track this idea. Sorry for the long post, but interested in your ideas, @brainglobe/maintainers @brainglobe/swc-neuroinformatics @brainglobe/czi_eoss5.

After 2.5 years, and based on conversations with various users and atlas creators, I think it's time for bg-atlasapi version 2. Version 1 works very well for "classical" anatomical atlases (i.e. one reference image and one annotation), but it doesn't cater well (or at all) for:

  • Atlases with multiple reference images (e.g. kim_dev_mouse and mpin_zfish)
  • Atlases that exist at multiple resolutions (the atlas API doesn't link these in any way)
  • Atlases with the same annotations in different coordinate spaces (e.g. allen_mouse and perens_lsfm_mouse)
  • Atlases that aren't represented by brain regions (e.g. cell atlases)
  • Atlases (or coordinate spaces) that incorporate other data (e.g. tracing, gene expression)
  • Atlases being updated (as discussed above)

The atlas generation process also needs streamlining.

My idea for V2:

Move away from the monolithic atlas structure

The atlas could be defined by a config file, specifying atlas "elements" (an element being reference image, set of meshes etc):

 name: "allen_mouse"
 atlas_link: "http://www.brain-map.org"
 version: 2.0
...
...
reference_images: 
 # could be >1
  STP:  some_url
annotation_images: 
 # could be >1
  CCFv3:  some_url
structures: some_url
meshes: some_url

When an atlas is downloaded, the API would check which of these files already exist locally, and then download only those required. The idea is that there would be a lot of overlap between atlases (same meshes at different resolutions, same reference image for multiple annotations, etc.), so this would reduce download times and save disk space.
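
As a minimal sketch of this element-level caching (the cache location, function name and identifiers are illustrative assumptions, not the actual API):

    from pathlib import Path
    from urllib.request import urlretrieve

    # Hypothetical local cache shared by all atlases: elements common to
    # several atlases are stored (and downloaded) only once.
    CACHE_DIR = Path.home() / ".brainglobe" / "elements"

    def fetch_element(element_id: str, url: str) -> Path:
        """Return the local path of an atlas element, downloading it
        only if it is not already cached."""
        local_path = CACHE_DIR / element_id
        if not local_path.exists():
            CACHE_DIR.mkdir(parents=True, exist_ok=True)
            urlretrieve(url, local_path)
        return local_path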

This would also allow data to be stored somewhere other than GIN. I'm not sure whether we want to do this, but it may be necessary for e.g. larger atlases (see below).

Improve versioning

Essentially as per https://github.com/brainglobe/bg-atlasapi/issues/96. We could version the elements individually, and a versioned atlas could specify these, e.g.:

 name: "allen_mouse"
 atlas_link: "http://www.brain-map.org"
 version: 1.0
...
...
reference_images: 
  STP:  some_url@v1.0.0
annotation_images: 
  CCFv3:  some_url@1.0.0
structures: some_url@1.0.0
meshes: some_url@1.0.0
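
A small sketch of how such versioned specifiers could be parsed (the url@version form is taken from the example above; the function name is hypothetical):

    def parse_element_spec(spec: str) -> tuple[str, str | None]:
        """Split an element specifier like "some_url@v1.0.0" into
        (url, version); a spec without "@" pins no version."""
        url, sep, version = spec.rpartition("@")
        if not sep:  # no "@": treat the whole spec as an unpinned URL
            return spec, None
        return url, version.lstrip("v")

    # parse_element_spec("some_url@v1.0.0") -> ("some_url", "1.0.0")
    # parse_element_spec("some_url")        -> ("some_url", None)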

Improve the atlas generation process

I think the PR to bg-atlasgen has worked ok, but the repo itself needs a lot of refactoring to improve it. Submitting a new atlas could become more complex though, if the user is supposed to select which pre-existing atlas elements can be re-used. We end up spending a lot of time on these pull requests, so maybe we could:

  • Provide a form (possibly an issue template) that asks for all the relevant information.
  • Develop tooling to go from form to config file.
  • Submit the PR ourselves?
  • Develop tooling to rigorously check a new atlas against BrainGlobe tools (especially all parts of the API).

Introduce "relationships" between atlases

Lots of atlases are related in some way, e.g.:

  • They are the same atlas at different resolutions
  • They are the same annotations with different reference images
  • They are a version of another atlas in a different coordinate space

It would be useful to introduce two concepts:

  • Grouping (easy) - make it obvious in e.g. the CLI and website that different atlases are related
  • Transforms (hard) - store transforms from one atlas to another in a standardised form, and provide methods to transform between them. For the different resolutions, this is possible, but not all the other atlases have a transform between them. We would need transforms for various types of data (e.g. images & objects). This may end up being for V3.

Include additional data

There are different types of atlas other than just brain regions (e.g. cell atlases). There is also a lot of publicly available data that is registered to an atlas (e.g. tracing data, gene expression). These data are as much an "atlas" as the brain region ones. I propose adding additional "elements" to cater to this. These elements could either be added to an existing atlas, or a new atlas could be created (without necessarily any annotation image, just the reference image to define the coordinate space). In some cases, this may include duplicating some functionality of morphapi. There are a lot of questions here about exactly what data to support and how to standardise it.

Questions

Do we want to support data stored elsewhere?

My gut feeling is that in general BrainGlobe should ensure the validity of all atlases. However, for some atlases (e.g. bigger ones) maybe we want to allow hosting of files elsewhere and maybe mark them with a community tag or similar? This could also simplify the support for lab/project-specific atlases that we may not want to become a "proper" BG atlas.

Should we support lazy loading for large atlases?

Some atlases are becoming very large (e.g. EM). We don't want to re-package these ourselves, and we definitely don't want to download them locally. We could support N APIs for lazy loading to support these types of atlases. I assume these atlases will only become more common, but we may not want to support them at all.

vigji commented 1 year ago

After 2.5 years, and based on conversations with various users and atlas creators, I think it's time for bg-atlasapi version 2. Version 1 works very well for "classical" anatomical atlases (i.e. one reference image and one annotation), but it doesn't cater well (or at all) for:

  • Atlases with multiple reference images (e.g. kim_dev_mouse and mpin_zfish)
  • Atlases that exist at multiple resolutions (the atlas API doesn't link these in any way)
  • Atlases with the same annotations in different coordinate spaces (e.g. allen_mouse and perens_lsfm_mouse)
  • Atlases that aren't represented by brain regions (e.g. cell atlases)
  • Atlases (or coordinate spaces) that incorporate other data (e.g. tracing, gene expression)
  • Atlases being updated (as discussed above)

The atlas generation process also needs streamlining.

This would be great! I completely agree, the monolithic atlas idea can only produce growing problems in the future; we do need more flexibility.

My idea for V2:

Move away from the monolithic atlas structure

The atlas could be defined by a config file, specifying atlas "elements" (an element being reference image, set of meshes etc):

 name: "allen_mouse"
 atlas_link: "http://www.brain-map.org"
 version: 2.0
...
...
reference_images: 
 # could be >1
  STP:  some_url
annotation_images: 
 # could be >1
  CCFv3:  some_url
structures: some_url
meshes: some_url

When an atlas is downloaded, the API would check which of these files already exist locally, and then download only those required. The idea is that there would be a lot of overlap between atlases (same meshes at different resolutions, same reference image for multiple annotations, etc.), so this would reduce download times and save disk space.

I like this concept overall. If I get it correctly, the idea would be to have objects describing all parts of the atlas, where each can have independent versioning (as per the later point), storage location, and download time. The only difficulty is the risk of overdoing it; we'd have to find the sweet spot in the simplicity/flexibility tradeoff. I still think minimal overhead over the data is a major strength of bg_atlasapi.

This would also allow data to be stored somewhere other than GIN. I'm not sure whether we want to do this, but it may be necessary for e.g. larger atlases (see below).

I think this is feasible, given some requirements on availability (e.g., the host has to mint a valid DOI) and some validation tools (see below). Hosting-wise, I do not like GIN very much given its current limitations (e.g., the suboptimal zip file download), and I don't think there's any major ongoing development effort. I feel that DVC could be something very interesting to consider here and for the versioning.

Improve versioning

Essentially as per #96. We could version the elements individually, and a versioned atlas could specify these, e.g.:

 name: "allen_mouse"
 atlas_link: "http://www.brain-map.org"
 version: 1.0
...
...
reference_images: 
  STP:  some_url@v1.0.0
annotation_images: 
  CCFv3:  some_url@1.0.0
structures: some_url@1.0.0
meshes: some_url@1.0.0

I agree, this would give a lot of room for optimisation.

Improve the atlas generation process

I think the PR to bg-atlasgen has worked ok, but the repo itself needs a lot of refactoring to improve it. Submitting a new atlas could become more complex though, if the user is supposed to select which pre-existing atlas elements can be re-used. We end up spending a lot of time on these pull requests, so maybe we could:

  • Provide a form (possibly an issue template) that asks for all the relevant information.
  • Develop tooling to go from form to config file.
  • Submit the PR ourselves?
  • Develop tooling to rigorously check a new atlas against BrainGlobe tools (especially all parts of the API).

Agree, a form could be a great starting point and can happen even before the rest of the points just to get in touch!

Introduce "relationships" between atlases

Lots of atlases are related in some way, e.g.:

  • They are the same atlas at different resolutions
  • They are the same annotations with different reference images
  • They are a version of another atlas in a different coordinate space

It would be useful to introduce two concepts:

  • Grouping (easy) - make it obvious in e.g. the CLI and website that different atlases are related
  • Transforms (hard) - store transforms from one atlas to another in a standardised form, and provide methods to transform between them. For the different resolutions, this is possible, but not all the other atlases have a transform between them. We would need transforms for various types of data (e.g. images & objects). This may end up being for V3.

This would be nice. The good thing is that while grouping would pertain to the atlas semantics, the transformation would not, so one could develop it totally independently from the new atlas structure just as an additional layer/tool.

Include additional data

There are different types of atlas other than just brain regions (e.g. cell atlases). There is also a lot of publicly available data that is registered to an atlas (e.g. tracing data, gene expression). These data are as much an "atlas" as the brain region ones. I propose adding additional "elements" to cater to this. These elements could either be added to an existing atlas, or a new atlas could be created (without necessarily any annotation image, just the reference image to define the coordinate space). In some cases, this may include duplicating some functionality of morphapi. There are a lot of questions here about exactly what data to support and how to standardise it.

Although nice in principle, this would stretch the normative effort a bit too much imo. It would be better to give people clear ways to autonomously distribute data in a BrainGlobe-compatible way without deciding too many constraints a priori.

Questions

Do we want to support data stored elsewhere?

My gut feeling is that in general BrainGlobe should ensure the validity of all atlases. However, for some atlases (e.g. bigger ones) maybe we want to allow hosting of files elsewhere and maybe mark them with a community tag or similar? This could also simplify the support for lab/project-specific atlases that we may not want to become a "proper" BG atlas.

I think that this could be possible to do, while keeping very stringent criteria on what you said - ensuring validity. As per my point above, this would require: 1) DOIs to guarantee accessibility, and 2) tools for runtime (or first-download-time) validation, with solid fallback options if the validation of a new version fails (we do not want the API blamed for atlas developer inconsistencies, which can happen).

Should we support lazy loading for large atlases?

Some atlases are becoming very large (e.g. EM). We don't want to re-package these ourselves, and we definitely don't want to download them locally. We could support N APIs for lazy loading to support these types of atlases. I assume these atlases will only become more common, but we may not want to support them at all.

I don't know how much of a priority this is; you have a better idea of the current situation in the community. If it is, maybe it is conceivable to allow people to specify a different data backend for an atlas, as long as it provides a numpy-like interface to fetch the data and we don't have to work out its details. I guess this is what you meant by supporting N APIs, right?

adamltyson commented 1 year ago

@vigji thanks for your feedback!

I like this concept overall. If I get it correctly, the idea would be to have objects describing all parts of the atlas, where each can have independent versioning (as per the later point), storage location, and download time. The only difficulty is the risk of overdoing it; we'd have to find the sweet spot in the simplicity/flexibility tradeoff. I still think minimal overhead over the data is a major strength of bg_atlasapi.

Yep. My idea was to split up the atlas into groups of files that are likely to be edited as one, namely reference images, annotation images, structures, and meshes (the elements in the config above).

As an example (mostly just because I'd already made the figure), two similar atlases (allen_mouse and kim_mouse) would share a reference image, but the other files would be unique. The diagram shows the files being defined by a URL, but I think they would have a unique identifier (e.g. allenccf_stpreference_10um@v1.3); the URLs would be stored elsewhere, and the API would check locally first before downloading (as is done at the level of the atlas currently).

[figure: distributed_atlases]

I think this is feasible, given some requirements on availability (e.g., the host has to mint a valid DOI) and some validation tools (see below). Hosting-wise, I do not like GIN very much given its current limitations (e.g., the suboptimal zip file download), and I don't think there's any major ongoing development effort. I feel that DVC could be something very interesting to consider here and for the versioning.

Yes I think we should probably "outsource" some of the versioning to the many tools that already do it so well with a git-like system. DataLad is an option too.

This would be nice. The good thing is that while grouping would pertain to the atlas semantics, the transformation would not, so one could develop it totally independently from the new atlas structure just as an additional layer/tool.

:+1:

Although nice in principle, this would stretch the normative effort a bit too much imo. It would be better to give people clear ways to autonomously distribute data in a BrainGlobe-compatible way without deciding too many constraints a priori.

This is definitely low priority and may not see the light of day. I like the idea of expanding the idea of atlases though. To me (as an example) connectivity between regions is as much of an "atlas" as labels associated with voxels.

I don't know how much of a priority this is; you have a better idea of the current situation in the community. If it is, maybe it is conceivable to allow people to specify a different data backend for an atlas, as long as it provides a numpy-like interface to fetch the data and we don't have to work out its details. I guess this is what you meant by supporting N APIs, right?

Yep, for example OME-Zarr looks like it could be a common interface for this type of thing. The question would be how to deliver an atlas in a chunked way. Again, low priority.
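
One way to read the "numpy-like interface" idea is as a structural protocol that any chunked backend (such as an OME-Zarr array) would already satisfy; this is a sketch under that assumption, with hypothetical names:

    from typing import Any, Protocol

    import numpy as np

    class LazyVolume(Protocol):
        """Anything with numpy-style indexing and a shape could serve
        as a lazily-loaded atlas image (e.g. an OME-Zarr array)."""

        shape: tuple[int, ...]

        def __getitem__(self, key: Any) -> np.ndarray: ...

    def read_plane(volume: LazyVolume, z: int) -> np.ndarray:
        """Fetch a single z-plane; a chunked backend retrieves only the
        chunks overlapping this plane, never the full volume."""
        return volume[z]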

alessandrofelder commented 1 year ago

In general this looks like a good start.

Some initial impressions (mostly questions) from someone with minimal experience working with atlases (so far):

  • are we relying on the filename as the unique identifier for the local checking? This seems potentially brittle - a user misinterpreting an error message related to a filename may well rename a file on their local filesystem and end up in a weird state?
  • if (e.g.) an annotation image only gets changed from one version to another, would we bump the version only on the annotation, or on everything related?
  • as already discussed this seems tricky at first glance. Can't tell a priori how one would write some validation tests (to run on PR?) that could give us the guarantees we're after? (and actually, what guarantees are we after? Are these the thing that is hard to define/find the right level of stringency?)
  • Are there other examples of curated atlas collections somewhere that we could draw from?

adamltyson commented 1 year ago

are we relying on the filename as the unique identifier for the local checking? This seems potentially brittle - a user misinterpreting an error message related to a filename may well rename a file on their local filesystem and end up in a weird state?

We do currently, but I'm looking for a better solution for v2 if possible. TBH I don't think the current solution has caused any problems (yet), but it could be improved.
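
One possible improvement over filename-based identity - an illustrative sketch, not the current implementation - would be to verify cached elements against checksums recorded in the atlas config:

    import hashlib
    from pathlib import Path

    def element_is_valid(path: Path, expected_sha256: str) -> bool:
        """Check a cached file against its expected SHA-256 digest, so
        renamed or corrupted files are detected and re-downloaded."""
        if not path.exists():
            return False
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        return digest == expected_sha256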

if (e.g.) an annotation image only gets changed from one version to another, would we bump the version only on the annotation, or on everything related?

My idea is that if any constituent component of an atlas (e.g. annotation) changes, then that would trigger the generation of a new version of the atlas as a whole. However, if we use semantic versioning, it could be a patch/minor/major change.

as already discussed this seems tricky at first glance. Can't tell a priori how one would write some validation tests (to run on PR?) that could give us the guarantees we're after? (and actually, what guarantees are we after? Are these the thing that is hard to define/find the right level of stringency?)

I think there are a lot of things we won't be able to test in CI, as it's going to take too long. We could have a stand-alone "BrainGlobe Atlas Validator" tool though. I think this is going to be one of those things where the bulk of the value would come from testing a small number of things (are the URLs valid, is the data the right size/shape, are there the same number of meshes as brain regions, etc.).
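
The kind of cheap checks such a validator might run could look like this (an illustrative sketch; all names are assumptions):

    import numpy as np

    def validate_atlas(reference: np.ndarray, annotation: np.ndarray,
                       structure_ids: set[int], mesh_ids: set[int]) -> list[str]:
        """Return a list of human-readable problems (empty means all
        checks passed)."""
        problems = []
        if reference.shape != annotation.shape:
            problems.append(
                f"reference shape {reference.shape} != "
                f"annotation shape {annotation.shape}"
            )
        missing = structure_ids - mesh_ids
        if missing:
            problems.append(f"structures without meshes: {sorted(missing)}")
        return problems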

Are there other examples of curated atlas collections somewhere that we could draw from?

There are a few with different levels of "curation". Some initiatives I know of:

adamltyson commented 1 year ago

Some ideas of where to start (e.g. what could be v1.5)

adamltyson commented 2 months ago

Resurrecting this. Even in the last year, the complexity of available atlases has increased considerably. There are multiple:

So while I think the general principle (store files separately & define via config) is a good one, we need to decide what an atlas actually is. Certainly there should be some merging of what we currently refer to as atlases (e.g. resolution should be a parameter of a single atlas). Is an atlas a coordinate space, an annotation, etc.? Do we stick with the atlases as published/released, or make our own "mega atlas" that combines data in the same space from multiple sources? The latter seems difficult for the community to adopt/report.

adamltyson commented 1 month ago

Following discussion with @PolarBean, @aeidi89, @alessandrofelder, @IgorTatarnikov, @niksirbi and others, it seems as if the most promising way forward is to basically give up on the idea of defining "an atlas" in the rigid way we have been trying to do. Instead I propose we define an atlas as something like:

"The collection of reference neuroanatomical data and metadata used in a specific analysis workflow".

This means that an atlas is the collection of files used by the researcher, and could be unique to their study. In this comment I will summarise my understanding of the best way forward:

General concept

The way I imagine users accessing the data they need is essentially a decision tree, starting with species:

I'm not sure of the order of these necessarily, and of course only the relevant "decisions" should be presented to the user:

The API should support accessing these files individually (e.g. querying for a specific annotation image) and together (defining an atlas and getting back a BrainGlobe atlas object). The second would allow backwards compatibility.
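
A minimal sketch of what "define an atlas yourself, get an atlas object back" could look like (all names, fields and defaults here are hypothetical):

    from dataclasses import dataclass

    @dataclass
    class AtlasDefinition:
        """A user-chosen collection of individually-hosted elements."""

        species: str
        reference_image: str                 # e.g. "allenccf_stpreference_10um@v1.3"
        annotation_image: str | None = None  # None: a bare coordinate space
        resolution_um: float = 25.0

    def get_atlas(definition: AtlasDefinition) -> dict:
        """Resolve each element (downloading only what is missing) and
        assemble them into a single atlas-like object, preserving the
        current one-call user experience."""
        return {
            "reference": definition.reference_image,
            "annotation": definition.annotation_image,
            "resolution_um": definition.resolution_um,
        }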

Hosting

To enable this "mix and match" approach, as outlined above, data should not be packaged up into an "atlas", but be hosted separately. We would need to decide how to store the meshes, because we probably don't want to store them all individually (there will be hundreds of thousands eventually). However, there is a lot of overlap between the meshes for multiple annotation sets. Maybe we could store them in "batches" of ~10 meshes, as sketched below? This would make the API more complicated, however.
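
The batching itself could be as simple as the following sketch (the batch size and function name are assumptions):

    def batch_meshes(mesh_ids: list[str], batch_size: int = 10) -> list[list[str]]:
        """Group sorted mesh identifiers into fixed-size batches, so
        meshes are packaged and downloaded ~10 at a time rather than as
        hundreds of thousands of individual files."""
        ids = sorted(mesh_ids)
        return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]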

Metadata

Unlike the proposed solution above where an atlas is defined by a config file, users would define what an atlas is for themselves. However, would we still want to define for the user some pre-set "standard" combinations (e.g. the Allen STPT atlas or the Waxholm MRI)?

We will need metadata to define the atlas components (as defined above, but using openMINDS_SANDS; see https://github.com/brainglobe/brainglobe-atlasapi/issues/356).

We will also need metadata to define the "tree" above, i.e. which reference images are in which coordinate space.

Versioning

As above, every element should have its own version.

Credit/reproducibility

To ensure appropriate credit for those who create these resources and to ensure reproducibility, the API should enable user-facing tools to create:

Coordinate spaces

Currently in BrainGlobe, we have different resolutions of atlases. I think this should be preserved for registration, but all results should be defined in the coordinate space in physical units, not voxels. I.e. data registered to two resolutions of the same reference image should produce (approximately) the same result.
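
As a worked example of the physical-units principle (illustrative only): the same physical point has different voxel indices at 25 um and 10 um, but an identical coordinate in micrometres:

    import numpy as np

    def voxel_to_physical(voxel_idx, resolution_um):
        """Convert voxel indices to physical coordinates in micrometres."""
        return np.asarray(voxel_idx) * resolution_um

    print(voxel_to_physical([100, 200, 50], 25.0))   # [2500. 5000. 1250.]
    print(voxel_to_physical([250, 500, 125], 10.0))  # same point: [2500. 5000. 1250.]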

Mapping between coordinate spaces

A logical conclusion of this approach is to allow data to be moved between coordinate systems. I propose this should be shelved until version 3.

Hosting

It should be possible to host the data in multiple places (i.e. mirrors) to optimise performance and reduce downtime (cc @dbirman).

I'm sure I've overlooked many elements, so anyone feel free to chime in.

dbirman commented 1 month ago

+1 for coordinate spaces in physical units, that's a known big source of confusion/frustration!

imagesc-bot commented 1 week ago

This issue has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/mri-template-for-ccfv3-space/101219/12