BlueBrain / morph-tool

https://morph-tool.readthedocs.org/
GNU Lesser General Public License v3.0

Add an I/O neurondb API #25

Closed wizmer closed 3 years ago

wizmer commented 4 years ago

It introduces two classes:

MorphInfo

A class containing all the information that can be found in the neurondb regarding a single morphology.


info = MorphInfo(name='name', mtype='mtype', layer='layer')
info.use_dendrite = False
info.axon_inputs = ['morph1', 'morph2']

MorphDB

A class containing information about a collection of morphologies. The class has a single attribute self.df to expose this information as a pandas DataFrame.

Constructors:

MorphDB()
MorphDB.from_neurondb()
MorphDB.from_folder()
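For illustration, the data model these two classes expose can be sketched with dataclasses and pandas. This is only a sketch of the behaviour described in this PR, not the actual morph-tool implementation; the field names follow the examples above:

```python
from dataclasses import dataclass, field

import pandas as pd


@dataclass
class MorphInfo:
    """Sketch of one neurondb entry (fields taken from the examples above)."""
    name: str
    mtype: str
    layer: str
    label: str = None
    use_dendrite: bool = True
    axon_inputs: list = field(default_factory=list)


class MorphDB:
    """Sketch: a collection of MorphInfo, exposed as a pandas DataFrame."""

    def __init__(self, morphs=()):
        self.df = pd.DataFrame([vars(m) for m in morphs])

    def __add__(self, other):
        # combining two DBs concatenates their dataframes
        result = MorphDB()
        result.df = pd.concat([self.df, other.df], ignore_index=True)
        return result
```

The real API adds the `from_neurondb` / `from_folder` constructors and a `write` method on top of this data model.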

Adding new data

Data can be added through the + and += operators.

Example:

all_morphs = MorphDB.from_neurondb('/gpfs/.../unravelled', label='unravelled')
all_morphs += MorphDB.from_neurondb('/gpfs/.../repaired', label='repaired')

more_morphs = MorphDB([MorphInfo(name='name', mtype='mtype', layer='layer', label='extra-morphs'),
                       MorphInfo(name='name', mtype='mtype', layer='layer', label='extra-morphs'),
                       MorphInfo(name='name', mtype='mtype', layer='layer', label='extra-morphs'),
                       MorphInfo(name='name', mtype='mtype', layer='layer', label='extra-morphs')])
morph2 = all_morphs + more_morphs

Writing neurondb to disk

The data can be written to disk in both the XML and DAT formats.

MorphDB.from_neurondb('path1').write('neurondb.dat')
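The DAT flavour of neurondb is conventionally a small whitespace-separated table of name / layer / mtype. The round trip can be sketched with plain pandas (the column layout and file name here are assumptions for illustration, not morph-tool's code):

```python
import pandas as pd

# toy neurondb content: one row per morphology
df = pd.DataFrame(
    [("morph1", 1, "L1_DAC"), ("morph2", 2, "L2_TPC")],
    columns=["name", "layer", "mtype"],
)

# a neurondb.dat has no header and is whitespace-separated
df.to_csv("neurondb.dat", sep=" ", header=False, index=False)

# reading it back mirrors what from_neurondb does with pd.read_csv
back = pd.read_csv("neurondb.dat", sep=" ", names=["name", "layer", "mtype"])
```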

Analysing morph-stats features

NeuroM morphometrics can be extracted with self.features(config) where config is a morph-stats configuration.

Labels can be used to compare morphometrics across different datasets.

db = MorphDB.from_neurondb('path1', label='dataset-1')
db += MorphDB.from_neurondb('path2', label='dataset-2')
db += MorphDB.from_neurondb('path3', label='dataset-3')
db += MorphDB.from_neurondb('path4', label='dataset-4')

features = db.features(config)

for dataset, df in features.groupby(('neuron', 'label')):
    print(dataset, df)
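The ('neuron', 'label') key in the loop above implies that the features dataframe has multi-level columns. The grouping pattern can be reproduced on a toy dataframe (the data and the 'max_radial_distance' column name are made up for illustration):

```python
import pandas as pd

# toy features table with MultiIndex columns, as returned by a
# morph-stats-style extraction
features = pd.DataFrame(
    {
        ("neuron", "label"): ["dataset-1", "dataset-1", "dataset-2"],
        ("neuron", "max_radial_distance"): [10.0, 12.5, 9.1],
    }
)

# one sub-dataframe per label, as in the loop above
for dataset, df in features.groupby(("neuron", "label")):
    print(dataset, len(df))
```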
arnaudon commented 3 years ago

It looks really good, thanks @wizmer! Now I'm wondering what is the best way to create a neurondb from scratch with this API? I will try to play with it in the context of MCAR, meaning I am given a bunch of morphologies with some metadata, and I need to create a morph release, following curation, annotation, and repair steps, using all our codes. So I would need to create/edit the MorphDB along the way, to end up with the final one at some point. We also need to think about how to easily allow for other means of saving the DB, with Nexus for example.

adrien-berchet commented 3 years ago

Did you test it on folders with many morphologies? I am wondering how it scales for millions of morphologies.

arnaudon commented 3 years ago

I was thinking about that too, @adrien-berchet, but I think we should use this API for small numbers of morphs, and use other tools for large numbers. This will be used for morphology releases from biological reconstructions, and no lab can reconstruct 9M morphs yet, so we are fine. For full synthesis of 9M cells, we won't really need this API, but rather the SONATA format in h5, or other ones.

adrien-berchet commented 3 years ago

Ah ok. If we are sure we will not need it for millions of cells it's cool.

wizmer commented 3 years ago

Did you test it on folders with many morphologies? I am wondering how it scales for millions of morphologies.

No I did not, but I don't think there should be any bottleneck. Reading from a DAT file is simply a call to pd.read_csv, and the other expensive part is finding the paths. But listing the content of the morphology folder cannot really be avoided, so I don't see much room for improvement.

Finally, I did not add the extension back because it should no longer affect performance. I have changed the way they are looked up.

adrien-berchet commented 3 years ago

No I did not, but I don't think there should be any bottleneck. Reading from a DAT file is simply a call to pd.read_csv, and the other expensive part is finding the paths. But listing the content of the morphology folder cannot really be avoided, so I don't see much room for improvement.

I was more concerned about the XML parsing. But maybe this is only used for biological data, so small data sets. But we use it for the clones, right? Those are quite big data sets. Though it should no longer be used soon.

wizmer commented 3 years ago

In the morphology repair workflow we are already using a similar parser, but it's true that we've never used it above a few hundred thousand cells. We can always improve the performance in the future if we see that it's not enough.

adrien-berchet commented 3 years ago

👍

arnaudon commented 3 years ago

So with from_folder, we can convert any other format of morphology DB to this one, right? For example, if I just have a list of morphs in a .dat, I can create one, provided I also give it a list of mtypes. This is good; it makes it very flexible on the user end.

wizmer commented 3 years ago

Yes, or you can also use the constructor:

db = MorphDB(MorphInfo(*data) for data in my_data_seq)
arnaudon commented 3 years ago

Ah yes, sure! Excellent! On my end, I think it's good to use; I've been trying it here and there. I'll be keen to merge it, so I can do some massive cleanup in our codes ;)

wizmer commented 3 years ago

Thank you !

arnaudon commented 3 years ago

🍾 !!!

wizmer commented 3 years ago

https://morph-tool.readthedocs.io/en/stable/morphdb.html