Closed: wizmer closed this 3 years ago
It looks really good, thanks @wizmer ! Now I'm wondering what is the best way to create a neurondb from scratch with this API? I will try to play with it in the context of MCAR, meaning I am given a bunch of morphologies with some metadata, and I need to create a morph release following the curation, annotation and repair steps, using all our codes. So I would need to create/edit the MorphDB along the way, to end up with the final one at some point. We also need to think about how to easily allow for other means of saving the DB, with Nexus for example.
Did you test it on folders with many morphologies? I am wondering how it scales for millions of morphologies.
I was thinking about that too, @adrien-berchet , but I think we should use this API for small numbers of morphs, and use other tools for large numbers. This will be used for morphology releases from biological reconstructions, and no lab can reconstruct 9M morphs yet, so we are fine. For full synthesis of 9M cells, we won't really need this API, but rather the SONATA format in h5, or other ones.
Ah ok. If we are sure we will not need it for millions of cells it's cool.
> Did you test it on folders with many morphologies? I am wondering how it scales for millions of morphologies.
No I did not, but I don't think there should be any bottleneck. Reading from a DAT file is simply a call to `pd.read_csv`, and the other expensive thing is finding the paths. But listing the content of the morphology folder cannot really be avoided, so I don't see much room for improvement.
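Both steps can be sketched with plain pandas and pathlib. This is only an illustration of the two operations described above; the column names and DAT layout here are assumptions, not the library's actual schema:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Write a tiny neurondb-style DAT file: whitespace-separated
# "name layer mtype" rows (assumed layout, for illustration only).
tmp = Path(tempfile.mkdtemp())
(tmp / "neurondb.dat").write_text(
    "cell-a 1 L1_DAC\n"
    "cell-b 23 L23_PC\n"
)

# Step 1: reading boils down to a single pd.read_csv call.
df = pd.read_csv(
    tmp / "neurondb.dat",
    sep=r"\s+",
    names=["name", "layer", "mtype"],
    dtype={"layer": str},
)

# Step 2: resolving paths is one listing of the morphology folder.
(tmp / "cell-a.h5").touch()
(tmp / "cell-b.h5").touch()
paths = {p.stem: p for p in tmp.glob("*.h5")}
df["path"] = df["name"].map(paths)
```

Both operations are linear in the number of morphologies, which is why neither should become a bottleneck at the scales discussed here.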
Finally, I did not add the extension back because it should no longer affect performance: I have changed the way extensions are looked up.
> No I did not, but I don't think there should be any bottleneck. Reading from a DAT file is simply a call to `pd.read_csv`, and the other expensive thing is finding the paths. But listing the content of the morphology folder cannot really be avoided, so I don't see much room for improvement.
I was more concerned about XML parsing. But maybe this is only used for biological data, so small data sets. But we use it for the clones, right? Those are quite big data sets. Though it should soon no longer be used.
In the morphology repair workflow we are already using a similar parser, but it's true that we've never used it on more than a few hundred thousand cells. We can always improve the performance in the future if we see that it's not enough.
:+1:
So with `from_folder`, we can convert any other format of morphology db to this one, right? For example, if I just have a list of morphs in a .dat file, I can create one, provided I also give it a list of mtypes.
This is good, it makes it very flexible at the user end.
Yes, or you can also use the constructor:

```python
db = MorphDB(MorphInfo(*data) for data in my_data_seq)
```
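The pattern in that one-liner can be made concrete with stand-in classes. These are simplified sketches written for this example, not the real `MorphInfo`/`MorphDB` implementations:

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class MorphInfo:
    """Stand-in for the real class: one morphology's metadata."""
    name: str
    mtype: str
    layer: str


class MorphDB:
    """Stand-in: builds self.df from an iterable of MorphInfo."""
    def __init__(self, morphs=()):
        self.df = pd.DataFrame([vars(m) for m in morphs])


# Any sequence of raw tuples can be wrapped on the fly,
# exactly as in the one-liner above:
my_data_seq = [("cell-a", "L1_DAC", "1"), ("cell-b", "L23_PC", "23")]
db = MorphDB(MorphInfo(*data) for data in my_data_seq)
```

Because the constructor accepts any iterable, a generator expression lets you build the db lazily from whatever raw records you already have.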
Ah yes, sure! Excellent! On my end, I think it's good to use; I've been trying it here and there. I'll be keen to merge it, so I can do some massive cleanup in our codes ;)
Thank you !
🍾 !!!
It introduces two classes:

- `MorphInfo`: a class containing all the information that can be found in the neurondb regarding a single morphology.
- `MorphDB`: a class containing information regarding collections of morphologies. The class has a single attribute, `self.df`, to expose this information as a Pandas dataframe.

Constructors:
Adding new data
Data can be added through the `+` and `+=` operators.

Example:
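The operator semantics can be sketched with pandas concatenation. This is a simplified stand-in illustrating the idea, not the actual implementation:

```python
import pandas as pd


class MorphDB:
    """Stand-in showing + and += as dataframe concatenation."""

    def __init__(self, df=None):
        self.df = df if df is not None else pd.DataFrame(columns=["name", "mtype"])

    def __add__(self, other):
        # + returns a new MorphDB; the operands are untouched.
        return MorphDB(pd.concat([self.df, other.df], ignore_index=True))

    def __iadd__(self, other):
        # += extends this MorphDB in place.
        self.df = pd.concat([self.df, other.df], ignore_index=True)
        return self


a = MorphDB(pd.DataFrame({"name": ["cell-a"], "mtype": ["L1_DAC"]}))
b = MorphDB(pd.DataFrame({"name": ["cell-b"], "mtype": ["L23_PC"]}))

merged = a + b   # new MorphDB with both rows
a += b           # a now also contains both rows
```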
Writing neurondb to disk.
The data can be written to disk in both the XML and DAT formats.
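Both outputs reduce to standard tools, sketched here with pandas and the stdlib XML writer. The element names and DAT column layout are assumptions made for this example, not the library's actual output schema:

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

import pandas as pd

df = pd.DataFrame({
    "name": ["cell-a", "cell-b"],
    "layer": ["1", "23"],
    "mtype": ["L1_DAC", "L23_PC"],
})
out = Path(tempfile.mkdtemp())

# DAT: whitespace-separated columns, no header.
df.to_csv(out / "neurondb.dat", sep=" ", header=False, index=False)

# XML: one <morphology> element per row (tag names assumed).
root = ET.Element("neurondb")
listing = ET.SubElement(root, "listing")
for row in df.itertuples(index=False):
    morph = ET.SubElement(listing, "morphology")
    ET.SubElement(morph, "name").text = row.name
    ET.SubElement(morph, "layer").text = row.layer
    ET.SubElement(morph, "mtype").text = row.mtype
ET.ElementTree(root).write(out / "neurondb.xml")
```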
Analysing morph-stats features
NeuroM morphometrics can be extracted with `self.features(config)`, where `config` is a morph-stats configuration. Labels can be used to compare morphometrics across different datasets.
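The label-based comparison boils down to an ordinary pandas groupby once the features are in the dataframe. Here is an illustration on a made-up features table; the column names and values are invented for the example, not real morph-stats output:

```python
import pandas as pd

# Hypothetical per-morphology features, tagged with a dataset label.
features = pd.DataFrame({
    "label": ["repaired", "repaired", "raw", "raw"],
    "total_length": [1200.0, 1100.0, 900.0, 950.0],
})

# Comparing morphometrics across datasets is a groupby on the label.
summary = features.groupby("label")["total_length"].mean()
```

Any aggregation (`mean`, `std`, quantiles, ...) can then be compared side by side per label.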