emanuega / MERlin

MERlin is an extensible analysis pipeline applied to decoding MERFISH data
MIT License
35 stars 29 forks

Adding in clustering #20

Open seichhorn opened 5 years ago

seichhorn commented 5 years ago

I was gearing up to update the clustering code that I and others use, to make it 1) compatible with the most recent stable release of scanpy (instead of the development version I was originally wrapping) and 2) integrated into MERlin. On reflection, though, it doesn't fit into the MERlin framework as cleanly as I had expected, largely because the clustering will often be performed on several datasets at once.

I see two clean options. The first is to make a "metaanalysis" class in MERlin that takes in many datasets and performs analysis tasks on the aggregated data; clustering would be one such analysis. The second is to not integrate the clustering code at all and instead make it easy to port the data from MERlin to the clustering code. Do you have any thoughts on this?

I feel like the metaanalysis class would only be worth it if we were going to use it for more than just clustering. A third possibility is to let the clustering be a normal analysis task tied to one particular dataset, with the user passing in the additional datasets via parameters. I suspect you wouldn't like that, and I don't really favor it either.

emanuega commented 5 years ago

I anticipate that all MERFISH experiments will have to take into account more than a single measurement, so the ability to perform metaanalysis on multiple datasets is useful to incorporate into MERlin.

I do prefer the first option, but rather than a metaanalysis class, I would prefer a metadataset class (such as MetaMERFISHDataSet) that organizes multiple datasets and saves all the analysis performed on the metadataset into the corresponding metadataset directory. The metadataset can still be a subclass of the dataset class, but instead of taking its name from the raw data folder, it will need a name to be specified explicitly. With this, very little would have to change in the analysis task structure, and the tasks should be executable in nearly the same way that analysis tasks are currently run on a MERFISHDataSet, given an appropriate analysis parameters file. The tasks that run on a MetaMERFISHDataSet could be placed in merlin.metaanalysis instead of merlin.analysis to help distinguish them from the analysis tasks that run on a MERFISHDataSet, and we could add a check within each analysis task to make sure the right kind of DataSet is passed. The CLI will have to be updated so that MetaMERFISHDataSets can be created, since currently only one dataset can be specified.

Perhaps each dataset within the metadataset can also include a label indicating the group the dataset belongs to, such as different experimental conditions, so that the analysis tasks that operate on the metadataset can analyze gene expression differences between the different conditions.
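
A minimal sketch of that layout may help make the proposal concrete. It is not MERlin code: the classes below are simplified stand-ins, and the constructor arguments, the `members` list of (dataset, group label) pairs, the `groups()` helper, and the `MetaAnalysisTask` name are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Dict, List, Tuple


class DataSet:
    """Stand-in base class; not MERlin's actual DataSet."""

    def __init__(self, name: str, analysis_home: Path):
        self.name = name
        # All analysis output for a dataset lives under its own directory.
        self.analysis_path = Path(analysis_home) / name


class MERFISHDataSet(DataSet):
    """Stand-in for a single experiment, named after its raw data folder."""

    def __init__(self, raw_data_folder: str, analysis_home: Path):
        super().__init__(Path(raw_data_folder).name, analysis_home)


class MetaMERFISHDataSet(DataSet):
    """Groups several datasets under an explicitly chosen name."""

    def __init__(self, name: str, analysis_home: Path,
                 members: List[Tuple[MERFISHDataSet, str]]):
        super().__init__(name, analysis_home)
        # Each member is (dataset, group label), e.g. an experimental condition.
        self.members = list(members)

    def groups(self) -> Dict[str, List[str]]:
        """Map each group label to the names of the datasets carrying it."""
        grouped: Dict[str, List[str]] = {}
        for dataset, label in self.members:
            grouped.setdefault(label, []).append(dataset.name)
        return grouped


class MetaAnalysisTask:
    """Hypothetical task base that checks the kind of DataSet it receives."""

    required_dataset_type = MetaMERFISHDataSet

    def __init__(self, dataset: DataSet):
        if not isinstance(dataset, self.required_dataset_type):
            raise TypeError(
                f"{type(self).__name__} requires a "
                f"{self.required_dataset_type.__name__}, "
                f"got {type(dataset).__name__}")
        self.dataset = dataset
```

Because the metadataset is still a `DataSet`, a task only needs the type check in its constructor to reject a plain MERFISHDataSet; the output directory and naming logic are inherited unchanged.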

seichhorn commented 5 years ago

Yeah, this works for me conceptually. I'll start working on the MetaMERFISHDataSet class now.

seichhorn commented 4 years ago

@emanuega to close out this long-running issue: I spent a while implementing the MetaMERFISHDataSet class and clustering metaanalyses in the mercluster branch, but ultimately felt that these contributions were decreasing the clarity and quality of MERlin because of the extra baggage needed to support them. I decided to leave MERlin alone in this regard and instead reuse the MERlin architecture in a separate project that supports just these types of metaanalysis tasks, which is now in a functional but early stage in my MERCluster repo. If in the end you want to put something like this into MERlin, I'm happy to help. A separate thought is that if we were going to extend MERlin, it might be more generally useful to add some visualization features so people can better interact with the data; this type of analysis tends to be of more general interest to others in the lab.

One missing piece is that I think MERlin would benefit from a final data-aggregation task that merges the exported barcodes, sequential genes, etc. into a single, normalized output file. I started writing that and will open a PR with it for MERlin at some point.
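
As a rough illustration of what such an aggregation step could look like, here is a sketch that merges two per-cell tables into one CSV. The input shapes (dictionaries keyed by cell ID), the function names, and the choice to fill missing measurements with 0 are assumptions for this example only, not MERlin's export format. Any normalization is left out because its definition isn't specified here.

```python
import csv
import io
from typing import Dict, List, Union

Number = Union[int, float]
CellTable = Dict[str, Dict[str, Number]]  # cell ID -> {gene: value}


def merge_cell_tables(barcode_counts: CellTable,
                      sequential_signals: CellTable) -> List[Dict[str, object]]:
    """Combine both tables into one row per cell with a shared gene column set."""
    cells = sorted(set(barcode_counts) | set(sequential_signals))
    genes = sorted({gene
                    for table in (barcode_counts, sequential_signals)
                    for row in table.values()
                    for gene in row})
    merged = []
    for cell in cells:
        combined = {**barcode_counts.get(cell, {}),
                    **sequential_signals.get(cell, {})}
        row: Dict[str, object] = {"cell_id": cell}
        for gene in genes:
            # Assumption: a gene not measured in a cell is reported as 0.
            row[gene] = combined.get(gene, 0)
        merged.append(row)
    return merged


def to_csv(rows: List[Dict[str, object]]) -> str:
    """Serialize merged rows to CSV text with a header line."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```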