PavlidisLab / Gemma

Genomics data re-analysis
Apache License 2.0

Restore/improve support for non-GEO dataset upload (esp. single-cell) #1284

Open ppavlidis opened 3 weeks ago

ppavlidis commented 3 weeks ago

There are many single-cell datasets that we want to load into Gemma that are not available in GEO. These typically come from individual web sites rather than from a particular repository.

We have some antiquated support for this already, both from the CLI and GUI, but it needs to be revisited and probably updated.

GUI: https://gemma.msl.ubc.ca/expressionExperiment/upload.html (ExpressionDataFileUploadController)
CLI: LoadSimpleExpressionDataCli

These were designed with microarrays in mind, and for data that comes as a single tab-delimited file.
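To make the "single tab-delimited file" case concrete, here is a minimal sketch of reading such a matrix. The layout (first row holds sample names, first column holds probe/gene identifiers, empty cells are missing values) is an assumption for illustration, not a description of what LoadSimpleExpressionDataCli actually accepts, and `read_expression_matrix` is a hypothetical helper.

```python
import csv
import io


def read_expression_matrix(handle):
    """Parse a tab-delimited matrix.

    Assumed layout (hypothetical): header row = sample names,
    first column = probe/gene IDs, empty cell = missing value.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # skip the ID column label
    rows = {}
    for record in reader:
        if not record:
            continue  # tolerate blank lines
        element_id, values = record[0], record[1:]
        if len(values) != len(samples):
            raise ValueError(
                f"{element_id}: expected {len(samples)} values, got {len(values)}"
            )
        rows[element_id] = [float(v) if v != "" else None for v in values]
    return samples, rows


# Tiny made-up example, not real data.
example = "probe\tsampleA\tsampleB\nprobe_1\t7.2\t6.9\nprobe_2\t\t8.1\n"
samples, rows = read_expression_matrix(io.StringIO(example))
```

A dense layout like this suits microarrays, where every probe has a value in every sample. Single-cell matrices have far more columns and are mostly zeros, which is one reason a format other than tsv may make sense.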

Note that we also have methods for loading experimental design information from files (ExperimentalDesignImporter), but these are limited too. We'll need something like this for uploading meta-data on samples.

We'll need to adapt these to facilitate loading of single-cell data.

In general, there are three steps, after which datasets should be able to be processed "as usual".

  1. Definition of the basic data set information (name, description, etc.) - the upload form is not a bad way to do this, but it will need to be updated a little. The upload of the data itself should probably be separated from that step completely.
  2. Loading of data files, and probably supporting some other format besides tsv (we need to see what makes sense). Since we support this already for single-cell, this part should be easy.
  3. Uploading of meta-data on samples if available, to save data entry time.
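The three steps above could be sketched roughly as follows. Everything here is hypothetical and only meant to show how the steps might be kept separate: `DatasetInfo`, `read_sample_metadata`, `check_samples` and the `sample` column name are illustrative assumptions, not existing Gemma classes or file conventions.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class DatasetInfo:
    """Step 1: basic data set information, defined independently of any data file."""
    short_name: str
    name: str
    description: str = ""


def read_sample_metadata(handle, id_column="sample"):
    """Step 3: read a tab-delimited sample sheet into {sample_id: {field: value}}.

    The 'sample' ID column is an assumed convention for this sketch.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row[id_column]: {k: v for k, v in row.items() if k != id_column}
        for row in reader
    }


def check_samples(data_samples, metadata):
    """Compare sample IDs from the data files (step 2) with the metadata (step 3).

    Returns (samples lacking metadata, metadata entries with no data).
    """
    data_ids, meta_ids = set(data_samples), set(metadata)
    return sorted(data_ids - meta_ids), sorted(meta_ids - data_ids)


# Made-up example values.
info = DatasetInfo(short_name="example-sc", name="Example single-cell data set")
data_samples = ["s1", "s2", "s3"]  # as would be reported by the data loader
sheet = "sample\ttissue\tsex\ns1\tcortex\tF\ns2\tcortex\tM\ns4\tliver\tF\n"
metadata = read_sample_metadata(io.StringIO(sheet))
missing_metadata, orphan_metadata = check_samples(data_samples, metadata)
```

Reporting mismatches in both directions, rather than failing on the first one, would let a curator see at once which samples still need manual data entry.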

We'll flesh this out with some particular examples.

arteymix commented 3 weeks ago

This needs to wait until we finish the basic single cell support.

We should have all the necessary software components for this; they just need to be assembled into a CLI tool.