broadinstitute / CellBender

CellBender is a software package for eliminating technical artifacts from high-throughput single-cell RNA sequencing (scRNA-seq) data.
https://cellbender.rtfd.io
BSD 3-Clause "New" or "Revised" License

Create a multiqc module #257

Open grst opened 1 year ago

grst commented 1 year ago

It would be great to have a multiqc module for cellbender that aggregates plots and metrics from individual cellbender runs into a single HTML report.

This would be particularly useful when processing many samples: instead of looking at each HTML report individually, one could quickly spot the runs where cellbender didn't converge or failed in some other way.

sjfleming commented 1 year ago

This is a really nice idea...

I would be open to community contributions on this! :)

You might be able to build such a report from the individual metrics.csv files for each sample. That is at least what I intended those metrics files to be used for. If they lack information that would be important for such a report, it might be a good idea to add it to the metrics.csv.
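
As a rough sketch of that idea, the per-sample files could be gathered into one table like this. It assumes each metrics.csv holds headerless `name,value` rows and that every sample lives in its own directory; neither assumption is confirmed in this thread, and the function names are made up for illustration.

```python
import csv
from pathlib import Path


def read_metrics(path):
    """Read one metrics.csv into a dict (assumed layout: headerless name,value rows)."""
    metrics = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if len(row) < 2:
                continue  # skip blank or malformed rows
            metrics[row[0]] = row[1]
    return metrics


def collect_metrics(root, pattern="*metrics.csv"):
    """Map each sample (assumed: the parent directory name) to its metrics dict."""
    return {
        path.parent.name: read_metrics(path)
        for path in sorted(Path(root).rglob(pattern))
    }


def write_summary(per_sample, out_path):
    """Write one row per sample and one column per metric name."""
    columns = sorted({name for m in per_sample.values() for name in m})
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sample", *columns])
        for sample, metrics in per_sample.items():
            # Metrics missing for a sample are left as empty cells.
            writer.writerow([sample, *(metrics.get(c, "") for c in columns)])
```

Something like `write_summary(collect_metrics("runs/"), "cellbender_summary.csv")` would then give a single file to skim, where `runs/` is a placeholder path. Values are kept as strings so nothing is lost if a metric is not numeric.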

grst commented 1 year ago

Hi @sjfleming,

I'm not promising anything, but if it turns out that we end up using cellbender as a routine step in our single-cell processing pipeline, I might be able to justify spending time on this.

In any case, the metrics.csv is a good start, but what I'd really like to have in the reports as well is the training plot for each sample, and potentially a simplified version of the cell probability plot.

I don't think this information is currently made available in any machine-readable format. MultiQC reports usually also contain an overview of QC warnings/failures per sample. It would be nice not to have to parse them from the HTML report.

I'm not sure what would be the natural way to include them in a metrics.csv... Maybe a metrics.json or metrics.yml would be more natural?

{
  "total_raw_counts": 10121979,
  [...],
  "training_progress": {
    "train": {
      "x": [0, 10, 15, 27, ...],
      "y": [...]
    },
    "test": { ... }
  },
  "cell_probability": {
    "x": [...],
    "y": [...]
  },
  "warnings": [
    {
      "id": "learning_curve_didnt_converge",
      "description": "The learning curve didn't converge... Please check ..."
    }
  ]
}

sjfleming commented 1 year ago

I would certainly be open to the idea of a metrics.json
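
To make the proposal a bit more concrete, here is a minimal sketch of how a downstream tool might read such files. The metrics.json file does not exist yet, so the key names (`warnings`, `id`, `training_progress`, `y`) simply follow the example above, and the function names and one-directory-per-sample layout are assumptions for illustration.

```python
import json
from pathlib import Path


def load_report(path):
    """Load one metrics.json file (hypothetical format from the proposal above)."""
    with open(path) as handle:
        return json.load(handle)


def warnings_by_sample(root, pattern="*metrics.json"):
    """Return {sample: [warning ids]} for samples reporting at least one warning."""
    flagged = {}
    for path in sorted(Path(root).rglob(pattern)):
        report = load_report(path)
        ids = [w["id"] for w in report.get("warnings", [])]
        if ids:
            # Assumption: the parent directory name identifies the sample.
            flagged[path.parent.name] = ids
    return flagged


def final_losses(report):
    """Return the last y value of each training curve, or None if a curve is empty."""
    progress = report.get("training_progress", {})
    return {
        split: (curve["y"][-1] if curve.get("y") else None)
        for split, curve in progress.items()
    }
```

With this, spotting the runs that need a closer look would not require parsing any HTML: `warnings_by_sample("runs/")` (placeholder path) lists only the samples that reported a warning.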