Reference data set for automated build testing

SciLifeLab / NGI-CellRaiser

Pipeline for analysing 10X single cell data

MIT License


Reference data set for automated build testing #9

Open sven1103 opened 6 years ago

sven1103 commented 6 years ago

Hey guys,

maybe it is too early for that, but I was thinking about which reference data sets to use for pipeline evaluation and automated build testing.

We can use this thread to collect ideas :)

subwaystation commented 6 years ago

I think at some point we will want to compare our results with the output of the CellRanger pipeline, right? Would it make sense to include some datasets from 10X (https://support.10xgenomics.com/single-cell-gene-expression/datasets)? That would give us a first direct comparison. But we should include other datasets as well.

wikiselev commented 6 years ago

I was actually wondering whether we go with CellRanger or do it in a different way?

ewels commented 6 years ago

It's never too early! I've found it's super useful to have these for the early development work. If possible it's best to find something from yeast / an organism with a small reference genome, to keep the file size small. Otherwise we'll need to mess around subsampling the data to a single chromosome or something to make the tests run quickly (possible, but a faff).
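To illustrate the subsampling option, here is a minimal sketch that keeps a random fraction of read pairs from two FASTQ inputs while keeping R1 and R2 in sync. It only shows the idea: the function names `read_fastq` and `subsample_pairs` are made up for this example, it assumes plain four-line FASTQ records, and it samples reads at random rather than restricting them to a single chromosome (which would need alignments).

```python
import random
from itertools import islice


def read_fastq(lines):
    """Yield FASTQ records as 4-tuples (header, sequence, plus, quality)."""
    it = iter(lines)
    while True:
        record = tuple(line.rstrip("\n") for line in islice(it, 4))
        if not record:
            return  # clean end of input
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def subsample_pairs(r1_lines, r2_lines, fraction, seed=0):
    """Keep roughly `fraction` of read pairs; R1 and R2 stay paired.

    Hypothetical helper for building a small test data set. A fixed
    seed makes the selection reproducible between test runs.
    """
    rng = random.Random(seed)
    kept = []
    for rec1, rec2 in zip(read_fastq(r1_lines), read_fastq(r2_lines)):
        # One random draw per pair, so both mates are kept or dropped together.
        if rng.random() < fraction:
            kept.append((rec1, rec2))
    return kept
```

Both inputs can be open file handles or lists of lines. Because the result is held in memory, this is only meant for producing small test sets.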

ewels commented 6 years ago

@wikiselev - as to which tools to use, probably nice to create a separate issue for that. But also check out ideas.md if you haven't already. I think it was @subwaystation's idea that we'd want to compare output to cellranger, not necessarily run cellranger.

Phil

sven1103 commented 6 years ago

@wikiselev - Currently, we would not just "rebuild" CellRanger. I would rather regard it as a reference pipeline, but we are free to build it differently, depending on what we find out over the next few weeks. I think we should probably schedule a new hangout call for further discussion :)

wikiselev commented 6 years ago

I feel that CellRanger is widely used and in demand among lots of users, so it makes sense to start by rebuilding it. Also, keeping in mind that it's 10X's own solution, I doubt we can do significantly better.

apeltzer commented 6 years ago

I agree - we should start with CellRanger and then improve upon that once we have something working reasonably well.

sven1103 commented 6 years ago

@wikiselev I mean, don't get me wrong, CellRanger might be a good customised solution. Imho the first goal would be to put it in a scientific workflow framework, including stable environments for the tools, with Singularity as the container solution. That would give the community the possibility to easily install and run it on any cluster, and to have it reproducible.

Modularity of the tools should enhance the possibility to customize the pipeline (e.g. different mapper, etc). Moreover, this would be a good basis for future benchmarks of the pipeline.

I do not completely agree on the performance point. Take the duplicate removal step, for example: I would really like to see the performance differences between different tools here, as this is a crucial step :)
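For context, a minimal sketch of what duplicate removal means for UMI-tagged reads: reads sharing the same cell barcode, UMI and gene are collapsed into one molecule. This is the simplest exact-match variant, written only for illustration; it is not how CellRanger or any specific tool does it, it ignores sequencing errors in the UMI, and the name `count_unique_umis` and the input layout are assumptions.

```python
from collections import Counter


def count_unique_umis(reads):
    """Count distinct UMIs per (cell barcode, gene).

    `reads` is an iterable of (cell_barcode, umi, gene) tuples, a
    hypothetical layout for this sketch. Reads with identical triples
    are treated as PCR duplicates and counted once.
    """
    unique_molecules = set(reads)  # exact-match duplicate removal
    return Counter((cb, gene) for cb, _umi, gene in unique_molecules)
```

Tools differ mainly in how they handle near-identical UMIs and multi-mapped reads, which is where a comparison would be interesting.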