galaxyproject / tools-iuc

Tool Shed repositories maintained by the Intergalactic Utilities Commission
https://galaxyproject.org/iuc
MIT License

scRNA-Seq Workflows #2057

Open mblue9 opened 5 years ago

mblue9 commented 5 years ago

This issue is a place to discuss how to create and collaborate on flexible scRNA-Seq workflows.

I've tried to collate the main points of the discussion that started here and have added updates from the discussion below.

Feel free to edit!

ping @bgruening @mtekman @blankenberg @pcm32 @pinin4fjords

Intermediate format(s)

Structure of tools

Creating wrappers

pinin4fjords commented 5 years ago

Brilliant, thanks @mblue9, good summing up!

pcm32 commented 5 years ago

This is an excellent sum up of our discussion at the other side. I would only go a bit further in the discussion regarding the granularity of the tools.

While I see the usability benefits of having a single tool for, say, filtering, where you can choose the provider of that filtering (seurat, scater, etc.) from a drop-down, I think that we are a bit constrained here by Galaxy's one-tool-one-container assignment. Given that technical restriction, I would rather go for separate filtering tools, all of which should have the same, or mostly similar, input/output interfaces across the different tool providers. That means the Galaxy tool filtering-seurat can use the r-seurat-scripts conda package/container and the Galaxy tool filtering-scater can use the r-scater-scripts conda package/container. Otherwise, it would break our microservices approach, in that we would need a monolithic container holding all the tools together (seurat, scater, etc.), which would make tool maintenance/versioning more complicated.

bgruening commented 5 years ago

Let's not limit ourselves in doing the right thing because of technical obstacles - if this is the case, let's fix the technical limitations :) Btw, this is not a problem. We took care of this by using https://github.com/BioContainers/multi-package-containers. This repo creates a container for every Galaxy tool here in IUC, no matter how many (>1) dependencies it has.

pcm32 commented 5 years ago

Yes, but I would rather avoid the multi package containers, because again it means a monolithic container which makes it heavier to move around, more complicated to maintain, etc.

Unfortunately, the timeframes in our current funding don't allow us to wait for those technical obstacles to be sorted (and properly production-tested) in Galaxy, nor do we have the manpower to do it. We need to have something workable and demo-able soon (we have been working on this part-time for a few months now, we have proposed this solution inside the participating consortium, got OKs, etc.). Changes like this would also mean changes in some Galaxy runners and in destination handling, which sometimes make this assumption (one tool, one container), and then all of these would need to be properly tested, etc.

Once those technical obstacles are sorted, we could certainly merge modules without much issue if the interfaces are adequately compatible, but I wouldn't want to put these efforts on hold now due to that. Hence, I would hope this is a reasonable compromise (functionality-tool wrapper initially) that would allow us all to work together towards the same goal immediately.

Additionally, I see the functionality-tool wrapper to be more easily maintainable (ie, changes for one tool don't risk breaking all the other tools being wrapped), but that is a matter of taste I guess.

bgruening commented 5 years ago

means a monolithic container which makes it heavier to move around, more complicated to maintain, etc.

Can you elaborate on this? Galaxy is taking care of all this. There is nothing you need to do. We are running those containers - also the multiple-tool-ones here in Freiburg. Have a look at https://quay.io/organization/biocontainers for mulled- containers. They are pretty standard and many IUC tools are using them already.

pcm32 commented 5 years ago

Well, I guess that if we built multi-tool containers out of the *-scripts packages, so that most of the scripting logic being used still remains in the container, this is something we can live with. It breaks our model a bit when using it in other workflow environments though, as it is not so easy to replicate there. That said, I think that the technology flows towards microservices-based containers (one container, one tool, in our case), not agglomerated containers. This also means that if the number of tools being used grows a lot, the containers that you have to move around will be heavy, especially if some tools drag in many dependencies. If I need to use a single tool, why should I be moving 10 or 15 other tools in and out inside a monolithic container? Also remember that on cloud providers you pay for data transfers in and out. This would make every pull/tool execution more expensive.

My other concern is that it makes the development process slower. Using a container that comes directly out of bioconda means that today we can still fix that container quickly if needed, as we can easily replicate the bioconda-to-container process locally. However, if we realise while developing the wrappers with the multi-tool containers that there are errors that need changes in the containers, we need to make the change in bioconda, open a pull request, wait for it to be merged, and only then get a new multi-tool container. This wait would be too problematic for us. We have replicated the bioconda-to-containers process locally to alleviate it in the direct case, but we don't have a way of alleviating the non-direct case.

pcm32 commented 5 years ago

But, to sum it up, from the excellent summary that @mblue9 set up, I would only ask that we do individual functionality-tool wrappers (that follow the same interface per functionality, so that they could easily be merged later) instead of a single multi-tool functionality wrapper. I hope that this is a bearable compromise.
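To make the proposal concrete, here is a minimal sketch of how same-interface wrappers could be listed individually and merged per functionality later on. The `Wrapper` class, the functionality labels ("normalisation", "read-10x") and `merge_by_functionality` are made up for illustration; the script names are the ones listed further down this thread, and `bioconductor-scater-scripts` is the Bioconda package name announced later in the thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Wrapper:
    functionality: str  # shared interface label (hypothetical), e.g. "normalisation"
    provider: str       # underlying tool, e.g. "seurat"
    package: str        # the single conda package/container this wrapper needs
    script: str         # script shipped by that package


WRAPPERS = [
    Wrapper("normalisation", "seurat", "r-seurat-scripts", "seurat-normalise-data.R"),
    Wrapper("normalisation", "scater", "bioconductor-scater-scripts", "scater-normalize.R"),
    Wrapper("read-10x", "seurat", "r-seurat-scripts", "seurat-read-10x.R"),
    Wrapper("read-10x", "scater", "bioconductor-scater-scripts", "scater-read-10x-results.R"),
]


def merge_by_functionality(wrappers):
    """Group per-provider wrappers into one entry per functionality."""
    merged = {}
    for wrapper in wrappers:
        entry = merged.setdefault(
            wrapper.functionality, {"providers": {}, "packages": set()}
        )
        entry["providers"][wrapper.provider] = wrapper.script
        entry["packages"].add(wrapper.package)
    return merged


merged = merge_by_functionality(WRAPPERS)
for functionality, entry in sorted(merged.items()):
    print(functionality, sorted(entry["providers"]), sorted(entry["packages"]))
```

Each individual wrapper needs exactly one package, while every merged entry needs the union of its providers' packages, which is the point at which a multi-package container becomes necessary.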

pcm32 commented 5 years ago

We had a chat with @bgruening regarding other topics, and he clarified to me some concerns I had regarding the multi-tool container generation process. So in that sense, I'm happy to go with single functionality (multi-tool) wrappers, given that all execution scripts will still be in the *-scripts packages that we put in the multi-tool container (and hence can be reutilised easily for other purposes). I was concerned that we would need to enact changes in the Galaxy runners (Galaxy core code) to have some logic defining which container would be used by a tool given certain selections in the wrapper (but this is circumvented by the multi-tool container being a single container).

In line with that, my concerns would be mostly at the usability and wrapper development level. Please consider the following that I wrote to Bjoern in our recent conversation:

In this regard, I agree that having a core functionality merging different tools is the best in terms of readability of the workflow. I wonder though if this might complicate usability. For instance, when running the filtering step with seurat, the user might need to input certain parameters that are unique to seurat filtering. Yet when the user changes the tool in the dropdown to scater (in the filtering module), scater may have other parameters that need to be set and cannot use the ones provided for seurat (filtering), although the user might think that he/she is done with setting parameters. To advanced users it might be perfectly clear that new parameters need to be set, but I presume that this might complicate things for less experienced users, especially those with no “tool running in the cli” experience. This, I presume, could in turn also complicate tool development (as you would have plenty of conditionals to both use and display parameters).

What do you think?
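A rough sketch of the bookkeeping such a multi-tool wrapper would need when the user switches provider: parameters that were filled in for one provider become stale, and the new provider's own parameters are still missing. The parameter names below are hypothetical and are not the real options of either package.

```python
# Hypothetical parameter names, for illustration only.
PROVIDER_PARAMS = {
    "seurat": {"low_threshold", "high_threshold"},
    "scater": {"nmads"},
}


def check_selection(provider, supplied):
    """Return (stale, missing) parameter names for the selected provider."""
    expected = PROVIDER_PARAMS[provider]
    stale = sorted(set(supplied) - expected)    # set for another provider
    missing = sorted(expected - set(supplied))  # still to be filled in
    return stale, missing


# The user filled in the form while "seurat" was selected...
filled_in = {"low_threshold": 200, "high_threshold": 2500}
print(check_selection("seurat", filled_in))
# ...and then switched the dropdown to "scater" without touching anything else.
print(check_selection("scater", filled_in))
```

Every provider added to the dropdown adds another branch of this kind, both for displaying the parameters and for building the command line.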

mblue9 commented 5 years ago

Personally I wasn't feeling convinced that multi-tool is the way to go here and I feel even less convinced now. But I'm open to having my opinion changed.

The reasons I don't like the sound of it at this point are:

I was already wondering about what you've just asked @pcm32 . How to make the all-in-one wrappers without it becoming complicated/messy? I would prefer if @bgruening, @mtekman or someone who can imagine it better writes the first multi-tool wrapper here to show how it would work.

I also don't like the idea of the point you raised above, that this could force people to install tools they don't want. The seurat env is ~1GB. If someone only wants scater, will they be forced to install seurat too? And what if more tools are added later? I would prefer for multi-tools to be enabled only if admins can choose what tools they want to install, and where they can choose to install just one if that's all they want. Or have I misunderstood, and that's not an issue?

Could multi-tools potentially be more confusing for users? Maybe this is a minor concern, but looking at the current seurat and scater functions below that are in the *-scripts, would their normalise functions be in a tool called e.g. "scNormalisation", and then their other functions would be separate, with tool-specific names e.g. "Seurat Find Variable Genes"? For users who come looking to use seurat or scater specifically, I wonder if it will be confusing to have some functions in generically named multi-tool bundles and others not.

I'm also not convinced yet that multi-tools are really needed by users for this and worth the effort. I feel like it might over-complicate things unnecessarily. My preference would be for tools to have just a label/hashtag/category, e.g. scNormalisation, that users can search on to pick the one they want, without any need for multi-tool wrappers. I see it as more important that there are examples in the trainings and workflows that show users what tools to use, rather than trying to help them out with complicated multi-tool bundles.

But I'm happy to be convinced if someone else will write the first multi-tool wrapper and demonstrate the benefit of the multi-tool approach here. In the meantime my preference would be to work on the "normal" single tool wrappers that aren't shared between the tools (e.g. seurat-find-variable-genes.R)

Common Functions

seurat-get-random-genes.R, seurat-normalise-data.R, seurat-read-10x.R

scater-get-random-genes.R, scater-normalize.R, scater-read-10x-results.R

Seurat-Specific Functions

seurat-create-seurat-object.R, seurat-dim-plot.R, seurat-filter-cells.R, seurat-find-clusters.R, seurat-find-markers.R, seurat-find-variable-genes.R, seurat-run-pca.R, seurat-run-tsne.R, seurat-scale-data.R

Scater-Specific Functions

scater-calculate-cpm.R, scater-calculate-qc-metrics.R

bgruening commented 5 years ago

My conclusion from this great discussion is that UX counts and we should figure out what makes more sense with some prototypes and an iterative process. UX can mean multiple things, e.g. do not over-complicate tools for users, but also do not offer the same functionality in different/multiple tools if it can be avoided. Let's see how this works out :) A first step probably is to figure out if we can use loom to interchange the matrices.

mtekman commented 5 years ago

So when I was in the process of wrapping RaceID and Scater I saw the same incompatibility in certain methods, but I also realised that those same methods would not be so commonly used by users... who in principle should want to have a highly detailed and configurable analysis, but in practice will likely just want a tool that gives them a rough idea of what their clusters look like.

They can take the output R object and do a more detailed analysis on their own in R notebooks or even more specialised Galaxy tools.

I think the four main stages should be good enough for this rough purpose at least.

On Sat, 1 Sep 2018, 01:12 Björn Grüning, notifications@github.com wrote:

My conclusion from this great discussion is that UX counts and we should figure out what makes more sense with some prototypes and an iterative process. UX can mean multiple things, e.g. do not over-complicate tools for users, but also do not offer the same functionality in different/multiple tools if it can be avoided. Let's see how this works out :) A first step probably is to figure out if we can use loom to interchange the matrices.


pinin4fjords commented 5 years ago

Hi all- back in the office today, great to see this continued discussion.

My strong preference is for single wrappers around single steps of single tools, tagged, grouped and described such that users can easily select from functionally related tools for a given function (e.g. filtering, normalisation etc) in an informed way. I think that gets us at least some of the way to the benefits of a multi-tool from a UX perspective and allows admins to choose what packages to provide.

As others have alluded to I think implementing a multi-tool with option sets from multiple packages, and maintaining it through releases would be a headache, as would the requirements it would impose on software installation. I also suspect simpler wrappers will make for simpler workflow description and facilitate automated setup and execution.

But I also think that it wouldn't be the end of the world to have two sets of wrappers in toolsheds tagged appropriately ('simplified' vs 'advanced' or something) if we have very diverging views on this. They could still call the same underlying wrapper scripts.

The priority in my view is to get cross-language intermediate formats to work so that e.g. Seurat can interact with Scanpy. That would be very cool, whether via an hdf5 flavour like loom or something else.

ethering commented 5 years ago

Hi. I've been reading this thread with interest and it's great to see so many people interested in implementing single-cell tools in Galaxy. I've recently been chatting with @mtekman about single-cell tool development. From my side, I'm more interested in using Galaxy scRNA tools for training, usually for people new to the single-cell area. So, from that standpoint, I'm more interested in developing a few tools that I can take users through while explaining what they do, similar to the 'simplified' tools that @pinin4fjords mentions above. @mtekman and others have mentioned having Filtering, Normalisation, Confounder Removal, and Clustering tools, which would be perfect for someone like me. I'm happy to 'force' some methods into tools (e.g. for filtering, using the runPCA method in scater and removing outliers), so that running them is simple, but the user can still understand the rationale behind what they are doing. I've also been writing some R code that implements the SingleCellExperiment and scater classes, similar to https://github.com/ebi-gene-expression-group/bioconductor-scater-scripts but not so eloquent! Anyway, I've some time to work on this (although I need something usable for a course in the first week of December), and I'm more than happy to go with the flow and contribute as much as possible.

pcm32 commented 5 years ago

It is great that this is gaining such momentum. Maybe it would be useful to have a call where we can decide what to do, among others, in terms of:

I've set up this doodle poll so that we can agree on a time: https://doodle.com/poll/w8d4vhy3ikd4p964

pcm32 commented 5 years ago

This might come in handy re formats discussion. People from the Human Cell Atlas have been compiling some stuff here (our concerns at least with intermediate formats is mostly for the matrix + metadata representation): https://github.com/HumanCellAtlas/table-testing

mblue9 commented 5 years ago

I don't know much about the formats yet and would be happy to try loom or whatever you all think we should.

For the wrappers, I had a go at wrapping the seurat-scripts (one wrapper per core function) to see what it would look like, and below is how it currently looks in the tool panel. Wondering what do people think? Does this look like too many or ok?

The wrappers are here if you want a look (not all options have been added) https://github.com/mblue9/tools-iuc/tree/seurat_scripts/tools/seurat_scripts

[Screenshot: the seurat-scripts wrappers as they appear in the Galaxy tool panel]

pinin4fjords commented 5 years ago

That's friggin' awesome @mblue9 , quick work!

FYI everyone: bioconductor-singlecellexperiment-scripts and bioconductor-scater-scripts Bioconda packages now created.

pinin4fjords commented 5 years ago

P.S. anyone who wants more R functions wrapped in any of the Bioconda packages mentioned, feel free to fork the associated source repos (e.g. https://github.com/ebi-gene-expression-group/r-seurat-scripts) and make PRs. Follow existing examples or contribution guidelines we've put here: https://tertiary-workflows-docs.readthedocs.io/en/latest/scripts_for_r_packages.html.

mblue9 commented 5 years ago

Ok great, thanks for the nice feedback! :smile:

One thing I'm wondering is, so far these tools pass around seurat RDS objects, but for Galaxy users being able to visualise what's happening, without having to resort to visualising the RDS object in R, is important I think (many of them can't/don't use R).

So I'm currently wondering if I should add the visualisation functions below to r-seurat-scripts?

Or if instead, would it be better to create a completely separate Galaxy tool for visualisation e.g. "Seurat visualise"? Maybe one that combines the seurat visualisation functions into one tool with options where the user can select which to run? so that the tool panel isn't overwhelming e.g. with a tool for each of these functions:

VlnPlot, GenePlot, VizPCA, PCAPlot, PCHeatmap, PCElbowPlot, TSNEPlot, FeaturePlot, DoHeatmap

Or what do people think?

pinin4fjords commented 5 years ago

Firstly, it would be great if you could add the tools to r-seurat-scripts. They weren't an immediate priority for us, but we'd have got there eventually. PRs would be very welcome.

I think the second point is distinct from that. As with the other 'multi tool' approach discussed, it would be really nice for consistency if a 'visualisation multi tool' simply picked which of the individual visualisation r-seurat-scripts to call. If others (possibly us) need more discrete tools for workflow steps, they can just use the same bioconda script with a simpler Galaxy wrapper.

I think this would present similar development problems to previous discussion- you'd have to find a way of cramming all the diverse options from the different visualisation methods into a single tool. But if that's easy to do in a UI-friendly way it could work.

bgruening commented 5 years ago

Hi all!

Sorry for being so silent over the last days. @pcm32 a telco sounds great. @MoHeydarian also would like to help with scRNA and Galaxy integration so we are a pretty big team. A telco to coordinate the effort would be great. I did some research about the intermediate file format and I think the loom format is the way to go, at least until HCA comes up with something better, that has such a good support across languages. Does anyone want to help trying loom in her/his favorite tool?

Meeting next week Thursday? I could make a doodle to find a timeslot that fits from USA to Australia :)

pcm32 commented 5 years ago

Hi all, maybe we can still make it for a meeting this week, please fill in the doodle. Otherwise let me know if times don't fit:

I've set up this doodle poll so that we can agree on a time: https://doodle.com/poll/w8d4vhy3ikd4p964

pcm32 commented 5 years ago

From the scRNA-tools database (https://www.scrna-tools.org), these are the 25 most highly cited tools (this could be of interest for deciding what other tools to add, or in which order):

| Name | Platform | DOIs | Citations | License | Categories |
|---|---|---|---|---|---|
| inferCNV | R | 10.1126/science.1254257 | 752 | - | Variants, Visualisation |
| BackSPIN | Python | 10.1126/science.aaa1934 | 626 | BSD 2-clause | Gene Filtering, Clustering |
| Monocle | R | 10.1038/nbt.2859; 10.1038/nmeth.4150; 10.1101/110668; 10.1038/nmeth.4402 | 586 | Artistic-2.0 | Clustering, Ordering, Differential Expression, Marker Genes, Expression Patterns, Dimensionality Reduction, Visualisation |
| SPADE | R | 10.1038/nbt.1991; 10.1038/nprot.2016.066 | 377 | GPL (>= 2) | Clustering, Ordering, Marker Genes, Dimensionality Reduction, Visualisation |
| Seurat | R | 10.1038/nbt.3192; 10.1101/164889; 10.1038/nbt.4096 | 329 | GPL-3 | Normalisation, Imputation, Integration, Gene Filtering, Clustering, Differential Expression, Marker Genes, Variable Genes, Dimensionality Reduction, Visualisation |
| scLVM | R/Python | 10.1038/nbt.3102 | 312 | Apache-2.0 | Normalisation, Variable Genes, Cell Cycle, Visualisation |
| SCDE | R | 10.1038/nmeth.2967 | 232 | - | Differential Expression, Gene Sets, Visualisation |
| salmon | C++ | 10.1101/021592; 10.1038/nmeth.4197; 10.1101/335000 | 215 | GPL-3 | UMIs, Quantification |
| CellRanger | Python/R | 10.1038/ncomms14049 | 184 | - | Alignment, UMIs, Quantification, Quality Control, Clustering, Differential Expression, Marker Genes, Dimensionality Reduction, Visualisation, Interactive |
| Wishbone | Python | 10.1038/nbt.3569 | 121 | GPL-2 | Ordering, Expression Patterns, Visualisation, Interactive |
| MAST | R | 10.1186/s13059-015-0844-5 | 114 | GPL (>= 2) | Quality Control, Normalisation, Differential Expression, Gene Sets, Gene Networks |
| SC3 | R | 10.1101/036558; 10.1038/nmeth.4236 | 96 | GPL-3 | Clustering, Interactive |
| scran | R | 10.1186/s13059-016-0947-7; 10.1101/165118; 10.1038/nbt.4091 | 91 | GPL-3 | Normalisation, Integration, Variable Genes, Cell Cycle, Visualisation, Interactive |
| SCUBA | MATLAB | 10.1073/pnas.1408993111 | 89 | - | Ordering, Expression Patterns |
| ZIFA | Python | 10.1186/s13059-015-0805-z | 89 | MIT | Dimensionality Reduction |
| BASiCS | R | 10.1371/journal.pcbi.1004333; 10.1186/s13059-016-0930-3; 10.1101/237214; 10.1016/j.cels.2018.06.011 | 89 | GPL (>= 2) | Normalisation, Differential Expression, Variable Genes, Simulation |
| DPT | R/MATLAB | 10.1038/nmeth.3971 | 89 | GPL-3 | Ordering, Expression Patterns, Visualisation |
| TraCeR | Python | 10.1038/nmeth.3800 | 87 | Apache-2.0 | Assembly, Alignment, Quantification, Immune, Visualisation |
| AltAnalyze | Python | 10.1038/nature19348 | 83 | Apache-2.0 | Quantification, Normalisation, Gene Filtering, Clustering, Classification, Differential Expression, Marker Genes, Gene Sets, Gene Networks, Cell Cycle, Dimensionality Reduction, Alternative Splicing, Visualisation, Interactive |
| umis | Python | 10.1101/073692; 10.1038/nmeth.4220 | 72 | MIT | UMIs, Quantification |
| STEMNET | R | 10.1038/ncb3493 | 71 | GPL-3 | Ordering, Visualisation |
| RaceID2 | R | 10.1016/j.stem.2016.05.010 | 68 | - | Rare Cells |
| StemID | R | 10.1016/j.stem.2016.05.010 | 68 | - | Ordering, Stem Cells |
| TSCAN | R | 10.1093/nar/gkw430 | 61 | GPL (>= 2) | Clustering, Ordering, Marker Genes, Visualisation, Interactive |
| destiny | R | 10.1093/bioinformatics/btv715 | 56 | GPL | Dimensionality Reduction, Visualisation |

pcm32 commented 5 years ago

To try to track existing advances (this is by no means comprehensive, just a start with the things I'm aware of), I thought I'd put links to the different parts (Bioconda pkg, scripts pkg, Galaxy wrapper, Loom support) for each of the tools (please add additional tools that you are working on, have an interest in, or know of existing efforts for):

https://docs.google.com/spreadsheets/d/1Bze_-u3SfNa4CS7PZOocLmmEw9IhhyxpjYB5tEM56oU/edit#gid=0

Please request access. I'm happy to move this to a proper tracker like Trello or Pivotal if people want. Main objective for me is that if someone has already started on any of these, we contribute work there instead of duplicating, and if you want to start working on something, you can declare where you are leaving that for reusing it. I hope that we manage to have a chat on Thursday!

pcm32 commented 5 years ago

@mblue9 I'm trying your draft Galaxy wrappers for seurat-scripts, but datatype rds doesn't seem to be recognised in my setup. Do I need to add a particular datatype or should we switch it to the more widely available rdata Galaxy datatype? Thanks! I can PR if needed.

bgruening commented 5 years ago

@pcm32 it's only in dev so far: https://github.com/galaxyproject/galaxy/blob/dev/config/datatypes_conf.xml.sample#L506

I filled the table a little bit. See you soon!

pcm32 commented 5 years ago

That seems to be extension .rdata.seurat, whereas the wrappers use .rds. Is a correction needed in the wrappers? Can instances pull datatypes somehow when they import a tool from the toolshed, or do admins need to fiddle with the datatypes_conf.xml file for this to work out of the box? Thanks!

mblue9 commented 5 years ago

@pcm32 sorry yeah, I was experimenting with an rds datatype and had added that datatype in my local test Galaxy. I've changed it to rdata in those xmls now.

Also sorry I can't make a call at those times this week (my Oz timezone makes things a bit awkward). I'm also going to be tied up with non-scRNA work for the next little while but I'd still be keen to contribute here as I can. I could maybe keep going with seurat.

@pinin4fjords thanks for the comments on the viz. Yeah may be better to just stick to the one function per wrapper and not worry about how the tool panel looks at this point (the tool panel does have a search function :smile:). But in that case I would be really interested in seeing if there's an automated way to generate the wrappers as a starting point, as that was a Lot of copy/paste to make those seurat wrappers ^^, without even adding many options/labels/help so far (and I found myself drifting into inconsistency between the wrappers in parameter naming etc).

@pcm32 I've requested access to your spreadsheet. For this:

I'm happy to move this to a proper tracker like Trello or Pivotal if people want.

what about a Github Project board? I haven't used them myself but wondering if that could work, rather than using another tool and seeing as we're all here in Github anyway.

mtekman commented 5 years ago

(hi everyone, back from holiday) I'd also prefer to keep things on Github just for convenience. I take it that the telco date is on the 14th?

pcm32 commented 5 years ago

Hi all, sorry for the delay, best option seems to be Fri 14/09 at the time shown here: https://doodle.com/poll/w8d4vhy3ikd4p964

(This is 14:15 BST/CET time)

There are call details there in the link as well. Should we have issues, will post another call link here.

pcm32 commented 5 years ago

We start the call in 10 minutes here: http://meet.google.com/kju-fecm-uwm

blankenberg commented 5 years ago

Unfortunately missed this call, hope things went well.

blankenberg commented 5 years ago

Just to close the loop, Loom datatype was added to Galaxy 18.09 here: https://github.com/galaxyproject/galaxy/pull/6723

suhaibMo commented 5 years ago

Hi all, I'm Suhaib (@suhaibMo), working with @pcm32 and @pinin4fjords. I previously wrote a few R scater wrappers for data processing functions (https://github.com/ebi-gene-expression-group/bioconductor-scater-scripts) that have been integrated into a Bioconda recipe (https://github.com/bioconda/bioconda-recipes/tree/7d1f13c7f91fc65ed235eb4b860cfdb0287ab082/recipes/bioconductor-scater-scripts). I'm now moving on to writing Galaxy wrappers for Scater (as a newbie), and I'm familiarising myself with the process and the XML schema. However, I understand @mtekman is planning to write a Galaxy wrapper for scater? I aim to have the following wrappers for scater:

scater-read-10x-results.R, scater-normalize.R, scater-calculate-cpm.R, scater-extract-qc-metric.R, scater-calculate-qc-metrics.R, scater-is-outlier.R

If anyone is planning to write any of the above, or has work ongoing, could you please ping me so that we don't duplicate effort and can perhaps re-use it. Thanks!

pcm32 commented 5 years ago

Hi there! after inspection of scanpy, I see that this is as well mostly a library and we would need to write scripts to get direct executables. I will start this effort and comment here shortly the repo where I'm leaving those scripts. I'm trying to adhere as much as possible to the r-seurat-scripts interfaces defined.

bebatut commented 5 years ago

@pcm32 I planned to work on scanpy too. My idea was to use a script to generate "automatically" a skeleton of the wrapper for each module of the library. Do you want to do that together?
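As a rough idea of what such a generator could look like, here is a minimal sketch that builds a skeleton Galaxy tool XML from a Python function signature. The mapping from default-value types to Galaxy param types, the rule that parameters without a default are dataset inputs, and the `filter_cells` stand-in function are all assumptions made for illustration; none of this is taken from scanpy or from an existing generator.

```python
import inspect
import xml.etree.ElementTree as ET

# Assumed mapping from the type of a default value to a Galaxy <param> type.
GALAXY_TYPES = {int: "integer", float: "float", str: "text", bool: "boolean"}


def tool_skeleton(func, tool_id, package):
    """Build a skeleton <tool> element from the signature of func."""
    tool = ET.Element("tool", id=tool_id, name=func.__name__, version="0.1.0")
    requirements = ET.SubElement(tool, "requirements")
    ET.SubElement(requirements, "requirement", type="package").text = package
    ET.SubElement(tool, "command").text = "TODO"  # to be written by hand
    inputs = ET.SubElement(tool, "inputs")
    for name, param in inspect.signature(func).parameters.items():
        if param.default is inspect.Parameter.empty:
            # Assumption: a parameter without a default is a dataset input.
            ET.SubElement(inputs, "param", name=name, type="data", label=name)
            continue
        galaxy_type = GALAXY_TYPES.get(type(param.default), "text")
        attrs = {"name": name, "type": galaxy_type, "label": name}
        if galaxy_type == "boolean":
            attrs["checked"] = str(param.default).lower()
        else:
            attrs["value"] = str(param.default)
        ET.SubElement(inputs, "param", attrs)
    ET.SubElement(tool, "outputs")  # left empty in the skeleton
    return tool


def filter_cells(adata, min_genes=200, max_mito=0.05, copy=False):
    """Hypothetical library function, used only to exercise the generator."""


skeleton = tool_skeleton(filter_cells, "example_filter_cells", "scanpy")
print(ET.tostring(skeleton, encoding="unicode"))
```

The output is only a starting point: labels, help text, dataset formats, the command block and the outputs would still need to be filled in by hand, but parameter names and defaults would at least stay consistent across wrappers.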

pcm32 commented 5 years ago

So, looking at scanpy methods I see quite a good amount of functions, and then on top of that there is functionality that relies on python idioms that are not exactly functions. For instance the filtering is done like:

adata = adata[adata.obs['n_genes'] < 2500, :]
adata = adata[adata.obs['percent_mito'] < 0.05, :]

So I'm not sure how that would go automatically. But do give it a try and see where you get.
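For the hand-written route, one way to turn the subsetting idiom above into a script interface, as a minimal sketch: the thresholds become repeated command-line options and the script builds the boolean mask itself. The `--max COLUMN=VALUE` option and the function names are made up for illustration and are not taken from scanpy-scripts; plain lists stand in for `adata.obs` so that the sketch runs without scanpy installed.

```python
import argparse


def build_parser():
    """Command-line interface with a repeatable, hypothetical --max option."""
    parser = argparse.ArgumentParser(description="Filter cells on per-cell metadata.")
    parser.add_argument(
        "--max",
        dest="upper",
        action="append",
        default=[],
        metavar="COLUMN=VALUE",
        help="keep cells whose COLUMN is strictly below VALUE (repeatable)",
    )
    return parser


def parse_bounds(items):
    """Turn ['n_genes=2500', ...] into {'n_genes': 2500.0, ...}."""
    bounds = {}
    for item in items:
        column, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected COLUMN=VALUE, got {item!r}")
        bounds[column] = float(value)
    return bounds


def keep_mask(obs, upper_bounds):
    """Return one boolean per cell: True if every bounded column is below its limit."""
    n_cells = len(next(iter(obs.values())))
    mask = [True] * n_cells
    for column, limit in upper_bounds.items():
        mask = [keep and value < limit for keep, value in zip(mask, obs[column])]
    return mask


# Stand-in for adata.obs: one list per metadata column, one entry per cell.
obs = {"n_genes": [500, 3000, 1200], "percent_mito": [0.01, 0.02, 0.20]}
args = build_parser().parse_args(["--max", "n_genes=2500", "--max", "percent_mito=0.05"])
mask = keep_mask(obs, parse_bounds(args.upper))
print(mask)  # [True, False, False]
# With a real AnnData object, the mask (as a boolean array) would then be
# applied in the same way as in the idiom above: adata = adata[mask, :]
```

Keeping the thresholds as generic column/value pairs means the same script can serve both of the filters shown above without a dedicated option per metadata column.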

The other thing I'm after is to have a similar interface to what we have in r-seurat-scripts, wherever possible.

I'm uploading what I have so far here: https://github.com/ebi-gene-expression-group/scanpy-scripts/tree/feature/filter-cells

pcm32 commented 5 years ago

By the way @bebatut, I'm following this for inspiration of usage: https://nbviewer.jupyter.org/github/theislab/scanpy_usage/blob/master/170505_seurat/seurat.ipynb

and this is what I'm trying to emulate in term of scripts (names, inputs, etc): https://github.com/ebi-gene-expression-group/r-seurat-scripts

But of course we should discuss how to proceed.

mtekman commented 5 years ago

@suhaibMo I have no active plans on Scater at the moment, but it seems that you and I may have some overlap in https://github.com/galaxyproject/tools-iuc/pull/1841

suhaibMo commented 5 years ago

@mtekman Thanks for pointing that out. Looking at the commits, I understand that in many instances you have embedded the R functionality within the XML, whereas we aim to have it written outside the XML and called from within it. So perhaps I could use the ones that overlap as templates to work from. https://github.com/ebi-gene-expression-group/tools-iuc/tree/scater_scripts/tools/scater_scripts

mblue9 commented 5 years ago

By the way @bebatut, I'm following this for inspiration of usage: https://nbviewer.jupyter.org/github/theislab/scanpy_usage/blob/master/170505_seurat/seurat.ipynb

I'm still tied up with non-scRNA-seq work at the moment but just to add here, that link is to the scanpy version of a seurat tutorial and I was thinking of making a Galaxy tutorial version of that when the tools are available (and when I get time) so that then there'll be R/Python/Galaxy versions of the same tutorial (we're teaching all 3 languages here). I know @mtekman's created an scrna-seq tutorial for upstream analysis (https://github.com/galaxyproject/training-material/pull/969) and @ethering mentioned a training course so just wondering if anyone is thinking of creating training material for these tools discussed here?

pcm32 commented 5 years ago

That is great @mblue9. We are actually following that tutorial to write the scanpy scripts. @nh3 started with the Sanger and us and is continuing work on that on the repo I mentioned above. Welcome @nh3!

nsoranzo commented 5 years ago

This paper may be of interest: "scRNA-seq mixology: towards better benchmarking of single cell RNA-seq protocols and analysis methods" https://www.biorxiv.org/content/early/2018/10/03/433102

mblue9 commented 5 years ago

Great @pcm32 ! @nsoranzo that paper is definitely of interest (they also published a similar paper last year on bulk RNA-seq that I have been thinking of using for benchmarking/training)

bebatut commented 5 years ago

I started the integration of each scanpy function as a wrapper (#2121). It is complementary to @pcm32 and @nh3 work, just a smaller level of granularity. The idea would be to share parts using the macros

bgruening commented 5 years ago

Hi all. So here is our summary from the European Galaxy conference and a list of repositories, branches, and PRs I know of that might be relevant.

https://docs.google.com/document/d/1RpWWsox6XrdqNjW4ffbiO76txfuc-26XfwKJdnTpisE

The short summary, but please read the document, is that we would like to try out loom as an intermediate file format for (hopefully) all tools. Which means that we would like to integrate small shims into every Galaxy tool to spit out loom if the upstream package does not yet support it. The other topic was how we structure the tools, we identified 3 options that are currently used and we discussed them a bit and agreed to try version 3 - please look at the document for more details.
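To illustrate what such a shim has to do, here is a minimal sketch under stated assumptions. Loom keeps one main matrix with genes as rows and cells as columns, plus row and column attributes whose lengths match the matrix; by convention the gene names go in a row attribute called `Gene` and the cell identifiers in a column attribute called `CellID`. The function name and the nested-list input are made up for illustration, and nothing is written to disk here; a real shim would hand the result to a loom writer.

```python
def to_loom_layout(cells_by_genes, cell_ids, gene_ids):
    """Re-orient a cells x genes table into the genes x cells layout used by loom."""
    if len(cells_by_genes) != len(cell_ids):
        raise ValueError("one row per cell is required")
    if any(len(row) != len(gene_ids) for row in cells_by_genes):
        raise ValueError("every cell needs one value per gene")
    # Transpose: rows become genes, columns become cells.
    matrix = [list(column) for column in zip(*cells_by_genes)]
    return {
        "matrix": matrix,
        "row_attrs": {"Gene": list(gene_ids)},
        "col_attrs": {"CellID": list(cell_ids)},
    }


# Toy input in cells x genes orientation (2 cells, 3 genes).
counts = [[1, 0, 3], [0, 2, 0]]
layout = to_loom_layout(counts, ["c1", "c2"], ["g1", "g2", "g3"])
print(layout["matrix"])  # [[1, 0], [0, 2], [3, 0]]
```

The orientation check matters because tools do not agree on it: a package that stores cells as rows needs the transpose, while one that already stores genes as rows does not.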

I just want to stress one point: this is something which we discussed and which we want to try. This is not, and never was, an IUC decision. We just discussed all options and came to the conclusion that we try version 3. Small story: @mtekman rewrote the RaceID wrappers, I think, 4 times; every version was tested with real users and feedback was collected. I don't think this is bad. It's time-consuming, but in the end I think we are happy with the outcome. The same is true for the DESeq2 wrapper. The current interface is the result of iterating over many different versions, until we came up with the one that has now been used multiple thousands of times (thanks @pavanvidem for the initial work on it).

That said I don't think its a waste of time to try different approaches, as long as we all work together and can come up with a maintainable and useable scRNA tool suite.

Sorry for all the confusion and for me being so late to drop the discussion somewhere. I'm fine with moving the discussion to GDocs if it makes it more clear that this has nothing to do with IUC in the first place.

Thanks again to all of you - lets create something awesome!

pcm32 commented 5 years ago

Hi all! If you are at GCC today (Wednesday), we are meeting next to the ice creams during the poster session at 15:10 local time to discuss integrating our Galaxy single-cell tools.

hepcat72 commented 4 years ago

I wanted to make a few minor suggestions about this effort.

  1. Add a way to set check_duplicates = FALSE in seurat so that seurat isn't likely to fail when you're analyzing a small test dataset. See issue 749 in seurat and the issues it links to.
  2. Keep split-seq in mind so that it can be dropped into a workflow where similar tools are used.
  3. Provide a conversion from a bundle including .mtx, genes.csv, and cells.csv to a tsv matrix that can be input into the seurat tool (or provide a way to input those files into seurat using its read10x method, as long as split-seq output is compatible).
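For the third suggestion, a minimal sketch of the conversion using only the Python standard library. It assumes the .mtx file is in Matrix Market coordinate format with genes as rows and cells as columns, and that genes.csv and cells.csv have no header line and carry the identifier in their first column; the actual split-seq output may differ on all of these points. The function names are made up for illustration.

```python
import io


def read_mtx(handle):
    """Parse a Matrix Market coordinate file into (n_rows, n_cols, entries)."""
    header = handle.readline()
    if not header.startswith("%%MatrixMarket matrix coordinate"):
        raise ValueError("expected a Matrix Market coordinate header")
    line = handle.readline()
    while line.startswith("%"):  # skip comment lines
        line = handle.readline()
    n_rows, n_cols, _n_entries = (int(field) for field in line.split())
    entries = {}
    for line in handle:
        if not line.strip():
            continue
        row, col, value = line.split()
        entries[(int(row) - 1, int(col) - 1)] = value  # the file is 1-indexed
    return n_rows, n_cols, entries


def first_column(handle):
    """Read the first comma-separated field of every non-empty line."""
    return [line.split(",")[0].strip() for line in handle if line.strip()]


def mtx_to_tsv(mtx, genes, cells, out):
    """Write a dense genes x cells TSV with a header row of cell identifiers."""
    n_rows, n_cols, entries = read_mtx(mtx)
    gene_ids, cell_ids = first_column(genes), first_column(cells)
    if len(gene_ids) != n_rows or len(cell_ids) != n_cols:
        raise ValueError("matrix dimensions do not match genes/cells files")
    out.write("\t".join(["gene"] + cell_ids) + "\n")
    for r, gene in enumerate(gene_ids):
        values = [entries.get((r, c), "0") for c in range(n_cols)]
        out.write("\t".join([gene] + values) + "\n")


# Toy bundle held in memory; real use would pass open file handles instead.
mtx = io.StringIO(
    "%%MatrixMarket matrix coordinate integer general\n"
    "% toy example\n"
    "2 3 3\n1 1 5\n2 3 7\n1 2 1\n"
)
out = io.StringIO()
mtx_to_tsv(mtx, io.StringIO("g1\ng2\n"), io.StringIO("c1\nc2\nc3\n"), out)
print(out.getvalue())
```

Note that the dense TSV grows with genes times cells, so for anything beyond a small test dataset, reading the sparse bundle directly would be the better of the two options.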