AlexsLemonade / scpca-downstream-analyses

This repository is intended to store our pipeline for marker genes analysis.

Generate use case examples for scpca-downstream-analyses #61

Closed allyhawkins closed 2 years ago

allyhawkins commented 2 years ago

Eventually we plan to turn the pipelines being developed in this repo into something that can be used by external users. I'm making this issue to track the development of use case scenarios for these pipelines. We will keep track of the use cases that we develop in this Google Doc and then turn them into requirements.

allyhawkins commented 2 years ago

We have generated a draft of potential use cases, which can be found in the Google Drive and is ready for review.

For some more context, we are planning to offer a core pipeline that can take quantified single-cell data as input and perform normalization, dimensionality reduction, and clustering using different clustering methods, allowing the user to choose the clustering that they would like to use for their downstream analysis. We are then planning on offering different modules that would be an optional extension of the core pipeline, such as exploring the gene expression of specific marker genes, differential gene expression, data integration, etc.

We have put together a document describing potential use cases where a user would be looking to use the pipeline available for downstream single-cell analysis. The beginning of the document describes three types of potential users that we have kept in mind while generating the use cases. We will then use these use cases to identify what requirements we need to have in our pipeline.

Note that we did include some cases that could be beneficial to have, but may be too niche and outside the scope of what we would actually like to include in the pipeline. If you think there is something that should not be included, or is not a candidate for being incorporated, please note that in a comment so we can better prioritize when making requirements.

On that same note, if you think of additional uses that we may have missed, please include those in your comments as well.

Tagging relevant folks on the science team, @jashapiro @jaclyn-taroni @cbethell @sjspielman, to take a look. Please go through the document and leave any comments by EOD on Friday, March 4th.

jaclyn-taroni commented 2 years ago

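I've taken a look at the document. I think that anything for Bio Experts is going to be outside the scope of this project because:

  • Having it work will require getting the environment set up correctly. Consider the bumps we've had internally with working with renv which is now a requirement.
  • That audience is probably better served by us moving some of the steps (e.g., normalization) with "sensible defaults" into the production pipeline and being able to download it from the web interface of the portal.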

jashapiro commented 2 years ago

I've taken a look at the document. I think that anything for Bio Experts is going to be outside the scope of this project because:

  • Having it work will require getting the environment set up correctly. Consider the bumps we've had internally with working with renv which is now a requirement.
  • That audience is probably better served by us moving some of the steps (e.g., normalization) with "sensible defaults" into the production pipeline and being able to download it from the web interface of the portal.

I agree with this, especially the second point. The format that we are providing the data in is not the most directly compatible with existing pipelines or GUI tools, so some willingness to engage with files and format transformations is going to be required. I can imagine this changing in the future, but probably not for a little while.

I have other thoughts about renv specifically, but that is probably a separate discussion.

jaclyn-taroni commented 2 years ago

I have other thoughts about renv specifically, but that is probably a separate discussion.

I was just trying to illustrate my point by invoking renv. Not a comment on whether it should be a requirement!

allyhawkins commented 2 years ago

I've taken a look at the document. I think that anything for Bio Experts is going to be outside the scope of this project because:

  • Having it work will require getting the environment set up correctly. Consider the bumps we've had internally with working with renv which is now a requirement.
  • That audience is probably better served by us moving some of the steps (e.g., normalization) with "sensible defaults" into the production pipeline and being able to download it from the web interface of the portal.

This makes sense to me. I can definitely imagine bio experts wanting something where they input their data and get back objects ready for exploratory analysis, but we had talked about how difficult it would generally be to adapt what we currently have into something they can easily use. I went ahead and removed that user from the document, and either removed the use cases that were specific to them or updated them to reflect how a novice data analyst would use the pipeline (e.g., wanting to apply a process across multiple samples in an efficient manner).

I also addressed most of the comments, adding more specific statements about system requirements when there were questions about that.

There were two use cases that generated confusion and didn't seem feasible, so I removed those as well. I had included them as more thought-provoking use cases, covering things users might like rather than things we definitely needed to have.

There were a few comments that @jashapiro included at the end of the document that I wanted to address:

It seems like many of the use cases described are for external data, not ScPCA results. While integrating ScPCA data with external data seems like it is within the scope of this project, I would be wary of trying to make this too general. (Put simply, we are not going to be able to keep up!) We may want to consider use cases where people want to take ScPCA data out of our system to their own.

Could you provide an example of what you mean by taking ScPCA data out of our system into their own? I believe I went through and made some of the use cases more specific to ScPCA data, while still leaving some for external data. I do agree that we won't be able to keep up with everything, but I think taking in the output from Cell Ranger (as either HDF5 files or the mtx file) is feasible, and that input is going to be fairly common.

Related, I want to be very careful about clearly defining things we do think should not be done. While we did strive to make scpca results comparable to cell ranger, any comparisons between the two should come with a blaring warning. Similarly, importing normalized data from one system to use with a different set of normalized data seems very fraught!

I agree that we should not be combining different sets of normalized data and should only be using raw data, so I made that clearer. That being said, I can imagine that many users of ScPCA are looking to validate findings or add to their cohort if they don't have access to a lot of their own patient data, in which case they will have both ScPCA data and external data. I agree that they should proceed with caution (and we can provide those cautionary warnings), but I think having the pipeline work with both ScPCA data and Cell Ranger data (which we are partially able to do right now) is not out of the question. Perhaps we avoid integrating those two datasets at this point if they have been quantified using different methods, but being able to process both ScPCA and external data seems worthwhile.

Cell type abundance analysis seems like a use case we should support. i.e. I want to know if a particular cell type that I am interested in is more common in one set of samples than another.

See Use Case 10. If there is something that I am missing from here that you had in mind, please let me know.

Please let me know if there are any additional comments. I am planning to start moving the ideas from use cases to requirements in #66 as a next step.

jashapiro commented 2 years ago

Could you provide an example of what you mean by taking ScPCA data out of our system into their own?

I mostly just meant converting SCE objects to seurat/mtx etc. to prepare people to use other tools/workflows.

See Use Case 10. If there is something that I am missing from here that you had in mind, please let me know.

I think that covers a good part of what I was thinking, but there might be a more specific task of comparing between two sets of samples (biological replicates, hopefully). In that case, I would expect that outputting cell type proportions may not be sufficient. I imagine raw cell counts would be required, and I can imagine some system (I may be just speaking from hope) that would allow incorporating uncertainty in cell type assignments when identifying differential abundance.
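To illustrate the distinction between proportions and raw counts, here is a minimal sketch in Python. It is not part of the pipeline: the sample IDs, cell type labels, and the helper names `count_cell_types` and `to_proportions` are all hypothetical, and it assumes each cell already carries a single cell type assignment.

```python
from collections import Counter

# Hypothetical per-cell annotations as (sample_id, cell_type) pairs.
# These values are made up for illustration and are not ScPCA data.
cells = [
    ("sample_1", "T cell"),
    ("sample_1", "T cell"),
    ("sample_1", "B cell"),
    ("sample_2", "T cell"),
    ("sample_2", "B cell"),
    ("sample_2", "B cell"),
    ("sample_2", "B cell"),
]


def count_cell_types(cells):
    """Return {sample_id: Counter({cell_type: raw_count})}."""
    counts = {}
    for sample_id, cell_type in cells:
        counts.setdefault(sample_id, Counter())[cell_type] += 1
    return counts


def to_proportions(counter):
    """Convert the raw counts for one sample into proportions."""
    total = sum(counter.values())
    return {cell_type: n / total for cell_type, n in counter.items()}


raw_counts = count_cell_types(cells)
proportions = {s: to_proportions(c) for s, c in raw_counts.items()}

for sample_id in sorted(raw_counts):
    print(sample_id, dict(raw_counts[sample_id]), proportions[sample_id])
```

Proportions discard the total number of cells per sample, so a sample with 3 cells and one with 3,000 cells can look identical. Keeping the raw counts alongside the proportions preserves the information a differential abundance comparison between sample sets would need.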

allyhawkins commented 2 years ago

Closing this issue and noting that the use cases can be found in the Google Drive in ScPCA/Downstream Analyses/ScPCA-Downstream Analysis- Use Cases.