Add genes detected comparison across pre-processing tools

allyhawkins commented 3 years ago

Closes #109. Here I did a quick comparison of which genes are detected when using either cellranger or alevin-fry (with either cr-like or cr-like-em). To generate the rowdata data frame used here, I ran the alevin-fry output files through the benchmarking_generate_qc_df.R script.

For the most part, the genes that can be quantified are the same across tools, with ~200 genes found in cellranger but not in alevin-fry. The genes that are not identified in all tools have lower mean expression. I then did a quick pathway analysis to see what types of genes would be lost if we were to use alevin-fry over cellranger. For this, I chose over-representation analysis, with the list of "lost" genes as the target gene list and all genes identified in alevin-fry and cellranger as the background list. I wasn't quite sure if this was the best approach, but I think it should be sufficient for looking at any pathways that may be enriched in the lost gene list. That said, the low expression of the lost genes and the low % of cells they are detected in make me less worried about their overall contribution to the final output. I think we would be safe using alevin-fry and would not lose important information in comparison to cellranger. If we really wanted to test that, we could go further and do clustering, look for marker genes, and see if any of those change, but I don't know if that's entirely necessary.
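For reference, the over-representation test described above can be sketched as a one-sided hypergeometric test. This is only a minimal illustration in Python, not the code used in this PR; the function name `ora_p_value` and the toy gene IDs are made up for the example, and a real analysis would also need multiple-testing correction across pathways.

```python
from math import comb


def ora_p_value(pathway_genes, target_genes, background_genes):
    """One-sided hypergeometric p-value for over-representation.

    Hypothetical helper: asks how likely it is to see at least the observed
    number of pathway genes in the target ("lost") list, if the target list
    were drawn at random from the background list.
    """
    background = set(background_genes)
    # Only genes in the background can be counted on either side.
    target = set(target_genes) & background
    pathway = set(pathway_genes) & background

    n_background = len(background)
    n_pathway = len(pathway)
    n_target = len(target)
    n_overlap = len(target & pathway)

    # P(X >= n_overlap) for X ~ Hypergeometric(n_background, n_pathway, n_target)
    total = comb(n_background, n_target)
    tail = sum(
        comb(n_pathway, i) * comb(n_background - n_pathway, n_target - i)
        for i in range(n_overlap, min(n_pathway, n_target) + 1)
    )
    return tail / total


# Toy example with made-up gene IDs.
background = [f"g{i}" for i in range(10)]
pathway = ["g0", "g1", "g2"]
lost = ["g0", "g1", "g5", "g6"]
p = ora_p_value(pathway, lost, background)
```

The choice of background matters here: using all genes identified by either tool, as described above, keeps the test from treating genes that could never be quantified as evidence against enrichment.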

allyhawkins commented 3 years ago

> One way to deal with that is my suggestion below to look at the fully overlapping sets, but the other thing we could do is right at the start to restrict the analysis to only the common genes between the two before filtering for expression > 0.

Thanks for this suggestion, I think this makes complete sense! The genes that are actually expressed will be sample dependent, while the ability to detect genes is what we are interested in. I removed the requirement that genes be expressed before making the comparisons, and now only 40 genes are detected by cellranger but not by alevin-fry. With that change, though, there is no clear way of comparing the genes across the different resolution modes, but I am less concerned about that. Also, based on looking at a lot of cellranger output and always seeing around 19,000 genes, I am pretty sure cellranger also reports all genes that are detected, but I could be wrong.

Also, with this change, the distinction is now between a gene being detected in all tools and being unique to either cellranger or alevin-fry, so I have updated the gene expression plot: "shared" means found in all tools and "unique" means not found in all tools.

jashapiro commented 3 years ago

This isn't quite what I meant... I still want to see the comparisons among mappers for the expression, but only for the genes that are in both indexes. I went and checked the first two genes that are "missing" with CellRanger, and they are missing because they aren't in Ensembl 103, as I thought might be the case. The pages for ENSG00000130723 and ENSG00000263264 show that they were retired after v102, so this is not a tool difference, but an index difference.

What I think we want to do is select only the genes that are in both indexes, regardless of expression, then do the previous analysis where you looked at which are being expressed/detected in each sample and how that differs between the tools. It still isn't perfect (having the same reference version would be better!), but I don't think it is necessarily worth going back to redo all of the mapping.

allyhawkins commented 3 years ago

Sorry about the misinterpretation of your previous comment, @jashapiro. I believe I have addressed what you were asking for: I have now restricted the analysis I was doing previously to only those genes that are present in the indices of both alevin-fry and cellranger. So before restricting by mean > 0 and detected > 0, I filtered the data frame containing the list of genes to those found in both tools. Then I filtered on expression and obtained the lists of genes found in each tool. I believe this should account for any changes in genes detected based on the index alone and should highlight any changes in expression of genes across the tools in the same set of samples. Regardless, I don't think we are losing much information, and interestingly, similar pathways are coming up in the gene sets unique to cellranger and unique to alevin-fry cr-like-em.
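The order of filtering described above (common index first, then expression) can be sketched roughly as follows. This is an illustrative Python version, not the R code in this PR; the record layout (`gene_id`, `tool`, `mean`, `detected`) and the function name are assumptions, and it treats a gene as being in a tool's index whenever that tool has a row for it.

```python
def compare_detected_genes(rows, tools):
    """Restrict to genes in every tool's index, then compare detected genes.

    `rows` is a list of dicts with hypothetical keys: gene_id, tool, mean, detected.
    """
    # Step 1: genes present in each tool's index, regardless of expression.
    in_index = {t: {r["gene_id"] for r in rows if r["tool"] == t} for t in tools}
    common = set.intersection(*in_index.values())

    # Step 2: within the common genes, keep those with mean > 0 and detected > 0.
    detected = {
        t: {
            r["gene_id"]
            for r in rows
            if r["tool"] == t
            and r["gene_id"] in common
            and r["mean"] > 0
            and r["detected"] > 0
        }
        for t in tools
    }

    # Step 3: split into genes found by all tools and genes missing from at least one.
    shared = set.intersection(*detected.values())
    unique = {t: genes - shared for t, genes in detected.items()}
    return {"common": common, "shared": shared, "unique": unique}


# Toy data with made-up gene IDs and values.
toy_rows = [
    {"gene_id": "A", "tool": "cellranger", "mean": 1.0, "detected": 5},
    {"gene_id": "A", "tool": "alevin-fry", "mean": 0.5, "detected": 3},
    {"gene_id": "B", "tool": "cellranger", "mean": 0.2, "detected": 1},
    {"gene_id": "B", "tool": "alevin-fry", "mean": 0.0, "detected": 0},
    {"gene_id": "C", "tool": "cellranger", "mean": 2.0, "detected": 10},
    {"gene_id": "D", "tool": "cellranger", "mean": 0.0, "detected": 0},
    {"gene_id": "D", "tool": "alevin-fry", "mean": 0.0, "detected": 0},
]
result = compare_detected_genes(toy_rows, ["cellranger", "alevin-fry"])
```

In the toy data, gene C exists only in the cellranger index, so it is dropped in step 1 and never counted as a tool difference, which is the index effect this filtering order is meant to remove.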

AlexsLemonade / alsf-scpca

Add genes detected comparison across pre-processing tools #116