Add figures needed for FAQ - why did we use Alevin-fry?

allyhawkins commented 2 years ago

Based on feedback received in AlexsLemonade/scpca-docs#21, I am breaking out the script used to create the figures needed for the FAQ "Why did we use Alevin-fry?" I am also storing the figures here and will include a permalink to them in the docs, rather than moving them over to the scpca-docs repo.

This script is hard-coded to take as input the results from the previous benchmarking analysis that we did comparing Alevin-fry (using cr-like with selective alignment) and Cell Ranger for two single-cell and single-nuclei samples. As part of that analysis, the sce objects were generated using analysis/quantifier-comparisons/benchmarking_generate_qc_df.R.

I'm creating three plots to compare Alevin-fry to Cell Ranger: two density plots showing the distributions of UMI/cell and genes detected/cell, and a scatter plot showing the correlation of mean gene expression between the two tools in each sample. After talking with Josh about the plots, we decided to switch to density plots to better compare the two distributions, and to show them on a log scale. Additionally, I changed the plot labels to include the library ID and the type of sample (cell or nucleus).
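For reviewers who want to sanity-check what the two density plots summarize, here is a minimal sketch of the per-cell metrics. It is not the script in this PR (which is written in R); the function names, the dict-of-lists layout, and the toy counts are all assumptions made up for illustration.

```python
import math

def per_cell_metrics(counts):
    """Compute UMI/cell and genes detected/cell.

    counts: dict mapping gene -> list of UMI counts, one entry per cell
            (hypothetical layout; every list has the same cell order).
    """
    n_cells = len(next(iter(counts.values())))
    umi_per_cell = [0] * n_cells
    genes_per_cell = [0] * n_cells
    for gene_counts in counts.values():
        for i, count in enumerate(gene_counts):
            umi_per_cell[i] += count        # total UMIs in this cell
            if count > 0:
                genes_per_cell[i] += 1      # gene counts as detected
    return umi_per_cell, genes_per_cell

def log10_values(values):
    """Log-scale the values; zeros are dropped because log10(0) is undefined."""
    return [math.log10(v) for v in values if v > 0]

# Toy data: three genes measured in three cells (illustrative only).
toy_counts = {
    "geneA": [3, 0, 10],
    "geneB": [0, 0, 90],
    "geneC": [7, 1, 0],
}
umi, genes = per_cell_metrics(toy_counts)
log_umi = log10_values(umi)
```

The log-scaled values are what a density plot on a log axis would be drawn from, one curve per tool.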

I'm attaching the files here for easy review:

[Figures: UMI/cell comparison; Genes detected/cell comparison; Gene expression correlation]

allyhawkins commented 2 years ago

I removed the color from the bars above the plots, so there's no more combining of plots and the plots should now be consistent with each other. I also made the minor edits to the aws s3 cp statement for the metadata and removed the extra filtering step for the rowdata. I am now only filtering for genes that are detected in > 5% of cells, and then dropping any genes that are not found in both Alevin-fry and Cell Ranger by using drop_na() when spreading the mean values into individual columns, rather than using the additional step.
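To make the filtering logic easier to review, here is a minimal sketch of the same two steps in plain Python. It is not the R code in this PR: the function names and toy data are hypothetical, `None` stands in for the NA values that drop_na() removes, and it assumes the > 5% detection filter is applied to each tool separately and that the mean is taken over raw counts.

```python
def detected_fraction(gene_counts):
    """Fraction of cells in which the gene has a non-zero count."""
    return sum(1 for c in gene_counts if c > 0) / len(gene_counts)

def mean_expression(counts, min_fraction=0.05):
    """Mean count per gene, keeping only genes detected in > min_fraction of cells."""
    return {
        gene: sum(vals) / len(vals)
        for gene, vals in counts.items()
        if detected_fraction(vals) > min_fraction
    }

def shared_gene_means(alevin_counts, cellranger_counts, min_fraction=0.05):
    """Build a wide table of per-tool means and drop genes missing from either tool."""
    alevin = mean_expression(alevin_counts, min_fraction)
    cellranger = mean_expression(cellranger_counts, min_fraction)
    # One row per gene, one column per tool; None marks a missing value.
    wide = {
        gene: (alevin.get(gene), cellranger.get(gene))
        for gene in sorted(set(alevin) | set(cellranger))
    }
    # Dropping rows with any missing value keeps only genes found in both tools.
    return {gene: pair for gene, pair in wide.items() if None not in pair}

# Toy data with 40 cells per tool (illustrative only).
alevin_toy = {
    "geneA": [2] * 40,
    "geneB": [5] + [0] * 39,          # detected in 2.5% of cells, filtered out
    "geneD": [1] * 20 + [0] * 20,
}
cellranger_toy = {
    "geneA": [4] * 40,
    "geneB": [1] * 40,
    "geneC": [3] * 40,                # absent from the other tool, dropped
    "geneD": [0] * 20 + [2] * 20,
}
shared = shared_gene_means(alevin_toy, cellranger_toy)
```

Because dropping incomplete rows already removes any gene missing from either tool, a separate "genes found in both" filter before building the wide table would not change the result in this sketch.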

allyhawkins commented 2 years ago

> The formatting looks good to me, but before I approve, I want to note that the change from the previous version to this one in the correlation plot is quite substantial. I am not sure I understand why that might have happened. I would not have expected a change based on my understanding of the transformations that were done, but it looks like the lowest expression genes are now being excluded?

I noticed this too. After going through the code multiple times to figure out where the two methods could differ, it looks like there was an error in the original plot: we weren't removing the low-coverage genes when that should have been happening. I triple-checked, and with the low-coverage genes removed (by filtering out genes detected in < 5% of cells), the correlation plots should look like the ones that are now committed. This is the case whether or not I include the additional step of filtering for genes found in both tools before making the spread-out data frame. I will go ahead and adjust the size to match the other graphs, and then add the new plot.

AlexsLemonade / alsf-scpca

Add figures needed for FAQ - why did we use Alevin-fry? #141