Comparison of Alevin-fry to Cellranger for 5' libraries

allyhawkins commented 3 years ago

Closes #133. This PR adds a comparison of using Cell Ranger to Alevin-fry for pre processing 5' libraries. Here, I am analyzing 2 10Xv2_5prime samples, SCPCR000286 and SCPCR000287, that have been processed with Cell Ranger and Alevin-fry. For Cell Ranger, the orientation is automatically detected based on how many reads are in the antisense direction (this was confirmed for these samples by looking at the web_summary.html). For Alevin-fry, I am using the same configurations that we are using for 3', selective alignment with the splici index, unfiltered permit list, and cr-like-em resolution, but with the use of the ISF flag for library type, indicating the orientation of the libary.

For this analysis, I generated the per cell and per gene metrics using the benchmarking_generate_qc_df.R script which creates the quant_info.tsv, coldata_df.tsv, and rowdata_df.tsv used for inputs to the analysis outlined in this notebook. In running that script again, I noticed an error in use of scpcaTools::import_quant_data() and had to add in the mtx_format option (this is described in AlexsLemonade/scpcaTools#46). This function is utilized in the make_sce_list.R benchmarking function and in order to successfully run using mapply I had to add in that argument.

After generating the necessary qc dataframes needed for plotting, I went ahead and restricted the analysis to only cells that were identified in both tools. I then compared the following:

mitochondrial content/cell
UMI/cell
genes detected/cell
mean gene expression
overlap of genes detected across the tools

Interestingly, these results are not as good as we have seen previously and the alevin-fry quantifications are showing much lower UMI/cell and genes detected/cell than with Cell Ranger. Also the correlation coefficients of mean gene expression are in the range of 0.5-0.6, whereas with the 3' data we were seeing coefficients above 0.98. I double checked the log files to ensure that both of these were run with the correct configurations using the ISF library type with Alevin-fry, but it appears that these are not as consistent as we would have hoped.

One thought I had is that the 5' data doesn't do as well with the splici index or the other configurations that we are using so we can try out other options with Alevin-fry, but I am open to other suggestions or ideas.

jashapiro commented 3 years ago

Ugh, these do not look good. I suspect that just switching to ISR may not be sufficient.

I'm worried that alevin may be reverse complementing the cell barcodes, such that they would no longer match the provided list. This wouldn't have been a problem until the introduction of the unfiltered mode, as cell barcodes were only checked within libraries, so RC everything would be fine.

One test would be to run alevin in knee mode, which would allow it to choose its own barcodes. Then we could check if those barcodes match reverse complements of the CR barcodes.

If that works, we could provide a reverse complemented barcode list for 5' samples, which we would then have to reverse after import. Which isn't the worst thing in the world, but does complicate matters.

It is also possible that the ability to specify read geometries added in salmon 1.4 might give us a solution, if we are allowed to use reversed start and finishes. So we might be able to specify --read-geometry 2[end-1] --bc-geometry 1[1-16] --umi-geometry 1[17-26] rather than the default --read-geometry 2[1-end] --bc-geometry 1[1-16] --umi-geometry 1[17-26]

Not sure if that is supported, but we could ask. (I think it would be worth diving a bit deeper to see if we can find if this is the problem before we did that though.)

I'm also looking a bit at https://salmon.readthedocs.io/en/latest/library_type.html, and wondering if MSF might actually be the read format we want for 10xv2 5prime? I'm probably getting myself confused. Sort of curious what alevin would choose if given auto as an option. We could also try IU for no strand preference on the 5' read?

rob-p commented 3 years ago

So, I haven’t had a chance to look into this in any depth yet, but one thing worth noting is that, in the alevin-fry pipeline, the specification of the library type is mostly immaterial. This is because after discussion with @k3yavi, we made the decision to have the RAD file contain reads mapping in both orientations, and to leave orientation filtering to the generate-permit-list step. So, the filtering by read orientation should happen in the fry step.

@jashapiro — I’m curious about the hypothesis regarding reverse complementing of the barcodes. Is it the case that the read containing the barcode / UMI in this protocol isn’t in a prescribed orientation? For processing 5’ data, I know that currently we are not making use of the biological sequence on the barcode / UMI read (though this is on our radar), but I was under the impression that the barcode / UMI extraction was otherwise basically the same. Is there a reason this should not be the case?

rob-p commented 3 years ago

@allyhawkins & @jashapiro : one more thought I had — what does it look like if you use —sketch mode here instead. While selective alignment provides higher precision, it does not tolerate a lot of divergence (at least with the default parameters in alevin). If the 5’ reads are often softclipped, this could lead to a lower alignment rate in this data using selective alignment.

jashapiro commented 3 years ago

@jashapiro — I’m curious about the hypothesis regarding reverse complementing of the barcodes. Is it the case that the read containing the barcode / UMI in this protocol isn’t in a prescribed orientation? For processing 5’ data, I know that currently we are not making use of the biological sequence on the barcode / UMI read (though this is on our radar), but I was under the impression that the barcode / UMI extraction was otherwise basically the same. Is there a reason this should not be the case?

Yeah, that seems to have been my mistake. I was thinking that if the mapping was strand-specific, then cell barcode mapping might be too. Some of our comparisons between mappers start by filtering to shared barcodes, so if that was the problem we could end up filtering out many of the "real" cells. But it doesn't look like that was the issue.

So, I haven’t had a chance to look into this in any depth yet, but one thing worth noting is that, in the alevin-fry pipeline, the specification of the library type is mostly immaterial. This is because after discussion with @k3yavi, we made the decision to have the RAD file contain reads mapping in both orientations, and to leave orientation filtering to the generate-permit-list step. So, the filtering by read orientation should happen in the fry step.

Ah! We had missed (or forgotten about) this! We didn't change the orientation! @allyhawkins: we need to update this line to either rc or both https://github.com/AlexsLemonade/alsf-scpca/blob/2011c305ce80ba8afe160a35cf58200dd9024aa1/workflows/alevin-quant/run-alevin-fry.nf#L112

allyhawkins commented 3 years ago

@rob-p Thank you so much for the help on this one! Changing from using the fw option to rc for the read orientation while leaving the library type alone seemed to do the trick!

@jashapiro I went ahead and updated the notebook to include the samples run using rc and have removed the knee filtering. In doing so, I did choose to keep the comparisons where I looked at all cells without restricting to shared cells to show how similar the distributions are even before restricting to shared cells. The only caveat here is that it looks like Alevin-fry may be reporting some of the counts and number of genes/cell a bit higher than Cell Ranger, but that seems to decrease a little bit when restricting the analysis to only genes found in > 5% of cells. I am not really that concerned about these differences and feel pretty confident that we should be able to implement use of this workflow for our 5' samples.

AlexsLemonade / alsf-scpca

Comparison of Alevin-fry to Cellranger for 5' libraries #137