Closed allyhawkins closed 3 years ago
Ugh, these do not look good. I suspect that just switching to ISR
may not be sufficient.
I'm worried that alevin may be reverse complementing the cell barcodes, such that they would no longer match the provided list. This wouldn't have been a problem until the introduction of the unfiltered mode, as cell barcodes were only checked within libraries, so RC everything would be fine.
One test would be to run alevin in knee mode, which would allow it to choose its own barcodes. Then we could check if those barcodes match reverse complements of the CR barcodes.
If that works, we could provide a reverse complemented barcode list for 5' samples, which we would then have to reverse after import. Which isn't the worst thing in the world, but does complicate matters.
It is also possible that the ability to specify read geometries added in salmon 1.4 might give us a solution, if we are allowed to use reversed start and finishes. So we might be able to specify
--read-geometry 2[end-1] --bc-geometry 1[1-16] --umi-geometry 1[17-26]
rather than the default
--read-geometry 2[1-end] --bc-geometry 1[1-16] --umi-geometry 1[17-26]
Not sure if that is supported, but we could ask. (I think it would be worth diving a bit deeper to see if we can find if this is the problem before we did that though.)
I'm also looking a bit at https://salmon.readthedocs.io/en/latest/library_type.html, and wondering if MSF might actually be the read format we want for 10xv2 5prime? I'm probably getting myself confused. Sort of curious what alevin would choose if given auto as an option. We could also try IU
for no strand preference on the 5' read?
So, I haven’t had a chance to look into this in any depth yet, but one thing worth noting is that, in the alevin-fry pipeline, the specification of the library type is mostly immaterial. This is because after discussion with @k3yavi, we made the decision to have the RAD file contain reads mapping in both orientations, and to leave orientation filtering to the generate-permit-list
step. So, the filtering by read orientation should happen in the fry step.
@jashapiro — I’m curious about the hypothesis regarding reverse complementing of the barcodes. Is it the case that the read containing the barcode / UMI in this protocol isn’t in a prescribed orientation? For processing 5’ data, I know that currently we are not making use of the biological sequence on the barcode / UMI read (though this is on our radar), but I was under the impression that the barcode / UMI extraction was otherwise basically the same. Is there a reason this should not be the case?
@allyhawkins & @jashapiro : one more thought I had — what does it look like if you use —sketch
mode here instead. While selective alignment provides higher precision, it does not tolerate a lot of divergence (at least with the default parameters in alevin). If the 5’ reads are often softclipped, this could lead to a lower alignment rate in this data using selective alignment.
@jashapiro — I’m curious about the hypothesis regarding reverse complementing of the barcodes. Is it the case that the read containing the barcode / UMI in this protocol isn’t in a prescribed orientation? For processing 5’ data, I know that currently we are not making use of the biological sequence on the barcode / UMI read (though this is on our radar), but I was under the impression that the barcode / UMI extraction was otherwise basically the same. Is there a reason this should not be the case?
Yeah, that seems to have been my mistake. I was thinking that if the mapping was strand-specific, then cell barcode mapping might be too. Some of our comparisons between mappers start by filtering to shared barcodes, so if that was the problem we could end up filtering out many of the "real" cells. But it doesn't look like that was the issue.
So, I haven’t had a chance to look into this in any depth yet, but one thing worth noting is that, in the alevin-fry pipeline, the specification of the library type is mostly immaterial. This is because after discussion with @k3yavi, we made the decision to have the RAD file contain reads mapping in both orientations, and to leave orientation filtering to the
generate-permit-list
step. So, the filtering by read orientation should happen in the fry step.
Ah! We had missed (or forgotten about) this! We didn't change the orientation!
@allyhawkins: we need to update this line to either rc
or both
https://github.com/AlexsLemonade/alsf-scpca/blob/2011c305ce80ba8afe160a35cf58200dd9024aa1/workflows/alevin-quant/run-alevin-fry.nf#L112
@rob-p Thank you so much for the help on this one! Changing from using the fw
option to rc
for the read orientation while leaving the library type alone seemed to do the trick!
@jashapiro I went ahead and updated the notebook to include the samples run using rc
and have removed the knee filtering. In doing so, I did choose to keep the comparisons where I looked at all cells without restricting to shared cells to show how similar the distributions are even before restricting to shared cells. The only caveat here is that it looks like Alevin-fry may be reporting some of the counts and number of genes/cell a bit higher than Cell Ranger, but that seems to decrease a little bit when restricting the analysis to only genes found in > 5% of cells. I am not really that concerned about these differences and feel pretty confident that we should be able to implement use of this workflow for our 5' samples.
Closes #133. This PR adds a comparison of using Cell Ranger to Alevin-fry for pre processing 5' libraries. Here, I am analyzing 2 10Xv2_5prime samples, SCPCR000286 and SCPCR000287, that have been processed with Cell Ranger and Alevin-fry. For Cell Ranger, the orientation is automatically detected based on how many reads are in the antisense direction (this was confirmed for these samples by looking at the
web_summary.html
). For Alevin-fry, I am using the same configurations that we are using for 3', selective alignment with the splici index, unfiltered permit list, and cr-like-em resolution, but with the use of theISF
flag for library type, indicating the orientation of the libary.For this analysis, I generated the per cell and per gene metrics using the
benchmarking_generate_qc_df.R
script which creates thequant_info.tsv
,coldata_df.tsv
, androwdata_df.tsv
used for inputs to the analysis outlined in this notebook. In running that script again, I noticed an error in use ofscpcaTools::import_quant_data()
and had to add in themtx_format
option (this is described in AlexsLemonade/scpcaTools#46). This function is utilized in themake_sce_list.R
benchmarking function and in order to successfully run usingmapply
I had to add in that argument.After generating the necessary qc dataframes needed for plotting, I went ahead and restricted the analysis to only cells that were identified in both tools. I then compared the following:
Interestingly, these results are not as good as we have seen previously and the alevin-fry quantifications are showing much lower UMI/cell and genes detected/cell than with Cell Ranger. Also the correlation coefficients of mean gene expression are in the range of 0.5-0.6, whereas with the 3' data we were seeing coefficients above 0.98. I double checked the log files to ensure that both of these were run with the correct configurations using the
ISF
library type with Alevin-fry, but it appears that these are not as consistent as we would have hoped.One thought I had is that the 5' data doesn't do as well with the splici index or the other configurations that we are using so we can try out other options with Alevin-fry, but I am open to other suggestions or ideas.