Use of --explicit-pl vs --unfiltered-pl to get equivalent of raw_feature_bc_matrix.h5

wmacnair commented 1 year ago

Hi

I'm looking for a bit of guidance on proper use of --explicit-pl or --unfiltered-pl. I would like to run CellBender on my simpleaf-mapped counts data, and to do that I need a matrix with lots of empty barcodes.

I first tried with --unfiltered-pl and the 10x whitelist here. However, this produced a matrix with ~6k barcodes. In the raw CellRanger outputs (raw_feature_bc_matrix.h5), we typically get more like 6M barcodes.

I then tried with --explicit-pl and the same whitelist, and this time alevin returned a bit over 100k barcodes. However I am now worried that these are noisy...

This leads to a couple of questions:

Am I right that --unfiltered-pl does some checking of whether a barcode might be a sequencing error, while --explicit-pl does not?
So I guess that also means that CellRanger doesn't attempt to correct this? And as a result has a much longer tail of empties?

Or do I have this bit wrong? 😅

There are multiple alevin-fry GitHub issues with users asking questions in this direction, I think most commonly with a view to getting inputs for CellBender / EmptyDrops (#47, #71, #74, #113). So (if and when you have time!) it could be helpful to have a definitive answer to this in the main documentation. Possibly even a --raw flag?

I guess a tricky aspect is that tools like CellBender are typically written with CellRanger inputs in mind, and you are trying to improve on CellRanger... Perhaps a project for a master's student to see if careful selection of parameters can get simpleaf and CellBender to play nicely together?

Thanks for all your work, and in particular the recent streamlining into simpleaf!

Best Will

rob-p commented 1 year ago

Hi @wmacnair,

I'll draft a more in-depth explanation shortly. In the meantime, --unfiltered-pl is what you are looking for. One distinction from CellRanger is that alevin-fry has a --min-reads parameter that always filters out cells with fewer than that number of reads. The default is 10, and you can change it if you want. In the past, we have interacted with the CellBender folks and were told that this would almost certainly not be a problem, as our default cutoff of 10 is very permissive.
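
To make the effect of the cutoff concrete, here is a minimal sketch of a read-count threshold applied to a toy table of barcodes. It is not alevin-fry's implementation: the function name `apply_min_reads`, the barcode strings, and the counts are all made up for illustration. Only the rule (drop anything with fewer than `min_reads` reads, default 10) comes from the description above.

```python
from typing import Dict


def apply_min_reads(read_counts: Dict[str, int], min_reads: int = 10) -> Dict[str, int]:
    """Keep barcodes whose read count is at least `min_reads`.

    Hypothetical helper for illustration only; it mirrors the described
    behaviour (barcodes with fewer than `min_reads` reads are dropped).
    """
    return {bc: n for bc, n in read_counts.items() if n >= min_reads}


# Toy data: made-up barcodes and read counts.
counts = {"AAAC": 5200, "AAAG": 37, "AAAT": 10, "AACA": 9, "AACC": 1}

kept = apply_min_reads(counts)
print(sorted(kept))  # ['AAAC', 'AAAG', 'AAAT']
```

In this toy case the two barcodes with 9 and 1 reads disappear, so the very bottom of the barcode distribution is absent from the output even though nothing else has been filtered.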

Best, Rob

wmacnair commented 1 year ago

Ok, thanks.

I think sometimes it can be helpful to have the full set of barcodes to see the full knee plot curve. I don't think the barcodes at the bottom are ever used in CellBender, but their library sizes are used by CellBender to determine where the prior on the library size for empties should go.
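
As a rough illustration of what the knee plot is built from, the sketch below ranks barcodes by library size. The function name `knee_curve` and the toy totals are hypothetical, and this is not how CellBender itself computes its prior; it only shows the rank-versus-size curve that one would usually draw on log-log axes.

```python
from typing import Dict, List, Tuple


def knee_curve(library_sizes: Dict[str, int]) -> List[Tuple[int, int]]:
    """Return (rank, library size) pairs, with rank 1 for the largest barcode.

    Hypothetical helper for illustration; plotting these pairs on
    log-log axes gives the usual knee plot.
    """
    ordered = sorted(library_sizes.values(), reverse=True)
    return list(enumerate(ordered, start=1))


# Toy data: two cell-like barcodes followed by a tail of near-empty ones.
totals = {"bc1": 8000, "bc2": 7500, "bc3": 40, "bc4": 12, "bc5": 3, "bc6": 1}

for rank, size in knee_curve(totals):
    print(rank, size)
```

If barcodes below a read cutoff are removed before this step, the last points of the curve (here `bc5` and `bc6`) would simply be missing, which is the part of the tail being discussed.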

(I also get the feeling that as CellBender is used more widely, they are coming across applications to datasets which don't fit the assumptions of CellBender so well, e.g. clinical single nuclei samples with substantial contamination.)

I'm now trying simpleaf on some different data that we know more about. Hopefully there the --unfiltered-pl approach will work nicely.

Cheers Will

wmacnair commented 1 year ago

It turned out that this dataset included some files with very little RNA indeed. This made it look like simpleaf was not finding any "empty" droplets using --unfiltered-pl, but actually the problem was that there were barely any droplets to find, full stop.

I've since run it on all the samples, using the 10x barcode whitelist (here) and it works fine. Thanks!

Will

rob-p commented 1 year ago

That’s great to hear. Thanks for reporting back and closing the issue. Let us know if you have any questions or suggestions in the future.

COMBINE-lab / simpleaf

Use of --explicit-pl vs --unfiltered-pl to get equivalent of raw_feature_bc_matrix.h5 #92