Closed wmacnair closed 1 year ago
Hi @wmacnair,
I'll draft a more in-depth explanation shortly. In the meantime, --unfiltered-pl is what you are looking for. One distinction from CellRanger is that alevin-fry has a --min-reads parameter that always filters out cells with fewer than that number of reads. By default it is 10, and you can change it if you want. In the past, we have interacted with the CellBender folks and were told that this would almost certainly not be a problem, as our default cutoff of 10 is very permissive.
Best, Rob
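To make the effect of a minimum-read cutoff concrete, here is a toy Python sketch. It is not alevin-fry's implementation; the function name `filter_by_min_reads` and the example barcodes are made up for illustration, and the only thing taken from the comment above is the rule that barcodes with fewer than the cutoff (default 10) reads are dropped.

```python
from collections import Counter

def filter_by_min_reads(read_barcodes, min_reads=10):
    """Keep barcodes seen in at least `min_reads` reads (hypothetical helper)."""
    counts = Counter(read_barcodes)
    # Barcodes with fewer than `min_reads` reads are removed.
    return {bc: n for bc, n in counts.items() if n >= min_reads}

# Toy input: one barcode entry per read (made-up barcodes).
reads = ["AAAC"] * 25 + ["CCGT"] * 10 + ["GGTA"] * 3

kept_default = filter_by_min_reads(reads)               # default cutoff of 10
kept_all = filter_by_min_reads(reads, min_reads=1)      # most permissive cutoff
print(kept_default)
print(len(kept_all))
```

With the default cutoff the 3-read barcode is dropped, while lowering the cutoff to 1 keeps every observed barcode, which is the behaviour one would want when the low-count tail matters.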
Ok, thanks.
I think sometimes it can be helpful to have the full set of barcodes to see the full knee plot curve. I don't think the barcodes at the bottom are ever used in CellBender, but their library sizes are used by CellBender to determine where the prior on the library size for empties should go.
(I also get the feeling that as CellBender is used more widely, they are coming across applications to datasets which don't fit the assumptions of CellBender so well, e.g. clinical single nuclei samples with substantial contamination.)
I'm now trying simpleaf on some different data that we know more about. Hopefully there the --unfiltered-pl approach will work nicely.
Cheers Will
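For the knee plot mentioned above, a minimal Python sketch of how the barcode-rank curve could be computed from per-barcode library sizes. The function name `barcode_rank_curve` and the toy counts are hypothetical; in practice the library sizes would come from the column or row sums of the unfiltered count matrix.

```python
def barcode_rank_curve(library_sizes):
    """Return (ranks, counts) for a knee plot (hypothetical helper).

    Barcodes are sorted by decreasing library size; zero-count barcodes
    are dropped because they cannot be shown on a log-scaled axis.
    """
    counts = sorted((c for c in library_sizes if c > 0), reverse=True)
    ranks = list(range(1, len(counts) + 1))
    return ranks, counts

# Toy library sizes: two cell-like barcodes and a tail of near-empty ones.
ranks, counts = barcode_rank_curve([5, 0, 1200, 3, 800, 1, 1])
print(list(zip(ranks, counts)))
```

Plotting `counts` against `ranks` on log-log axes gives the usual knee plot; the low-count tail on the right is the part that is lost when empty barcodes are filtered out before the matrix is written.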
It turned out that this dataset included some files with very little RNA indeed. This made it look like simpleaf was not finding any "empty" droplets using --unfiltered-pl, but actually the problem was that there were barely any droplets to find, full stop.
I've since run it on all the samples, using the 10x barcode whitelist (here) and it works fine. Thanks!
Will
That’s great to hear. Thanks for reporting back and closing the issue. Let us know if you have any questions or suggestions in the future.
Hi
I'm looking for a bit of guidance on proper use of --explicit-pl or --unfiltered-pl. I would like to run CellBender on my simpleaf-mapped counts data, and to do that I need a matrix with lots of empty barcodes.
I first tried with --unfiltered-pl and the 10x whitelist here. However, this produced a matrix with ~6k barcodes. In the raw cellranger outputs (_raw_feature_countmatrix.h5), we typically get more like 6M barcodes.
I then tried with --explicit-pl and the same whitelist, and this time alevin returned a bit over 100k barcodes. However I am now worried that these are noisy...
This leads to a couple of questions:
--unfiltered-pl does some checking of whether a barcode might be a sequencing error, while --explicit-pl does not?
CellRanger doesn't attempt to correct this? And as a result has a much longer tail of empties?
Or do I have this bit wrong? 😅
There are multiple alevin-fry GitHub issues with users asking questions in this direction, I think most commonly with a view to getting inputs for CellBender / EmptyDrops (#47, #71, #74, #113). So (if and when you have time!) it could be helpful to have a definitive answer to this in the main documentation. Possibly even a --raw flag?
I guess a tricky aspect is that tools like CellBender are typically written with CellRanger inputs in mind, and you are trying to improve on CellRanger... Perhaps a project for a master's student to see if careful selection of parameters can get simpleaf and CellBender to play nicely together?
Thanks for all your work, and in particular the recent streamlining into simpleaf!
Best Will