-cells/--expected_cells still required when -wl provided?

bbimber commented 3 years ago

This isn't a huge issue, but I gather you're near a release so I thought I'd bring it up. As far as I can tell, the tool requires the user to enter "-cells" even when a whitelist (-wl) is provided. I know there's a workaround of entering "-cells 0", but that's not especially intuitive, and ideally one would not need to supply -cells when -wl is used. Not a huge problem, but I thought I'd mention it.

Hoohm commented 3 years ago

Hey Ben, thanks for the issue.

I've looked at this in the past as well and here is the problem I'm faced with.

Here are two ways I looked at it:
A. Ask the user for the number of cells or a whitelist. Use the size of the whitelist when one is provided.
B. Ask the user for the number of cells and the whitelist (if one exists).

If the user provides the whitelist as a "short reference" coming from the subsetted mRNA data, (A) makes the most sense. It requires one less input from the user and the data comes out as you need it. (B) still works, but the user needs to add one more variable.

If the user provides the "full whitelist", as you might do for 10xv3, (A) will catch a lot of cells we don't really want, with really low UMI counts, etc. Since I'm not doing any filtering in CITE-seq-Count so that the user is in control, we might end up in a place where we process and correct cell barcodes and UMIs of cells we don't care about. (B) solves this issue by having the user filter the data themselves.

I had an option C in mind where the user would provide a "full whitelist" plus either a cell number or a short subset of cells as a list. But this option means more explanation and more potential for misunderstanding, and I'd like to keep the number of questions down if it comes down to bad design.

For now, I would rather keep option B, but let me know if you have an idea I might have missed.
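To make the difference concrete, here is a minimal Python sketch of how options A and B would behave with a full whitelist. The function names and the toy counts are hypothetical and this is not CITE-seq-Count's actual code; it only assumes that barcodes are ranked by UMI count.

```python
from collections import Counter


def select_option_a(umi_counts, whitelist):
    """Option A (sketch): keep every observed barcode found in the whitelist."""
    return {bc for bc in umi_counts if bc in whitelist}


def select_option_b(umi_counts, whitelist, expected_cells):
    """Option B (sketch): keep the top `expected_cells` whitelisted barcodes by UMI count."""
    allowed = Counter({bc: n for bc, n in umi_counts.items() if bc in whitelist})
    return {bc for bc, _ in allowed.most_common(expected_cells)}


# Toy data: two real cells and two barcodes with very low UMI counts.
umi_counts = {"AAAC": 500, "AAAG": 450, "AAAT": 3, "CCCC": 2}
full_whitelist = {"AAAC", "AAAG", "AAAT", "CCCC", "GGGG"}

kept_a = select_option_a(umi_counts, full_whitelist)
kept_b = select_option_b(umi_counts, full_whitelist, expected_cells=2)
```

With a full whitelist, the option A sketch keeps the two low-count barcodes as well, while the option B sketch drops them because the user asked for two cells.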

bbimber commented 3 years ago

Interesting - I didn't realize that's how whitelist and --cells interacted. I was under the impression that providing a whitelist was like using option A (hence why it seemed odd to need to provide '--cells=0' when the size could be inferred from the whitelist).

One usage that might be useful would be to treat the whitelist as barcodes to always include, but then also include additional high-UMI-count cells (governed by --cells). Our usage might explain that:

We are using CITE-seq-Count as basically a first pass. We do filtering and QC downstream in R (currently our cellhashR package), which filters cells by min count. Having CITE-seq-Count provide an input count matrix with all requested cells is useful, since the difference between whitelisted and passing (i.e. above min counts) cells can tell you something about the library.

Our primary use-case is to provide a whitelist generated using either: 1) the whitelist of cell barcodes with passing GEX data, or 2) the whitelist of barcodes with passing TCR data (which is not necessarily identical to the passing GEX). In each case, we often want maximal calling rate, and we might want to take steps to recover cells with low cell hashing counts. I have no real problems that I know about with the current behavior of providing CITE-seq-Count a whitelist and it producing a matrix with those cells.

The theoretical reason an output that is the union of whitelist and high-UMI-count cells could be advantageous is that in the case of TCR, the TCR whitelist might actually be a subset of true cells. For example, if you sequence PBMC, only ~15% of cells might actually be T-cells. I could probably solve this myself by giving CITE-seq-Count a whitelist that is actually the union of GEX and TCR cell barcodes. This is not something CITE-seq-Count needs to do itself.
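For illustration, both ideas fit in a few lines of Python. This is only a sketch on my side: the helper names and the toy barcodes are made up, and nothing here is part of the tool itself.

```python
from collections import Counter


def merge_whitelists(*barcode_lists):
    """Union of several barcode lists, ignoring blank lines and surrounding whitespace."""
    merged = set()
    for barcodes in barcode_lists:
        merged.update(bc.strip() for bc in barcodes if bc.strip())
    return sorted(merged)


def whitelist_plus_top_cells(umi_counts, whitelist, n_top):
    """Always keep whitelisted barcodes, then add the `n_top` barcodes with the most UMIs."""
    top = {bc for bc, _ in Counter(umi_counts).most_common(n_top)}
    return set(whitelist) | top


# Hypothetical barcodes: GEX-passing and TCR-passing lists overlap only partly.
gex_barcodes = ["GEX1", "GEX2"]
tcr_barcodes = ["TCR1", "GEX1", ""]
combined = merge_whitelists(gex_barcodes, tcr_barcodes)

umi_counts = {"GEX1": 900, "TCR1": 40, "GEX2": 700, "LOW": 1}
kept = whitelist_plus_top_cells(umi_counts, whitelist={"TCR1"}, n_top=2)
```

The first helper is the workaround I can do myself before calling the tool; the second is the "whitelist plus high-UMI cells" behavior described above.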

Anyway, that's a long explanation. The core thing I was asking is that it seems unnecessary for CITE-seq-Count to force the user to enter '--cells=0' when a whitelist is provided. It seems that if a whitelist is provided and --cells is not, the tool should do something like default to the length of the whitelist. I don't have any real problems with the actual performance of CITE-seq-Count as-is.
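Roughly what I have in mind, as a hedged argparse sketch. The -cells/--expected_cells and -wl names come from this thread; the --whitelist long form, the parser itself, and resolve_expected_cells are assumptions for illustration, not the tool's real code.

```python
import argparse


def build_parser():
    """Hypothetical parser where -cells is optional instead of required."""
    parser = argparse.ArgumentParser(prog="csc-sketch")
    parser.add_argument("-cells", "--expected_cells", type=int, default=None)
    parser.add_argument("-wl", "--whitelist", default=None)
    return parser


def resolve_expected_cells(expected_cells, whitelist_barcodes):
    """Use the explicit value if given, else fall back to the whitelist size."""
    if expected_cells is not None:
        return expected_cells
    if whitelist_barcodes is not None:
        return len(whitelist_barcodes)
    raise ValueError("provide -cells/--expected_cells or a whitelist (-wl)")


args = build_parser().parse_args(["-wl", "barcodes.txt"])
# In a real run the barcodes would be read from args.whitelist; a toy set is used here.
n_cells = resolve_expected_cells(args.expected_cells, {"AAAC", "AAAG", "AAAT"})
```

An explicit value still wins, so existing command lines that pass a cell count alongside a whitelist would behave as before.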

Hoohm commented 3 years ago

There is another person who tested the new version and prefers the old paradigm.

That's a high return rate over the few people who tried the new version.

I guess I should think about it.

The next step for CSC is to use a centralized chemistry definition, which will hold the full reference list.

The main idea is that the user would only have to pass --chemistry 10xV3, and the tool would download the reference lists, know the barcode positions, etc.
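As a rough sketch of what such a definition could hold (the class, field names, and lookup function are hypothetical and not the final design; the 16-base cell barcode, 12-base UMI, and whitelist file name are the usual 10x v3 values):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chemistry:
    """Hypothetical chemistry record; positions are 1-based and inclusive on read 1."""
    name: str
    cell_barcode_start: int
    cell_barcode_end: int
    umi_start: int
    umi_end: int
    whitelist_file: str


CHEMISTRIES = {
    "10xV3": Chemistry("10xV3", 1, 16, 17, 28, "3M-february-2018.txt"),
}


def get_chemistry(name):
    """Look up a chemistry by name, failing clearly for unknown names."""
    try:
        return CHEMISTRIES[name]
    except KeyError:
        raise ValueError(f"unknown chemistry: {name}") from None


def split_read1(sequence, chemistry):
    """Cut the cell barcode and UMI out of a read 1 sequence."""
    barcode = sequence[chemistry.cell_barcode_start - 1:chemistry.cell_barcode_end]
    umi = sequence[chemistry.umi_start - 1:chemistry.umi_end]
    return barcode, umi


chem = get_chemistry("10xV3")
barcode, umi = split_read1("A" * 16 + "C" * 12 + "TTTT", chem)
```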

I could change this a bit by keeping the old way of asking for the "short" reference list, but adding a new argument for the full reference list.

I'm just afraid this could get confusing to users.

bbimber commented 3 years ago

On all these points, we're pretty flexible. We'll adapt to a new argument paradigm in a new version or keep things as-is.

The only thing I intended to bring up on this thread was what seemed like redundancy in CSC requiring --cells=0 when a whitelist is provided. I now see it's more complicated.

Hoohm / CITE-seq-Count

-cells/--expected_cells still required when -wl provided? #142