Closed hisplan closed 5 years ago
Hello @hisplan
I see that I made some changes that didn't pass running tests and still pushed them. I guess I needed to work on different systems and wanted the latest changes.
I'm gonna work on performance and fixes tomorrow. I'll push some changes by then. Let me know if you see improvements then.
You mentioned 1.4.2 being too slow, would you mind telling me which part was problematic for you?
I'm passing the 10x v2 whitelist and setting --expected_cells=0
. I suspect that CB correction part is slow?
Yes, that's the exact issue I want to fix. Trying to not go towards creating an index for it and this is not trivial.
Another quick fix for you would be to use the filtered list from 10x v2. It will go way faster.
Have you had a chance to work on this issue? I've tried 10x v3 whitelist (millions of barcodes;;) with --expected_cells=0
, and it's been running for more than 3 days...
Hey @hisplan
One quick thing, if you're running a v3 chermistry, I would suggest a cell barcode correction of 0 --bc_collapsing_dist 0
Some barcodes have only on hamming distance between them, so collapsing would be a mistake.
I suspect they have some control code in there because they talk about 3M barcodes but the whitelist has 6M.
@hisplan On the performance part, I haven't been able to get a great increase in speed. One more thing I have to test is to create an index and use this instead of a tree for barcode recovery.
This might take a lot of memory though. You can try this branch. I added some filtering of low content cells to speed up things.
Hi @Hoohm, thanks for your quick reply.
I also had a chance to talk to 10x a few months ago about the whitelist for 10x v3 chemistry. As far as I understand, the whitelist has 6.8M barcodes, but the half of it is for feature barcoding. And I was told that by design there is a mismatch of 2bp between GEX library and feature barcode. I asked if I could get separated lists for GEX and feature barcode libraries, but they said no;;;; I don't know why, but if you could get it, please do let me know :-)
Anyway, given this, would you still suggest --bc_collapsing_dist=0
? I've been running with the value 1
, but if I recall correctly, I did get a message something like this:
Testing cell barcode collapsing threshold of 1
Value is too high, reducing it by 1
Testing cell barcode collapsing threshold of 0
Using 0 for cell barcode collapsing threshold
Interesting. I'm gonna try to get the info as well. The response you got is strange because the mapping between cell barcodes for mRNA and ADTs is here and it has 6M lines... I'm very confused :D
Hmm, the branch I'm working on should skip the cell barcode correction completely if the number is at 0 whereas the older ones don't skip it properly.
Definitely try the brach with this modification. It also stops correcting unmapped UMIs and low UMI cells (1 or 0 UMI for a tag). Both of these filters reduce computation time. It's also reported in the report.
I didn't fully understand it either. But that was pretty much the response I got from 10x. And yeah, I'm using the exact same file. The filename suggests 3M, but it actually contains 6M (3M for GEX, 3M for feature barcoding).
Anyway, mainly I've been using the official 1.4.2 and the develop
branch, but I will try feature/index_whitelist
.
I'm having two problems with the feature/index_whitelist
branch.
When I tried to load the matrix file using Seurat 3's Read10X
function, it threw this exception:
Error in dimnamesGets(x, value) :
invalid dimnames given for “dgTMatrix” object
Using the --dense
parameter threw this exception:
Writing dense format output
File "/home/ubuntu/miniconda/envs/feature-index-whitelist/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1653, in create_block_manager_from_blocks
mgr = BlockManager(blocks, axes)
File "/home/ubuntu/miniconda/envs/feature-index-whitelist/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 114, in __init__
self._verify_integrity()
File "/home/ubuntu/miniconda/envs/feature-index-whitelist/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 311, in _verify_integrity
construction_error(tot_items, block.shape[1:], self.axes)
File "/home/ubuntu/miniconda/envs/feature-index-whitelist/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1691, in construction_error
passed, implied))
ValueError: Shape of passed values is (5, 819821), indices imply (4, 819821)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/miniconda/envs/feature-index-whitelist/bin/CITE-seq-Count", line 10, in <module>
sys.exit(main())
File "/home/ubuntu/miniconda/envs/feature-index-whitelist/lib/python3.6/site-packages/cite_seq_count/__main__.py", line 483, in main
filename='dense_umis.tsv')
File "/home/ubuntu/miniconda/envs/feature-index-whitelist/lib/python3.6/site-packages/cite_seq_count/io.py", line 48, in write_dense
pandas_dense = pd.DataFrame(sparse_matrix.todense(), columns=columns, index=index)
File "/home/ubuntu/miniconda/envs/feature-index-whitelist/lib/python3.6/site-packages/pandas/core/frame.py", line 424, in __init__
copy=copy)
File "/home/ubuntu/miniconda/envs/feature-index-whitelist/lib/python3.6/site-packages/pandas/core/internals/construction.py", line 167, in init_ndarray
return create_block_manager_from_blocks([values], [columns, index])
File "/home/ubuntu/miniconda/envs/feature-index-whitelist/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1660, in create_block_manager_from_blocks
construction_error(tot_items, blocks[0].shape[1:], axes, e)
File "/home/ubuntu/miniconda/envs/feature-index-whitelist/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1691, in construction_error
passed, implied))
ValueError: Shape of passed values is (5, 819821), indices imply (4, 819821)
Hello @hisplan where are you on this? Have you been able to run it? I'm wrapping things up for a new relase.
With the entire 10x v3 barcode whitelist, it’s just taking forever (waited a couple of days but had to kill it eventually).
The workaround that I implemented in our pipeline is:
1.4.2-develop
branch.This workaround finishes in a reasonable time, but I would love to try out the new release.
Great! Don't use the V3 whitelist, it's more than double the number of barcodes. For V3 I would suggest not using the full one and just running without it. Another important thing, the cells you will find in the Protein data with CSC is not going to match the RNA cells. You have to use the mapping they provide on their github page.
I feel bit uncomfortable running without the whitelist. The workaround we have now seems fine.
Anyway, which mapping file are you referring to?
You have to map the whitelist you have from the RNA cell barcodes to the Protein cell barcodes. You can use the file I linked here to do that.
If you use the whitelist from the RNA directly for the protein, you are actually getting RNA from a cell and Protein data from another.
Oh, I’m doing this for hashtag. The scRNA-seq count matrix that I mentioned previously has the error-corrected barcodes which were generated using that list from 10x GitHub repo.
I'm not sure I understand. Which error-corrected barcodes are you talking about? The cell barcode of the TAGS barcode?
The cell barcodes in the scRNA-seq count matrix are already error-corrected. I'm feeding this to CITE-seq-Count via --whitelist
(instead of using the whole 10x v3 whitelist). Since CITE-seq-count is working with a smaller set of whitelisted barcodes, the running time was reasonable for me.
By the way, you can close this ticket. I will try out the new release once it's out.
Yes, that is a sound strategy to use.
If you use only the RNA part of V3 and adding your own cell hashing protocol, all good. If you use REAP seq, protein data from 10x V3, you need to use the mapping between RNA cell barcodes and the Protein cell barcodes.
I'll close the ticket then :)
Due to v1.4.2 being too slow, I tried the
develop
branch (542ec344fe26b1cbbd72194c6ed7b809a82de19b), as you suggested in the other thread at some point, but it failed with the following error:Any idea?