Hoohm / CITE-seq-Count

A tool that allows to get UMI counts from a single cell protein assay
https://hoohm.github.io/CITE-seq-Count/
MIT License
78 stars 44 forks source link

local variable 'bcs_corrected' referenced before assignment #57

Closed hisplan closed 5 years ago

hisplan commented 5 years ago

Due to v1.4.2 being too slow, I tried the develop branch (542ec344fe26b1cbbd72194c6ed7b809a82de19b), as you suggested in the other thread at some point, but it failed with the following error:

Loading whitelist
Creating barcode tree for whitelist
Counting number of reads
Started mapping
Processing 72 reads
CITE-seq-Count is running with one core.
Mapping done for process 10349. Processed 8 reads
Mapping done
Counter({'TCCCAAGCATTAAGCT': 2, 'CGCGGAAGAGCACACG': 2, 'CCTTTCTGAAAGGAGA': 2, 'ACTTGCCACCAAGTCC': 2, 'ACAATTTAGACATGGC': 2, 'ACGGGGAAGGACGTCA': 2, 'CGAATATCCTTAAGAG': 1, 'GCATAAAGTGCACCGC': 1, 'AAGCACATCACCTTGA': 1, 'ATCGGAAGAGCACACG': 1, 'TTAGCATCAACAGGCC': 1, 'CAACCACTAATAGGTA': 1, 'CCTCTAATCGGTCGTC': 1, 'CCCGGCGTACGGGGAA': 1, 'CCATTAATAATGTTTT': 1, 'CGTGAATTCTGAGGCC': 1, 'CCTTTACCAGCTTTAG': 1, 'TTCCATTCTTTAGCTC': 1, 'ATAAATCACCTCACTT': 1, 'AGGCCGTCCGATCTAG': 1, 'AAGTTGCCATACAAAA': 1, 'ATTCAAACGGCCTGTC': 1, 'TAAACGCAAGCCTCAA': 1, 'GCAAAAAATTTAGGGT': 1, 'GGTGGTCTATAGTGTT': 1, 'GCGCGATTCGATCTGC': 1, 'TAGCAAGGCCACGACG': 1, 'TTCGGGAGGGTAGTCG': 1, 'CGTTTGGTCAGTTCCA': 1, 'TTTTCTTCTGCGTCAG': 1, 'CAGTAGACTCCTTCTG': 1, 'GATGTGGTAGAAGTCG': 1, 'GGCTGCGGACGACCAG': 1, 'ATAGCAAAGCCTCTAC': 1, 'GGCGCATAACGATACC': 1, 'GACCAATCTGACCAGC': 1, 'ACGTATTTAGCCACAT': 1, 'AAACGTCGGCTACAGT': 1, 'TGCCCTACTTGCCCTA': 1, 'CTTGCTGCTAAAGGTC': 1, 'CTCGGAGGAGCACACG': 1, 'GTGAGTTGTTCCATTC': 1, 'GAGTCTACACAGTGTT': 1, 'TACTGCTTGTTTACGA': 1, 'AATTCATCCATTAACT': 1, 'TAAGAGACCATCTTAA': 1, 'TCATAAGAGGTTTTAC': 1, 'GATCGAAGAGCACACG': 1, 'CGAGCAGTAGACTCCT': 1, 'CGCATTGCATTCATCA': 1, 'AGATTGAGGCTGGGAA': 1, 'AGAACGTGAAAAAGCG': 1, 'TCTGATTGTCCAGTTG': 1, 'AACGTACCTTCAAGAA': 1, 'AAGGTTCCCGATCTAA': 1, 'AATCCGACCAATCCCA': 1, 'GTACCTCGCAACGGCT': 1, 'GCCGATACTTGGAACA': 1})
Correcting umis
Traceback (most recent call last):
  File "/home/ubuntu/miniconda/bin/CITE-seq-Count", line 11, in <module>
    sys.exit(main())
  File "/home/ubuntu/miniconda/lib/python3.6/site-packages/cite_seq_count/__main__.py", line 474, in main
    bcs_corrected=bcs_corrected,
UnboundLocalError: local variable 'bcs_corrected' referenced before assignment

Any idea?

Hoohm commented 5 years ago

Hello @hisplan

I see that I made some changes that didn't pass running tests and still pushed them. I guess I needed to work on different systems and wanted the latest changes.

I'm gonna work on performance and fixes tomorrow. I'll push some changes by then. Let me know if you see improvements then.

You mentioned 1.4.2 being too slow, would you mind telling me which part was problematic for you?

hisplan commented 5 years ago

I'm passing the 10x v2 whitelist and setting --expected_cells=0. I suspect that CB correction part is slow?

Hoohm commented 5 years ago

Yes, that's the exact issue I want to fix. Trying to not go towards creating an index for it and this is not trivial.

Another quick fix for you would be to use the filtered list from 10x v2. It will go way faster.

hisplan commented 5 years ago

Have you had a chance to work on this issue? I've tried 10x v3 whitelist (millions of barcodes;;) with --expected_cells=0, and it's been running for more than 3 days...

Hoohm commented 5 years ago

Hey @hisplan One quick thing, if you're running a v3 chermistry, I would suggest a cell barcode correction of 0 --bc_collapsing_dist 0 Some barcodes have only on hamming distance between them, so collapsing would be a mistake.

I suspect they have some control code in there because they talk about 3M barcodes but the whitelist has 6M.

Hoohm commented 5 years ago

@hisplan On the performance part, I haven't been able to get a great increase in speed. One more thing I have to test is to create an index and use this instead of a tree for barcode recovery.

This might take a lot of memory though. You can try this branch. I added some filtering of low content cells to speed up things.

hisplan commented 5 years ago

Hi @Hoohm, thanks for your quick reply.

I also had a chance to talk to 10x a few months ago about the whitelist for 10x v3 chemistry. As far as I understand, the whitelist has 6.8M barcodes, but the half of it is for feature barcoding. And I was told that by design there is a mismatch of 2bp between GEX library and feature barcode. I asked if I could get separated lists for GEX and feature barcode libraries, but they said no;;;; I don't know why, but if you could get it, please do let me know :-)

Anyway, given this, would you still suggest --bc_collapsing_dist=0? I've been running with the value 1, but if I recall correctly, I did get a message something like this:

Testing cell barcode collapsing threshold of 1
Value is too high, reducing it by 1
Testing cell barcode collapsing threshold of 0
Using 0 for cell barcode collapsing threshold
Hoohm commented 5 years ago

Interesting. I'm gonna try to get the info as well. The response you got is strange because the mapping between cell barcodes for mRNA and ADTs is here and it has 6M lines... I'm very confused :D

Hmm, the branch I'm working on should skip the cell barcode correction completely if the number is at 0 whereas the older ones don't skip it properly.

Definitely try the brach with this modification. It also stops correcting unmapped UMIs and low UMI cells (1 or 0 UMI for a tag). Both of these filters reduce computation time. It's also reported in the report.

hisplan commented 5 years ago

I didn't fully understand it either. But that was pretty much the response I got from 10x. And yeah, I'm using the exact same file. The filename suggests 3M, but it actually contains 6M (3M for GEX, 3M for feature barcoding).

Anyway, mainly I've been using the official 1.4.2 and the develop branch, but I will try feature/index_whitelist.

hisplan commented 5 years ago

I'm having two problems with the feature/index_whitelist branch.

Problem 1

When I tried to load the matrix file using Seurat 3's Read10X function, it threw this exception:

Error in dimnamesGets(x, value) : 
  invalid dimnames given for “dgTMatrix” object

Problem 2

Using the --dense parameter threw this exception:

Writing dense format output
  File "/home/ubuntu/miniconda/envs/feature-index-whitelist/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1653, in create_block_manager_from_blocks
    mgr = BlockManager(blocks, axes)
  File "/home/ubuntu/miniconda/envs/feature-index-whitelist/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 114, in __init__
    self._verify_integrity()
  File "/home/ubuntu/miniconda/envs/feature-index-whitelist/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 311, in _verify_integrity
    construction_error(tot_items, block.shape[1:], self.axes)
  File "/home/ubuntu/miniconda/envs/feature-index-whitelist/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1691, in construction_error
    passed, implied))
ValueError: Shape of passed values is (5, 819821), indices imply (4, 819821)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/miniconda/envs/feature-index-whitelist/bin/CITE-seq-Count", line 10, in <module>
    sys.exit(main())
  File "/home/ubuntu/miniconda/envs/feature-index-whitelist/lib/python3.6/site-packages/cite_seq_count/__main__.py", line 483, in main
    filename='dense_umis.tsv')
  File "/home/ubuntu/miniconda/envs/feature-index-whitelist/lib/python3.6/site-packages/cite_seq_count/io.py", line 48, in write_dense
    pandas_dense = pd.DataFrame(sparse_matrix.todense(), columns=columns, index=index)
  File "/home/ubuntu/miniconda/envs/feature-index-whitelist/lib/python3.6/site-packages/pandas/core/frame.py", line 424, in __init__
    copy=copy)
  File "/home/ubuntu/miniconda/envs/feature-index-whitelist/lib/python3.6/site-packages/pandas/core/internals/construction.py", line 167, in init_ndarray
    return create_block_manager_from_blocks([values], [columns, index])
  File "/home/ubuntu/miniconda/envs/feature-index-whitelist/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1660, in create_block_manager_from_blocks
    construction_error(tot_items, blocks[0].shape[1:], axes, e)
  File "/home/ubuntu/miniconda/envs/feature-index-whitelist/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1691, in construction_error
    passed, implied))
ValueError: Shape of passed values is (5, 819821), indices imply (4, 819821)
Hoohm commented 5 years ago

Hello @hisplan where are you on this? Have you been able to run it? I'm wrapping things up for a new relase.

hisplan commented 5 years ago

With the entire 10x v3 barcode whitelist, it’s just taking forever (waited a couple of days but had to kill it eventually).

The workaround that I implemented in our pipeline is:

  1. Uses the 1.4.2-develop branch.
  2. Takes the already corrected barcodes from the scRNA-seq count matrix and uses this as a whitelist.

This workaround finishes in a reasonable time, but I would love to try out the new release.

Hoohm commented 5 years ago

Great! Don't use the V3 whitelist, it's more than double the number of barcodes. For V3 I would suggest not using the full one and just running without it. Another important thing, the cells you will find in the Protein data with CSC is not going to match the RNA cells. You have to use the mapping they provide on their github page.

hisplan commented 5 years ago

I feel bit uncomfortable running without the whitelist. The workaround we have now seems fine.

Anyway, which mapping file are you referring to?

Hoohm commented 5 years ago

This list

You have to map the whitelist you have from the RNA cell barcodes to the Protein cell barcodes. You can use the file I linked here to do that.

If you use the whitelist from the RNA directly for the protein, you are actually getting RNA from a cell and Protein data from another.

hisplan commented 5 years ago

Oh, I’m doing this for hashtag. The scRNA-seq count matrix that I mentioned previously has the error-corrected barcodes which were generated using that list from 10x GitHub repo.

Hoohm commented 5 years ago

I'm not sure I understand. Which error-corrected barcodes are you talking about? The cell barcode of the TAGS barcode?

hisplan commented 5 years ago

The cell barcodes in the scRNA-seq count matrix are already error-corrected. I'm feeding this to CITE-seq-Count via --whitelist (instead of using the whole 10x v3 whitelist). Since CITE-seq-count is working with a smaller set of whitelisted barcodes, the running time was reasonable for me.

By the way, you can close this ticket. I will try out the new release once it's out.

Hoohm commented 5 years ago

Yes, that is a sound strategy to use.

If you use only the RNA part of V3 and adding your own cell hashing protocol, all good. If you use REAP seq, protein data from 10x V3, you need to use the mapping between RNA cell barcodes and the Protein cell barcodes.

I'll close the ticket then :)