line 592 of processing.py gives UnboundLocalError in 1.5.0

Erinke89 commented 3 years ago

Dear @Hoohm, I have been trying to use the feature/barcode_translation branch. I no longer have the problem with --bc_collapsing_dist 0, many thanks for solving that! However, line 592 of processing.py errors out with "UnboundLocalError: local variable 'parallel_results' referenced before assignment". Not sure if correct but I moved the assignment in line 575 parallel_results = [] to the start of the function definition. This gets passed the error but the script is still not working as expected and I have no clue if this is linked to my alteration.

When I run with a reference list, the script finishes without errors but the barcodes.tsv.gz file remains completely empty. Same problem with historic data where -wl used to work for me in v1.4.4. I've tried all kinds of alterations to my reference list but I don't understand what I'm doing wrong. Without a reference list defined, the tool gives reasonable output, but it's not what I need because missing cells won't be accepted when loading the counts into my existing Seurat object.

My command used is: CITE-seq-Count -R1 data.dir/804111_AF10_S2_L002_R1_001.fastq.gz -R2 data.dir/804111_AF10_S2_L002_R2_001.fastq.gz --tags tag_list.csv -cbf 1 -cbl 16 -umif 17 -umil 28 --bc_collapsing_dist 0 --umi_collapsing_dist 1 -rl barcodewhitelist_ch2.csv -n_cells 7523 -T 30 -o ./wl_804111_ch2_AF10.dir

Would much appreciate your help! Many thanks! Erinke

Erinke89 commented 3 years ago

I have solved the issue with the -rl not generating any output by using a reference list with a translation column; even though this doesn't make sense for my data because I have used TotalSeqA antibodies which are captured via a polyA tail and therefore the reads will have the "RNA cell barcode" and not the "Protein cell barcode" and no mapping is needed. Hence, I put the same barcode in the translation column as in the reference column of the reference list. Now it is creating output, but just thought it could be useful to make use of the translation column optional in 1.5.0. Thank you

Hoohm commented 3 years ago

That's a bug, tit should only use the translation if it's given. Thanks for trying it out!

On Thu, 21 Jan 2021 at 13:23, Erinke notifications@github.com wrote:

I have solved the issue with the -rl not generating any output by using a reference list with a translation column; even though this doesn't make sense for my data because I have used TotalSeqA antibodies which are captured via a polyA tail and therefore the reads will have the "RNA cell barcode" and not the "Protein cell barcode" and no mapping is needed. Hence, I put the same barcode in the translation column as in the reference column of the reference list. Now it is creating output, but just thought it could be useful to make use of the translation column optional in 1.5.0. Thank you

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Hoohm/CITE-seq-Count/issues/144#issuecomment-764607468, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJVO2DSWK6DGP7KXXMXNO3S3AMEXANCNFSM4WLBKNRQ .

--

Roelli Patrick Division of Animal Physiology and Immunology TUM School of Life Sciences Weihenstephan Technische Universität München Weihenstephaner Berg 3 85354 Freising Germany

https://github.com/Hoohm https://github.com/Hoohm

Hoohm commented 3 years ago

Latest commit should fix your issue :)

Erinke89 commented 3 years ago

Dear @Hoohm, Thanks for the last commit, this indeed fixed both issues! :) Can I ask you a related question please? Is it on purpose that since 1.5.0 using a reference list no longer forces the output to produce all barcodes from the list? In 1.4.3 this was the case, which made it very easy to select cells based on the RNA QC and load the CITE-seq-Count output as ADT assay into an existing Seurat object. 1.5.0 produces output for only a part of the reference list, and I have not yet been able to find a way around to load that into Seurat...

Hoohm commented 3 years ago

Yes it's on purpose. This is mostly related to the new V3 from 10x that has a reference list of 6M barcodes. The list is now more of a reference on which CSC checks the barcodes given compared to before where it was a subset that CSC had to "seek" in the data. Now on top of the ref list, you would need to give the number of cells you expect.

The main idea is to reduce runtime for cells we probably won't even use.

I haven't released 1.5 yet so I'm still open to suggestions.

The other problem with V3 is that the cell barcodes you would use as a reference in CITE-seq-Count won't be found because you need to "translate" the barcodes first.

I was thinking of adding a new argument which would be a subset of the reference to get exactly what you are talking about, but it seems like this would be confusing to users, not even mentioning if they use a 10xV3 kit.

Erinke89 commented 3 years ago

Thanks for explaining! I understand now also why you changed the name from whitelist to reference list. I indeed prefer the old use of the whitelist but can see the value of using the 6M reference list. I'm trying to think how this would be clear to future users. Obviously, an automatic link to the 10x lists when defining the used chemistry would be very elegant but maybe technically difficult. Possibly the old whitelist function could be renamed to something like "pre-selected cells", explaining this entry is for cells selected via other criteria (mRNA/ other QC/ in my case the GEX list is also depleted of faulty barcodes that resulted from index hopping on the sequencer)..?

Additionally, V3 chemistry does not exclude the use of TotalSeqA antibodies so you would not always need to translate the barcodes. But yes in case of TotalSeqB/C antibody you would need both the preselected cells list and the reference list for the translation... so I guess CSC also needs as an input which type of antibody you've been using..?

Hope this is helpful, many thanks for the constant development of CSC !!

Hoohm commented 3 years ago

I forgot about that distinction (A, B or C chemistries). In theory, this should not be an issue as the new version gives you both barcode list, it's just that sometimes you would need it.

I'm thinking ahead for the next big feature in CSC which I'm working on here: https://github.com/Hoohm/scg_lib_structs/tree/master/chemistries

The new feature would ask the user to provide the chemistry being used and would fill in the blanks for them. Barcode positions, potential reference list, etc...

Another user reported he preferred the old way of handling cells, so maybe I'm going to add the "full reference list" as a new argument and keep the older way as a default.

I have to think about it a bit, but thanks a lot for the feedback.

Btw, did the new version run faster for you? Did you see changes in the requirements of memory?

Hoohm commented 3 years ago

So, I've been thinking about different designs for argument inputs, ease of use, backward compatibility, and similar user interaction, I'm considering keeping the choice between -n_cells and -cell_list (old whitelist).

I still need to add a new input, which would be the new -reference_list to provide the translation when needed.

This decouples the data subsetting from the cell barcode correction and the translation.

Erinke89 commented 3 years ago

To me that sounds like a sensible choice and one that can be well explained to new users!

Version 1.4 already ran really quickly for me (5-10min), so I didn't really notice if the new version was faster and haven't checked memory usage. Sorry for not having more feedback on that!

Hoohm / CITE-seq-Count

line 592 of processing.py gives UnboundLocalError in 1.5.0 #144