Hoohm / CITE-seq-Count

A tool that allows to get UMI counts from a single cell protein assay
https://hoohm.github.io/CITE-seq-Count/
MIT License

Cell Hashing on Multiome samples #174

Open bassanio opened 1 year ago

bassanio commented 1 year ago

Hi,

I am very new to cell hashing. I have 10x output from cellranger-arc (which has both RNA-seq and ATAC-seq). I was told the samples were multiplexed using BioLegend hashing antibodies (Ab), and I have been provided with the Ab sequences.

1) How can I use the provided Ab sequences to demultiplex the output of cellranger-arc?

Hoohm commented 1 year ago

Hello @bassanio did you already go through the documentation? If so, could you maybe tell me specifically what you need help with?

mbassalbioinformatics commented 1 year ago

Hi, my question is in the same spirit as the previous one.

In the documentation you outline the structure of R1, with the cell barcode and UMI positions, and of R2, with the Ab barcode data. You also mention the tags.csv file that CITE-seq-Count uses, together with the input fq files, to generate counts for the Ab barcodes listed in the csv. That part makes sense.

Now my question is: where does the barcode info for the HTOs come into play? Where do you specify those, and how does CITE-seq-Count deal with them? Do I need to run CITE-seq-Count twice, once for the Ab barcodes and a second time for the HTOs? Or do I make a single csv file with the HTO and Ab sequences and let CITE-seq-Count loose on all of it in one go?

(I have 1 file of the format [say hto.csv]...

XXXXXX,hashtag1
YYYYYY,hashtag2

... and a 2nd file of format [say abs.csv]...

AAAAAA,Ab1
BBBBBB,Ab2

Are you able to provide pseudo-code/commands showing how to run CITE-seq-Count for each of hto.csv and abs.csv to get the counts required for progressing?)

My second question, assuming the HTO/Ab situation is dealt with: the next step would be loading this information into Seurat for integration, is that correct?

Hoohm commented 1 year ago

Depending on how your libraries have been sequenced, you can run everything together. You should have fastqs for the ABs and fastqs for the HTOs.

Does cellranger give you the output you need for the ABs?

If so, you only need to run CSC on the HTO.

You can make a tags.csv with all your HTO tags and all your AB tags; CSC will try to match all of those against the fastqs you provide.
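As a rough illustration of building such a combined tag file, here is a minimal sketch that concatenates the two example files from the question (hto.csv and abs.csv) into one tags.csv. The function name `merge_tag_files` and the file names are hypothetical; the only assumption about the format is the two-column `sequence,name` layout shown earlier in the thread.

```python
import csv


def merge_tag_files(paths, out_path):
    """Merge several 'sequence,name' tag files into one (hypothetical helper)."""
    merged = {}  # sequence -> tag name, in first-seen order
    for path in paths:
        with open(path, newline="") as handle:
            for row in csv.reader(handle):
                if len(row) < 2:
                    continue  # skip blank or malformed lines
                seq, name = row[0].strip(), row[1].strip()
                # The same sequence under two different names would be ambiguous.
                if seq in merged and merged[seq] != name:
                    raise ValueError(f"{seq} is listed as both {merged[seq]} and {name}")
                merged[seq] = name
    with open(out_path, "w", newline="") as handle:
        csv.writer(handle, lineterminator="\n").writerows(merged.items())
    return merged
```

Calling `merge_tag_files(["hto.csv", "abs.csv"], "tags.csv")` would then give a single file to pass to `-t`.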

Pseudo code is very simple.

  1. Take a read from R2 and try to match any of the tags provided in tags.csv, starting from the start of the read (or from the first base given by --start-trim). If no tag matches, flag the read as unmapped.
  2. Do some cell aggregation
  3. UMI aggregation
  4. Produce read and umi count matrices
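The steps above can be sketched in a few lines of Python. This is only a simplified illustration, not the CITE-seq-Count implementation: it uses exact tag matching, skips cell barcode and UMI error correction, and takes read pairs as plain strings. The function name `count_tags` and its parameters are hypothetical, with defaults mirroring the `-cbf 1 -cbl 16 -umif 17 -umil 26` values used in this thread.

```python
from collections import defaultdict


def count_tags(read_pairs, tags, cb_first=1, cb_last=16,
               umi_first=17, umi_last=26, start_trim=0):
    """Count reads and unique UMIs per (cell barcode, tag).

    read_pairs: iterable of (r1_sequence, r2_sequence) strings.
    tags: dict mapping tag sequence -> tag name.
    Positions are 1-based and inclusive (an assumption of this sketch).
    """
    read_counts = defaultdict(int)  # (cell, tag) -> number of reads
    umis = defaultdict(set)         # (cell, tag) -> set of UMIs seen
    for r1, r2 in read_pairs:
        cell = r1[cb_first - 1:cb_last]
        umi = r1[umi_first - 1:umi_last]
        tag_name = "unmapped"
        for seq, name in tags.items():
            # Exact match at the start of R2, after skipping start_trim bases.
            if r2.startswith(seq, start_trim):
                tag_name = name
                break
        read_counts[(cell, tag_name)] += 1
        umis[(cell, tag_name)].add(umi)
    umi_counts = {key: len(seen) for key, seen in umis.items()}
    return dict(read_counts), umi_counts
```

Two reads with the same cell barcode, tag, and UMI count as two reads but one UMI, which is why the read and UMI matrices can differ.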

Yes, you then need to load the results into Seurat to do the demultiplexing.

mbassalbioinformatics commented 1 year ago

I have fq files for the Abs and for the HTOs, separate from the expression data (i.e. the fq files have been split by sample, and each sample has its corresponding Ab + HTO fq files).

So if I understand you correctly, I need to run cellranger on the Ab + HTO fq files separately to get the count matrix for those, right? And a second run of cellranger on the expression fq files for those counts?

After which I just run CSC on the Ab + HTO fq files with

CITE-seq-Count -R1 ab-HTO_R1.fastq.gz -R2 ab-HTO_R2.fastq.gz \
-t TAG_LIST_HTO-Ab.csv -cbf 1 -cbl 16 -umif 17 -umil 26 -cells 20000 -o ./out/

Did I understand you correctly?

And from there into R for the rest 👍

Hoohm commented 1 year ago

Depending on which 10x kit you used, you can run RNA, AB, and HTO together. Whatever deviates from the normal protocol will not be compatible with the software. So, if the HTO is not part of the kit, you need to run CSC on that part.



bassanio commented 1 year ago

Hi,

I tried to run CITE-seq-Count using the command below and got the following error.

I am also confused about R2 and R3, because I am finding the ABs in R3 and not in R2.

CITE-seq-Count \
  -R1 hto_S3_L001_R1_001.fastq.gz \
  -R2 hto_S3_L001_R3_001.fastq.gz \
  -t TAGS.txt \
  -cbf 1 -cbl 16 -umif 17 -umil 26 -cells 13641 \
  -o RESULT

Tag File

ACCCACCAGTAAGAC,First_P1_Undivided
GGTCGAGAGCATTCA,Second_P2_late_dividers
CTTGCCGCATGTCAT,Third_P3_Early_dividers

Running the above command produces a warning and then an error:

Read1 length is 51bp but you are using 26bp for Cell and UMI barcodes combined.
This might lead to wrong cell attribution and skewed umi counts.

Counting number of reads
Started mapping
Processing 10,651,191 read
CITE-seq-Count is running with XX cores.
Mapping done for process 2006672. Processed 166,424 reads
Mapping done for process 2006674. Processed 166,424 reads
Mapping done for .......
Mapping done for process 2006731. Processed 166,424 reads
Mapping done
Merging results
Correcting cell barcodes
Looking for a whitelist

Collapsing cell barcodes
Correcting umis
Traceback (most recent call last):
  File "/home/.local/bin/CITE-seq-Count", line 8, in <module>
    sys.exit(main())
  File "/home/.local/lib/python3.9/site-packages/cite_seq_count/__main__.py", line 435, in main
    ) = processing.correct_umis(
  File "/home/.local/lib/python3.9/site-packages/cite_seq_count/processing.py", line 229, in correct_umis
    for TAG in final_results[cell_barcode]:
RuntimeError: dictionary keys changed during iteration

[Screenshot: HTO R1 reads]

[Screenshot: HTO R2 reads]

[Screenshot: HTO R3 reads]

[Screenshot: grep of the AB tags in R3]

Some AB barcodes do not start at the expected position, as shown in the example.

cpflueger2016 commented 1 year ago

@bassanio try setting up a conda environment with Python 3.7.16 and run it again. I have had no luck with any Python version > 3.7. The error is actually an issue with changes in the pandas package. If you restrict Python to 3.7.16, pip install CITE-seq-Count==1.4.5 will pull the correct pandas version. Good luck!
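For context on the traceback itself: "dictionary keys changed during iteration" is the RuntimeError Python raises when a dict's keys are replaced while the dict is being looped over. As far as I know, CPython 3.8 and later detect this even when the dict keeps the same size, which could explain why pinning an older interpreter avoids it. The sketch below is a generic illustration of the safe pattern (iterating over a snapshot of the keys). It is not the CITE-seq-Count code, and `merge_similar_keys` and its arguments are hypothetical names.

```python
def merge_similar_keys(counts, canonical):
    """Fold the counts of corrected keys into their canonical key, in place.

    counts: dict mapping key -> count.
    canonical: dict mapping a key to the key it should be merged into.
    """
    # list(counts) takes a snapshot of the keys. Looping over `counts`
    # directly while popping and inserting keys can raise
    # "RuntimeError: dictionary keys changed during iteration".
    for key in list(counts):
        target = canonical.get(key, key)
        if target != key:
            counts[target] = counts.get(target, 0) + counts.pop(key)
    return counts
```

For example, `merge_similar_keys({"AAAT": 2, "AAAA": 5}, {"AAAT": "AAAA"})` folds the two entries into a single `"AAAA"` key.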

bassanio commented 1 year ago

@cpflueger2016: Thanks for the information, I will do that.

Can you also help me understand the R2 and R3 fastq files?

cpflueger2016 commented 1 year ago

Yeah, so if the i7 index read gets written out as its own fastq (there is an option for this in bcl2fastq), your R2 file is actually the library index read and R3 is truly the second read.
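One way to confirm which file carries the tags, and at which position they start, is to scan the first few thousand reads of each fastq. The sketch below is a hypothetical helper (`tag_offsets` is not part of CITE-seq-Count) that assumes a gzipped, four-lines-per-record fastq and exact tag matches.

```python
import gzip
from collections import Counter
from itertools import islice


def tag_offsets(fastq_gz, tag_seqs, max_reads=10000):
    """Tally the 0-based offset at which any tag first appears in each read.

    Reads containing no tag are tallied under the key None.
    """
    offsets = Counter()
    with gzip.open(fastq_gz, "rt") as handle:
        # The sequence is the 2nd line of every 4-line fastq record.
        for line in islice(handle, 1, max_reads * 4, 4):
            seq = line.strip()
            found = [pos for pos in (seq.find(tag) for tag in tag_seqs) if pos >= 0]
            offsets[min(found) if found else None] += 1
    return offsets
```

Running it on the R2 and R3 files with the three tag sequences from TAGS.txt should show which file is dominated by matches. If most matches sit at a non-zero offset, that offset is a candidate value for --start-trim.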

bassanio commented 1 year ago

@cpflueger2016: I get this warning message at the top:

"Read1 length is 51bp but you are using 26bp for Cell and UMI barcodes combined"

Should I change -umil to 51? Does this have any effect on the analysis?

Hoohm commented 1 year ago

This is not going to affect the analysis. Back in the day I wanted to make sure people knew what they were running and catch potential wrong lengths. In hindsight this might have been a mistake as it confuses users more than anything.
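To make the warning concrete, here is a small sketch of how the four position flags carve up a 51 bp Read 1. It assumes the positions are 1-based and inclusive, and `split_read1` is a hypothetical helper rather than the tool's own function.

```python
def split_read1(r1_seq, cbf=1, cbl=16, umif=17, umil=26):
    """Split Read 1 into (cell barcode, UMI, unused tail).

    Positions are assumed to be 1-based and inclusive, mirroring
    -cbf/-cbl/-umif/-umil as used in this thread.
    """
    cell_barcode = r1_seq[cbf - 1:cbl]
    umi = r1_seq[umif - 1:umil]
    unused = r1_seq[max(cbl, umil):]  # bases past the barcode and UMI
    return cell_barcode, umi, unused
```

With `-cbl 16 -umil 26`, a 51 bp read yields a 16 bp barcode, a 10 bp UMI, and 25 trailing bases that are simply not used. Setting -umil to 51 would instead treat those trailing bases as part of the UMI.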

Is your general issue resolved? Can I close this one?