alexdobin / STAR

RNA-seq aligner
MIT License
1.78k stars 498 forks source link

Alignment of scRNAseq to Proviral sequences #1961

Open codeneeded opened 9 months ago

codeneeded commented 9 months ago

Experiment: Single Cell RNAseq from 10x genomics File Format: I have 2 Lanes (L001 and L002) and I also have for each of the lanes an R1 and R2 file Reference Genome: Multiple HIV proviral sequences Description: I have already performed alignment and analysis of the Human Reference genome using the Cell Ranger Pipeline followed by downstream analysis using Seurat. In my single-cell data, I want to know if any of the cells are HIV infected. In the absence of a DNA-based approach, I want to try to see if any of the cells are producing proviral RNA. My thought process is, if the cells are producing viral transcripts then they must be infected cells.

Question 1: Is this feasible using STAR aligner? I would want to align them and then obtain the cell barcodes of the cells producing viral mrna and then identify those in seurat.

I have the HXB2 common HIV reference genome as well as the annotation file from NCBI I also have the actual proviral sequences from this particular cohort from a different experiment, without any annotation files (GTF).

Question 2: Can and should I use the GTF annotation file from the HXB2 HIV reference genome for other HIV sequences (HIV is highly mutable leading to variation in different proviral sequences)

The HIV proviral sequences are formatted with all 50 proviral sequences in a single .fas file formatted as follows;

>CK152_CP030TP1INTACTInfant7 GCGCCCGAACAGGGACCTGAAAACGAAAGTAAGACCAGAGAAGTTCTCTCGACGCAGGACTCGGCTTGCTGAAGTGCACTCGGCAAGAGGCGAGAGCGGCGGCTGGTGAGTACGCCAAATTTTATTTTGACTAGCGGAGGCTAGAAGGAGAGAGATGGGTGCGAGAGCGTCAATACTAAGAGGGGAAAAATTAGATGCATGGGAAAAAATTAAGTTAAGGCCAGGGTCAAAGAAACGCTATATGATAAAACACATAGTATGGGCAAGCAGGGAGCTGGAAAGATTTGCACTTAACCCTGGTCTTTTAGAGACATCAGAAGGCTGTAAACAAATACTAAGACAGCTACATCCAGCTATTCAGACAGGAACAGAGGAACTTAGATCACTATATAACACAGTAGCAGTTCTCTATTGTGTACATGAAAAAATAGAGGTCCGAGACACCAAGGAAGCCTTAGACAAGATAGAGGAAGAACAAAACAAAATTAGGCAAAAAACACAGGTAGCAGAAGCGGCTGAAAAAGAAAAGGTCAGTCAAAATTATCCTATAGTACAGAATCTCCAAGGGCAAATGGTACACCAGGCCATATCACCAAGAACTTTGAATGCATGGGTAAAAGTAGTAGAGGAAAAGGGTTTCAGCCCAGAAGTAATACCCATGTTCACGGCACTATCAGATGGAGCTACCCCACAAGACTTAAATTCCATGTTAAATATAATAGGGGGACATCAAGCAGCCATGCAAATGTTAAAAGATACCATTAATGAGGAGGCTGCCGACTGGGATAGGGTACATCCAGCACAGGCCGGGCCTGTTCCACCAGGCCAAGTAAGAGACCCGAGGGGAAGTGACATAGCAGGAACTACTAGTACCCTTCAAGAACAGATAGCATGGATGACAC..... >CK158_CP030TP1InferredIntactInfant7 TGAAAGCGAAAGTAAGACCAGAGAAGTTCTCTCGACGCAGGACTCGGCTTGCTGAAGTGCACTCGGCAAGAGGCGAGAGCGGCGGCTGGTGAGTACGCCAAATTTTATTTTGACTAGCGGAGGCTAGAAGGAGAGAGATGGGTGCGAGAGCGTCAATACTAAGAGGGGAAAAATTAGATGCATGGGAAAAAATTAAGTTAAGGCCAGGGTCAAAGAAACGCTATATGATAAAACACATAGTATGGGCAAGCAGGGAGCTGGAAAGATTTGCACTTAACCCTGGTCTTTTAGAGACATCAGAAGGCTGTAAACAAATACTAAGACAGCTACATCCAGCTATTCAGACAGGAACAGAGGAACTTAGATCACTATATAACACAGTAGCAGTTCTCTATTGTGTACATGAAAAAATAGAGGTCCGAGACACCAAGGAAGCCTTAGACAAGATAGAGGAAGAACAAAACAAAATTAGGCAAAAAACACAGGTAGCAGAAGCGGCTGAAAAAGAAAAGGTCAGTCAAAATTATCCTATAGTACAGAATCTCCAAGGGCAAATGGTACACCAGGCCATATCACCAAGAACTTTGAATG.... >CK162_CP030TP3INTACTInfant7 GCGCCCGAACAGGGACTTGAAAGCGAAAGTAAGACCAGAGAAGTTCTCTCGACGCAGGACTCGGCTTGCTGAAGTGCACTCGGCAAGAGGCGAGAGCGGCGGCTGGTGAGTACGCCAAATTTTATTTTGACTAGCGGAGGCTAGAAGGAGAGAGATGGGTGCGAGAGCGTCAATACTAAGAGGGGAAAAATTAGATGCATGGGAAAAAATTAAGTTAAGGCCAGGGTCAAAGAAACGCTATATGATAAAACACATAGTATGGGCAAGCAGGGAGCTGGAAAGATTTGCACTTAACCCTGGTCTTTTAGAGACATCAGAAGGCTGTAAACAAATACTAAGACAGCTACATCCAGCTATTCAGACAGGAACAGAGGAACTTAGATCACTATATAACACAGTAGCAGTTCTCTATTGTGTACATGAAAAAATAGAGGTCCGAGACACCAAGGAAGCCTTAGACAAGATAGAGGAAGAACAAAACAAAATTAGGCAAAAAACACAGGTAGCAGAAGCGGCTGAAAAAGAAAAGGTCAGTCAAAATTATCCTATAGTACAGAATCTCCAAGGGCAAATGGTACACCAGGCCATATCACCAAGAACTTTGAATGCATGGGTAAAAGTAGTAGAGGAAAAGGGTTTCAGCCCAGAAGTAATACCCATGTTCACGGCACTATCAGATGGAGCTACCCCACAAGACTTAAATTCCATGTTAAATATAATAGGGGGACATCAAGCAGCCATGCAAATGTTAAAAGATACCATTAATGAGGAGGCTGCCGACTGGGATAG....

With multiple sequences in the same file. I want to see if my sequence aligns to ANY of the proviral sequences. (I added the stars next to the > because the markdown is converting it into a block quote otherwise.)

Question 3: Do I need to separate each HIV proviral sequence and perform the reference alignment 30+ times or can I put all of them in the same file and will STAR check if it aligns to any of the sequences in the reference file?

Question 4: The proviral sequences are roughly 8000-12000 characters in length. What are the optimal settings to use, and how would I minimize run time? I have about 10 samples I would like to run this on in total.

Thank you for your help, and sorry if any of this is unclear or not formatted correctly, I will be happy to clarify if so.

alexdobin commented 9 months ago

Hi @codeneeded

my recommendation is to add all the viral sequences to the human genome, and mapped to the combined reference. This is the most accurate and fastest approach. If you are not interested in specific genes expressed by the viruses (just the overall expression from the viral chromosomes), you can add each viral sequence as a single exon/transcript/gene line to the human GTF. Since many reads will likely map to multiple viral sequences, you will need to use multimapper option in STARsolo.

codeneeded commented 9 months ago

Thanks for your response;

  1. 'my recommendation is to add all the viral sequences to the human genome, and mapped to the combined reference. '

How do I do this? Do I simply append each sequence to the end of the previous one with the human reference being the first one?

  1. What do I use for the GTF annotation (the human reference or an HXB2 HIV reference or do I need to combine both?

  2. I need to find the cells that align to the HIV proviral sequences as they are transcriptionally active infected cells. How would I be able to tell that from this as cells that have aligned/mapped will have both human and proviral alignments? I would want an output of the barcodes of cells that have aligned to the HIV proviral sequences.

alexdobin commented 9 months ago

When generating the genome index, you can list multiple FASTA file in --genomeFastaFiles.

You need to create one GTF line for each of the viral sequences. Please google the description of GTF file format.

The count matrix will contain both human genes, and the viral "genes" (entire chromosomes) that you added to GTF files.

codeneeded commented 9 months ago

When generating the genome index, you can list multiple FASTA file in --genomeFastaFiles.

Thank you. So I currently have a single fasta file with all the proviral sequences. Based on this I should split this into multiple files with one sequence per file and then input with the genomeFastaFiles command.

You need to create one GTF line for each of the viral sequences. Please google the description of GTF file format. The count matrix will contain both human genes, and the viral "genes" (entire chromosomes) that you added to GTF files.

Not all the viral sequences have annotation. HIV is very mutable so while the annotation for HXB2 (the most common strain) is available, there isnt one for all the other proviral sequences we identified- is this an issue?

Less than 2% of the cells will be infected. This means that 98% of the cells will align to human reference while the other 2% will potentially align to both human and viral. I'm not interested in the viral genes, I'm interested in identifying the cells that are producing them. Thus what I need is a way to differentiate the barcodes of the cells that aligned to virus+human to the ones that only aligned to the human.

Thanks again so much for your help, and sorry if there is any lack of clarity.

alexdobin commented 9 months ago

Instead of using gene annotations for viral sequences, I suggest using the entire viral chromosomes as genes. Then all reads mapping to viral sequences will be counted towards these "genes". This can be done by adding to the GTF file one line per viral chromosome, with start=1 and end=length. With such modified GTF, the count matrix will contain extra genes corresponding to the viral sequences. Then you can extract the cells which have non-zero counts for these viral "genes".

codeneeded commented 9 months ago

So if I'm understanding this; Step 1; Take the human reference genome and each of the proviral sequences and construct the reference using --genomeFastaFiles to read in multiple files Step 2: Use the normal Human GTF annotation file but append it with the entire proviral sequences, one on each line, with start=1 and end=length. Each will be exactly the same as the proviral fasta used in step 1. Step 3: Align with StarSolo. If any of the viral mRNA aligns (partially) with the viral genome it will be labeled as the viral gene. I use this to determine infected cells (cells with non-zero counts)

Is this correct? I'm wondering given that the mRNA found in the sequences will be much smaller than the viral sequence we set as a gene, will it be able to accurately detect and label them as that viral 'gene'?

Thank you so much, I will get started with this.

codeneeded commented 8 months ago

I mainly wondering about Step 3: Align with StarSolo. If any of the viral mRNA aligns (partially) with the viral genome it will be labeled as the viral gene. I use this to determine infected cells (cells with non-zero counts)

Because if we are selecting the entire viral genome as a 'gene', then the alignment will be to a very small portion of this. Do I need to use any specific settings or will this be fine?

alexdobin commented 8 months ago

This is fine, the reads are expected to be smaller than the gene the align to.

codeneeded commented 7 months ago

Ok, so I built the combined genome and the reference and ran it in star solo mode after building the index. I ran into the error;

EXITING because of FATAL ERROR in input read file: the total length of barcode sequence is 101 not equal to expected 26
Read ID=@A00773:242:HL372DSX3:1:1101:2211:1000 ;  Sequence=NTCATTTTCTTACCTAGACCCACCGTTTTCTTATATGGGACGCCTCCTCCAAGTCCCAGCGAACCCGCGTGCAACCTGTCCCGACTCTAGCCGCCTCTTCA
SOLUTION: check the formatting of input read files.
If UMI+CB length is not equal to the barcode read length, specify barcode read length with --soloBarcodeReadLength
To avoid checking of barcode read length, specify --soloBarcodeReadLength 0
Nov 21 13:59:21 ...... FATAL ERROR, exiting

I am using data generated from 10x 5'v2 chemistry, order of input of files is R2 then R1. The UMI and CB seem to be appropriately 16 and 10 which are the same as star defaults. How do I trouble shoot this?

This is how my R1 looks internally;

@A00773:242:HL372DSX3:1:1101:2211:1000 1:N:0:AGAATGGTTT+TCCCACCCTC NTCATTTTCTTACCTAGACCCACCGTTTTCTTATATGGGACGCCTCCTCCAAGTCCCAGCGAACCCGCGTGCAACCTGTCCCGACTCTAGCCGCCTCTTCA +

:FFFFFFFFFFF,FFFFFFFFFFFFFFFF:FFFFF:F:F:FFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF

@A00773:242:HL372DSX3:1:1101:3025:1000 1:N:0:AGAATGGTTT+TCCCACCCTC NGCTCGATCCAAATGCTCACTTTCTTTTTCTTATATGGGGCTTTTTCGCAACGGGTTTGCCGCCAGAACACAGGTGTCGTGAAAACTACCCCTAAAAGCCA +

F,FFF:F,FFF:FFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFF:FFFFFF:FFFFFFF:FFFFFFFF:FF:FFFFFFFFFF,FFFFFFFF,F:FFF:

@A00773:242:HL372DSX3:1:1101:4074:1000 1:N:0:AGAATGGTTT+TCCCACCCTC NGGCCAGGTACTTCTTCTAAAGGATTTTTCTTATATGGGGGTTAGCGTCGGCGGCTTTTGGCATGGCGACTTTTTCTGGCCCGGCTGGGCCAATCCTGTCG +

and this is how my R2 looks

@A00773:242:HL372DSX3:1:1101:2211:1000 2:N:0:AGAATGGTTT+TCCCACCCTC AGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTAGTCAAAGCTTTTTTATTATAATTAAAATTATTTAAAATAAAAAGGAATATAGA + FFFFFFFFF:FFFFF:FFFF,FFFFFFFFFFF:FFFFFFFFFFFF::,,:,F:F,,::FF,FFF:,,FFFF,F,FFF:,F,F,,,,FF::,,,F:,,,,,, @A00773:242:HL372DSX3:1:1101:3025:1000 2:N:0:AGAATGGTTT+TCCCACCCTC ATGTCCCTGTAATCATGTTTTTGATAAAGTCTCTGTGTCCTGGGGCATCAATGATAGTCACATAGTACTTGCTGGTCTCAAATTTCCACAAGGAGATATCA + FFFFFFFFFFFFFFFFFFFFFFFFFFFF:F:FFFFFFFF:FFFFFFFFFFFFFFFF,FFFFFFF,FF:FFFF::,FFF:FFFFFF:FFFFFFFFFFFFFFF @A00773:242:HL372DSX3:1:1101:4074:1000 2:N:0:AGAATGGTTT+TCCCACCCTC CAAGTAGGTTAGGTAGGTGGCGCACATAGACTACTCCAGGAGTAAGTTGTTCTTGTTTTTTTTCGCTGGGTTATGCGCTTGCGAACCTGCGCCACCTCCTT + FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF @A00773:242:HL372DSX3:1:1101:4508:1000 2:N:0:AGAATGGTTT+TCCCACCCTC GCGGGAAGCTGATGCGCAGGTTGAAGGCTTTCAGGTGGTAGGGAGGTTGGTCGGGTAGGAGGAGAGCGTGCCACACCAGGACATTGGCATCATCGCTGGAC +

Also, for the above -> aligning to HIV-Human combined Reference genome and and using the HIV entire genomes as Genes in the GTF, will using a 2 pass approach benefit me? Also you instructed to use the multimapper option, which should I use? In the current I just tried uniform- is this fine? I would idealy want some kind of cuttoff that would reduce false positives or have some degree of surity that the cells that align to HIV genome are actually HIV+ -> Are there stricter settings in STAR that I should use?

alexdobin commented 7 months ago

Please check the STARsolo parameters you need for 10X-5' protocol: https://github.com/alexdobin/STAR/blob/master/docs/STARsolo.md#barcode-and-cdna-on-the-same-mate

2-pass approach is only beneficial if you need more accurate quantification of novel junctions.

--soloMultiMappers EM is most commonly used. Uniform distributes each multimapping read into all genes, not trying to decide on a unique gene for each read.