alexdobin / STAR

RNA-seq aligner
MIT License
1.82k stars 503 forks source link

custom reference alignment query #1665

Open biohiran opened 1 year ago

biohiran commented 1 year ago

Hi Alex,

I am working on getting human endogenous virus expression in single cell sample set using STARSolo and further downstream analysis. I have set up the work flow as follows:

  1. Building the reference. I have made an exclusive fasta sequences for full length hervs annotated based on some literature search and in this case i have custom made a gtf based on the sequence length for each sequence.(basically each entry in fasta corresponds to the gene information. The size comes nearly 2.2GB and a STAR index has built based on these data
STAR --runThreadN 4 --runMode genomeGenerate --genomeDir HERV \
--genomeFastaFiles ERV.fa --sjdbGTFfile ERV.gtf
  1. Then I ran STAR solo as follows to map to this ref index.

STAR --runThreadN 19 --soloType Droplet \ --soloCBwhitelist 737K-august-2016.txt --outSAMattributes NH HI nM AS CR UR CB UB GX GN sS sQ sM \ --outSAMtype BAM SortedByCoordinate --genomeDir HERV --outFileNamePrefix {sample} --readFilesCommand zcat \ --outFilterMultimapNmax 100 --winAnchorMultimapNmax 100 --outMultimapperOrder Random --runRNGseed 777 --soloMultiMappers EM --readFilesIn {','.join(files2)} {','.join(files1)}

I have set soloMultiMappers EM so that as far as I understood I would get a relative expression for each herv in matrix format(considering the multimapped reads).

My queries are:

  1. I am seeing different publications for similar analysis were used different number for outFilterMultimapNmax and winAnchorMultimapNmax (like 10, 100 and in some paper it's set as 1000). How can I decide what number would be best for these parameters?
  2. When I did my downstream analysis, that is mapping these EM based matrix resulted by STAR solo to regular GEX data using Seurat I do see the herv expression all over the clusters irrespective of the samples used. Can't really distinguish the expression in clusters. Screen Shot 2022-09-27 at 2 22 00 PM

The only filtering I have done is regular cell filtering we do inSeurat GEX analysis. Do you think is there any flaw in this workflow that resulting to sparse results. ?

Thanks, Hiran

alexdobin commented 1 year ago

Hi Hiran,

first we need to look at the overall mapping statistics in the Log.finall.out file.

biohiran commented 1 year ago

Hi Alex,

below is the Log for one sample. I have used the STAR2.7.10a_alpha_220506 version.


                                 Started job on |       Oct 05 21:43:30
                             Started mapping on |       Oct 05 21:43:33
                                    Finished on |       Oct 05 22:08:26
       Mapping speed, Million of reads per hour |       170.66

                          Number of input reads |       70774454
                      Average input read length |       98
                                    UNIQUE READS:
                   Uniquely mapped reads number |       300705
                        Uniquely mapped reads % |       0.42%
                          Average mapped length |       93.51
                       Number of splices: Total |       16344
            Number of splices: Annotated (sjdb) |       0
                       Number of splices: GT/AG |       14333
                       Number of splices: GC/AG |       382
                       Number of splices: AT/AC |       14
               Number of splices: Non-canonical |       1615
                      Mismatch rate per base, % |       6.38%
                         Deletion rate per base |       0.10%
                        Deletion average length |       1.81
                        Insertion rate per base |       0.06%
                       Insertion average length |       1.58
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       104550
             % of reads mapped to multiple loci |       0.15%
        Number of reads mapped to too many loci |       0
             % of reads mapped to too many loci |       0.00%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |       0
       % of reads unmapped: too many mismatches |       0.00%
            Number of reads unmapped: too short |       70356241
                 % of reads unmapped: too short |       99.41%
                Number of reads unmapped: other |       12958
                     % of reads unmapped: other |       0.02%
                                  CHIMERIC READS:
                       Number of chimeric reads |       0
                            % of chimeric reads |       0.00%

Thanks in advance. Hiran

alexdobin commented 1 year ago

Hi Hiran,

the mapping rate is very low, both for unique and multi-mappers. Also, the mismatch error rate is high. It's likely that the reference you are using is not good for the data you are trying to map.

biohiran commented 1 year ago

thanks Alex.

I am revisiting my genome index build step. (Do you also suggest the reason of poor quality mapping due to any issue in this step?) My read-length is 98, but I have used default "sjdbOverhang ie; 100" for genome index building. I have read that this value should be (Readlength-1). also the gtf that I used is a custom one, that is length of each fasta sequence is mentioned as gene info entries in GTF (since those are hervs and gene start end location was not available for each fasta).

Do you think any of these can cause this issue, or anything totally different?

Also running with dataset in GEO set to see if it's any quality issue of the data.

Hiran

alexdobin commented 1 year ago

Hi Hiran,

column3 == gene lines are not used by STAR, so it should not affect the results. Non-ideal --sjdbOverhang does not affect the results strongly.

biohiran commented 1 year ago

Thanks for re-assuring these points. The public dataset I have tried also have the same mapping metrics. So I am back to square one .

Re-iterating your lines "It's likely that the reference you are using is not good for the data you are trying to map", since I have used a small subsets of annotated hervs as reference, and if the sample doesn't have enough transcripts to map, do you think this could happen?

Other words, what could decided "mismatch error rate" in this case?

Thanks for the help.

alexdobin commented 1 year ago

There may be two issues with the reference you are using. The overall low mapping rate means that most reads in the sample are not HERVs, and thus they do not have a place to map. To mitigate this issue, you can map to a combined reference - human genome + extra HERVs. The high mismatch rate of the mapped reads indicates that the HERV sequences in the sample do not match well with the HERV sequences in your reference. You may need to look for a more complete database of HERV sequences.

biohiran commented 1 year ago

Thanks Alex. I am going to try your suggestions. One thing , when I make human genome+extra HERVs, the mapping count won't affect too much (Considering extra hERVS are originally taken from human genome ) especially if I am using the soloMultiMappers EM option.Is that right?

alexdobin commented 1 year ago

Hi @biohiran

If the HERV loci in the human genome are not annotated, Solo counts won't be affected since it only counts reads that overlap annotations.