alexdobin / STAR

RNA-seq aligner
MIT License

UB:Z:- when CB tag has a full barcode #1588

Open yagam-fluent opened 2 years ago

yagam-fluent commented 2 years ago

Hi Alex,

What causes blank (dash) UB tags when the CB tag has a normal barcode sequence? The R1 (barcode+UMI) read looks like any other read, but I noticed that the R2 (cDNA) read is low quality. Is that related?

Thanks, Yigal

alexdobin commented 2 years ago

Hi Yigal,

R2 should not affect it. Could you please send me the SAM line?

Thanks! Alex

yagam-fluent commented 2 years ago

Here's an example of one read with such a result. In this case R2 is actually not low quality. Most of the reads in this file work fine.

BC len: 16 UMI len: 12

R1:
@VH00284:5:AAAML5VHV:2:1602:67710:9084 1:N:0:CTCTCTAC+TCTACTCT
ATCCCCCAACTGAATCACTCTGCGTTGA
+
CCCCCCCCCCCCCCCCCCCCCCCCCCCC

R2:
@VH00284:5:AAAML5VHV:2:1602:67710:9084 1:N:0:CTCTCTAC+TCTACTCT
TCTCGAGTGCTTCTCGACTGACATGGTCCCTTAGATCGGA
+
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

BAM: VH00284:5:AAAML5VHV:2:1602:67710:9084 256 chr1 28642375 0 21S16M3S * 0 0 TCTCGAGTGCTTCTCGACTGACATGGTCCCTTAGATCGGA CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC gx:Z:- NH:i:5 HI:i:3 CB:Z:ATCCCCCAACTGAATC UB:Z:-
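For reference, the split of R1 implied by the lengths above can be sketched as follows. This assumes the barcode starts at base 1 and the UMI follows immediately; `split_r1` is a hypothetical helper for illustration, not part of STAR.

```python
from typing import Tuple


def split_r1(seq: str, cb_len: int = 16, umi_len: int = 12) -> Tuple[str, str]:
    """Split a barcode read into (cell barcode, UMI).

    Assumes the barcode occupies the first cb_len bases and the UMI
    the next umi_len bases (an assumption about the library layout).
    """
    if len(seq) < cb_len + umi_len:
        raise ValueError("read is shorter than barcode + UMI")
    return seq[:cb_len], seq[cb_len:cb_len + umi_len]


# R1 sequence from the read above
r1 = "ATCCCCCAACTGAATCACTCTGCGTTGA"
cb, umi = split_r1(r1)
print(cb)   # ATCCCCCAACTGAATC
print(umi)  # ACTCTGCGTTGA
```

The first 16 bases match the CB tag in the BAM record, so the barcode itself was read correctly; only UB is blank.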

alexdobin commented 2 years ago

Hi Yigal,

This read is a multimapper with no gene assigned to it, so error correction of the CB and UMI is not performed. I did not take good care of outputting the error-corrected CB/UB tags for such reads, and the behavior has varied between versions. The most reliable way to get the CB/UMI for such reads is to add CR and UR to --outSAMattributes, which outputs the raw barcode sequences from the FASTQ.

Cheers, Alex
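A minimal sketch of how this looks downstream: parse the optional tags of a SAM record, flag multimappers without a gene, and fall back to CR/UR when CB/UB is `-`. The helper names are hypothetical, and the CR/UR values in the example are an assumption: the record posted above did not include them, so they are filled in from the R1 sequence for illustration.

```python
def parse_sam_tags(sam_line: str) -> dict:
    """Return the optional tags (fields 12+) of a tab-delimited SAM line."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for item in fields[11:]:
        name, typ, value = item.split(":", 2)
        tags[name] = int(value) if typ == "i" else value
    return tags


def best_barcodes(tags: dict) -> tuple:
    """Prefer corrected CB/UB; fall back to raw CR/UR when the tag is '-'."""
    cb = tags.get("CB", "-")
    ub = tags.get("UB", "-")
    if cb == "-":
        cb = tags.get("CR", "-")
    if ub == "-":
        ub = tags.get("UR", "-")
    return cb, ub


# The record from this thread, with CR/UR appended (assumed values, see above).
record = "\t".join([
    "VH00284:5:AAAML5VHV:2:1602:67710:9084", "256", "chr1", "28642375", "0",
    "21S16M3S", "*", "0", "0",
    "TCTCGAGTGCTTCTCGACTGACATGGTCCCTTAGATCGGA",
    "CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC",
    "gx:Z:-", "NH:i:5", "HI:i:3", "CB:Z:ATCCCCCAACTGAATC", "UB:Z:-",
    "CR:Z:ATCCCCCAACTGAATC", "UR:Z:ACTCTGCGTTGA",
])

tags = parse_sam_tags(record)
is_multimapper = tags["NH"] > 1          # NH is the number of reported loci
has_gene = tags.get("gx", "-") != "-"    # '-' means no gene was assigned
print(is_multimapper, has_gene)          # True False
print(best_barcodes(tags))               # ('ATCCCCCAACTGAATC', 'ACTCTGCGTTGA')
```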

mbahin commented 2 years ago

Hi Alex,

Do you plan on correcting the BC and UMI of multi-mapping reads in a future version? For us, for example, it would be interesting to know, even for multi-mapping reads, whether a proper BC was found.

Another related question: from what I can see, UB is always - whenever CB is -, and vice versa. Could you elaborate a bit on that, please? I thought the two tags were filtered independently.

Cheers, Mathieu

alexdobin commented 2 years ago

Hi Mathieu,

If you use a newer version of STAR with --soloMultiMappers EM / Rescue / PropUnique, error correction for multi-gene reads is performed and the CB is output. UMI error correction is only done for reads with a valid CB and GX. The CR and UR tags contain the uncorrected CB and UMI.
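As a rough sketch, a run combining these options could be assembled like this. All file paths are hypothetical placeholders, the barcode geometry (16-base CB, 12-base UMI) is taken from the example earlier in the thread, and option names should be checked against the manual for your STAR version.

```python
import shlex


def starsolo_args(genome_dir, cdna_fastq, barcode_fastq, whitelist,
                  multimappers="EM"):
    """Build a STARsolo argument list (a sketch, not a validated pipeline)."""
    return [
        "STAR",
        "--genomeDir", genome_dir,
        # cDNA read first, barcode+UMI read second
        "--readFilesIn", cdna_fastq, barcode_fastq,
        "--soloType", "CB_UMI_Simple",
        "--soloCBwhitelist", whitelist,
        "--soloCBlen", "16",
        "--soloUMIstart", "17",
        "--soloUMIlen", "12",
        "--soloFeatures", "Gene",
        # EM, Rescue or PropUnique, as mentioned above
        "--soloMultiMappers", multimappers,
        # CB/UB tags need coordinate-sorted BAM output
        "--outSAMtype", "BAM", "SortedByCoordinate",
        # CR/UR carry the raw barcode/UMI, CB/UB the corrected ones
        "--outSAMattributes", "NH", "HI", "CR", "UR", "CB", "UB", "GX", "GN",
    ]


# Hypothetical paths, for illustration only
args = starsolo_args("genome/", "sample_R2.fastq", "sample_R1.fastq",
                     "whitelist.txt")
print(shlex.join(args))
```

Keeping CR/UR alongside CB/UB means the raw sequences are always available, whether or not correction was applied to a given read.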

mbahin commented 2 years ago

Hi Alex,

Thanks for your quick reply. I'll upgrade from 2.7.9a to 2.7.10a! :) I have 2 more questions.

1) I don't allow multi mappers, and one of my recurrent issues is that I have reads that map to both genes and their related pseudogenes. We are doing scRNA-Seq and have read that most pseudogenes are not expressed, so we'd like these reads to be assigned to the gene. I've seen that the CellRanger documentation recommends using a filtered GTF. At first, I thought this could help with my problem, but trimming the GTF doesn't change the fact that the read sequence will map equally well to the gene and to the pseudogene. So it will still be a multi mapper that I don't recover, I guess? Thus I don't understand the advantage of the filtered GTF... Or does it only help if you allow multi mappers?

2) As I said, I don't want to allow multi mappers but I'd like to know for each read if the BC can be matched in the whitelist, even for multi mapping reads. What would you suggest then? Run once without multi mappers allowed and rescue the BC info from another run allowing multi mappers?

Cheers, Mathieu

alexdobin commented 2 years ago

Hi Mathieu,

If a read maps to two locations in the genome, a gene and its pseudogene, it is considered multi-gene with the full GTF. With the filtered GTF the pseudogene is no longer annotated, so the same read is considered unique-gene. The filtered GTF will therefore solve the issue you are talking about.
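To illustrate the filtering idea, here is a minimal sketch that keeps only GTF records whose gene biotype is in an allowed set. It assumes the annotation carries a `gene_type` or `gene_biotype` attribute; the gene names and the set of kept biotypes are made up for the example, and this is not the filter CellRanger itself applies.

```python
import re

# Matches gene_type "..." (GENCODE style) or gene_biotype "..." (Ensembl style)
_BIOTYPE = re.compile(r'gene_(?:bio)?type "([^"]+)"')


def filter_gtf_lines(lines, keep_biotypes):
    """Yield header lines and records whose gene biotype is in keep_biotypes.

    Records without a recognizable biotype attribute are dropped.
    """
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        match = _BIOTYPE.search(line)
        if match and match.group(1) in keep_biotypes:
            yield line


def gtf_line(chrom, start, end, gene_id, biotype):
    """Build a toy exon record (hypothetical data)."""
    attrs = f'gene_id "{gene_id}"; gene_type "{biotype}";'
    return "\t".join([chrom, "toy", "exon", str(start), str(end),
                      ".", "+", ".", attrs])


full_gtf = [
    gtf_line("chr1", 1000, 1200, "GENE_A", "protein_coding"),
    gtf_line("chr5", 5000, 5200, "GENE_A_P1", "processed_pseudogene"),
]
filtered = list(filter_gtf_lines(full_gtf, {"protein_coding", "lncRNA"}))
print(len(full_gtf), len(filtered))  # 2 1
```

A read aligning to both loci overlaps two annotated genes with the full annotation, but only GENE_A after filtering.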

When you run STAR with multimappers options, it outputs both multi- and unique counts.