alexdobin / STAR

RNA-seq aligner
MIT License
1.85k stars 506 forks source link

STARsolo: Only one read in the BAM file contain filtered barcode #954

Closed J-Moravec closed 4 years ago

J-Moravec commented 4 years ago

I am remapping a BAM file created by 10x cell_ranger (which use STAR internally).

I have converted BAM into fastq files with 10X bamtofastq. From these files, R1 seems to contain a barcode, R2 seems to contains a read and I1 contains god knows what.

I have fed these into STARsolo (in the order R2 R1 as described in manual) and got Aligned.sortedByCoords.out.BAM as well as filtered barcodes in Solo.out/Gene/filtered/barcodes.tsv which should, if I understand everything correctly, contain barcodes (CB) for well-represented cells with a lot of reads.

I then used pysam to check this, but by running:

import pysam

file = "Aligned.sortedByCoords.out.bam"
barcode = <barcode>

with pysam.AlignmentFile(file, "rb") as bam
  reads_with_barcode = [read for read in bam.fetch(until_eof=True) if read.has_tag("CB") and read.get_tag("CB") == barcode]
  len(reads_with_barcode)

which should search the bam file and found every single read contained particular barcode. Running this with the first barcode in filtered/barcodes.tsv I got... 1. A single read contained this barcode. I am completely baffled by this.

Also, the number of barcodes changes from 160 000 (original BAM) to some 260 000 (new BAM). I am also confused by Barcodes.stats, I got all zeros: https://termbin.com/6gpts

Both of these would suggest that maybe I didn't do something right?

Log files: Log.out: https://termbin.com/xlgt Log.final.out: https://termbin.com/l1e9

Thank you.

alexdobin commented 4 years ago

Let's first figure out why there are only zeros in Barcodes.stats. The Log.out at the end shows that barcodes were detected. What is the content of Solo.out/Gene/Features.stats and Solo.out/Gene/Summary.csv?

J-Moravec commented 4 years ago

Both seems to be well populated:

Features: https://termbin.com/76zg

Summary: https://termbin.com/42hb

In the Summary, the Reads With Valid Barcode is suspicious, but I presume this is 100% and not literally a single read with a valid barcode.

alexdobin commented 4 years ago

Right, this is 100% - it means that all the barcodes match the whitelist, i.e. all were error corrected, which should not happen. This explains why most lines in Barcode.stats are zero, but nExactMatch should not be zero - should contain total number of reads.

Can you try to cut ~100k lines from the original BAM, and go with the whole procedure again - convert bam2fastq and then run STARsolo. If it still has this problem, can you share this small BAM, so that I can reproduce the problem.

Thanks! Alex

J-Moravec commented 4 years ago

Done the procedure again, only 100k reads:

samtools view data/HLR100k.bam | wc -l
# 100000

Log: https://termbin.com/qdo9

Log.final.out: https://termbin.com/riue

                          Number of input reads |   47080

Somehow I have lost 47080 reads?

I have checked the fastq files produced by 10Xs bamtofastq and there truly only 47080 reads! So Unless I don't get something in bamfiles or fastq files or bamtofastq just ate half of my reads! I almost just went around and used cellranger count with the bamtofastq, but that would be a grave error without knowing this!

Never mind. I checked the aligned file out of STAR and it had above 100k reads. So the "duplicated" reads are reads that were mapped on several places. I tried to grep a read from the original file in the new file and couldn't get a match, so possibly some read names (the last two numbers) can change? Or the bamtofastq changes names?

Also apparently some reads do not have CB tag?

samtools view STARsolo/BAM/HLR100k.Aligned.sortedByCoord.out.bam | wc -l
#102281

but:

samtools view STARsolo/BAM/HLR100k.Aligned.sortedByCoord.out.bam | grep "CB" | wc -l
# 30190

So continuing exploration:

Barcodes.stats are exactly the same, all zeros.

Features: https://termbin.com/8wu4 Summary: https://termbin.com/tc9k

Right, this is 100% - it means that all the barcodes match the whitelist, i.e. all were error corrected, which should not happen.

I don't have whitelist. I presumed that since I am working on already preprocessed data, I don't need it?

Regarding bamtofastq, here is info from header:

@CO     10x_bam_to_fastq:I1(BC:QT)
@CO     10x_bam_to_fastq:R1(CR:CY,UR:UY)
@CO     10x_bam_to_fastq:R2(SEQ:QUAL)

And first few lines in fastq from bamtofastq

cat bamtofastq_S1_L001_I1_001.fastq.gz | gunzip | head
@NS500269:554:HHJGLBGX5:1:12305:24406:14142 2:N:0:0
CTAAGTTT
+
AAAAAE6E
@NS500269:554:HHJGLBGX5:1:13102:9918:2719 2:N:0:0
TACCACCA
+
AAAAAEEE
@NS500269:554:HHJGLBGX5:1:13304:15435:15006 2:N:0:0
TACCACCA
cat bamtofastq_S1_L001_R1_001.fastq.gz | gunzip | head
@NS500269:554:HHJGLBGX5:1:12305:24406:14142 1:N:0:0
GTTACAGAGTGACTCTGATCAGCGCC
+
AAAAAEE6EEE6EEE/A/EE/EE/E<
@NS500269:554:HHJGLBGX5:1:13102:9918:2719 1:N:0:0
GATGCTACAACTGGCCTCCCTACCGT
+
AAAAAEEEEEEEEEEEEEEEEEEEEE
@NS500269:554:HHJGLBGX5:1:13304:15435:15006 1:N:0:0
CATGGCGGTAATCACCAGTTTTTCAG
cat bamtofastq_S1_L001_R2_001.fastq.gz | gunzip | head
@NS500269:554:HHJGLBGX5:1:12305:24406:14142 3:N:0:0
ATAAAGAACTGAGCAGAAACCAACAGTGTGCTTTTAATAAAGGATCTCTAGCTCTG
+
AA6AA6EEEE6//EEE///EAEEEEEEAE6E//E/EAE/E///6EEEEEEEAE/EA
@NS500269:554:HHJGLBGX5:1:13102:9918:2719 3:N:0:0
GGCTTCCAGAGAAAATGGCACACCAATCAATAAAGAACTGAGCAGAAAAAAAAAAA
+
A6AAAEEEEAEEEEEEEEEAEEEAAEEEAEE66EE/EEEAAAE6E/A6AAA6A6E/
@NS500269:554:HHJGLBGX5:1:13304:15435:15006 3:N:0:0
GCGCAACTGTTTACTTAAGAGGCTTCCAGAGAAAAATGCACACAAATCAATAAAGA

Anything suspicious looking?

Btw. thanks again Alex for helping me here.

alexdobin commented 4 years ago

Aah, I missed that you are were running it without whitelist. I think this explains all the "weird" observations. You need to specify whitelist for proper operation (i.e. very similar to CellRanger), otherwise, the error correction of barcodes is not performed. This looks like 10X v2 data, so you need the 737K-august-2016.txt whitelist from 10X. I have it here http://labshare.cshl.edu/shares/dobin/dobin/STARsolo/10X/whitelists/737K-august-2016.txt.gz

J-Moravec commented 4 years ago

I can confirm that this fixed the problem. Strange that the None value is allowed but stuff breaks with it.

alexdobin commented 4 years ago

Hi Jiří

does it break or does it run to completion? I think it just produces different results - as expected since without whitelist there is no error-correction of barcodes. This option is used for "classic" non-10X Drop-seq, which usually do not have barcode lists.

Cheers Alex

J-Moravec commented 4 years ago

It runs into completion. What I meant by break is that (as title says) only a single read contained the filtered barcode, which was quite surprising. Even without barcode correction, I would expect more. Or less filtered barcodes. But the number of filtered barcodes remain relatively (if not completely) constant and similar (if not the same) as the number of filtered barcodes obtained from 10x cellranger count.

alexdobin commented 4 years ago

You are right - for filtered barcodes, even without the whitelist, you should see many reads in the BAM file. I do not see this issue in my test. Could you try to just grep this barcode from the BAM file (mapped with --soloCBwhitelist None), i.e. something like samtools view Aligned.out.bam | grep AAACCTGAGTGAAGAG To speed it up, you can map only first 100000 reads with --readMapNumber 100000

Thanks! Alex

J-Moravec commented 4 years ago

Can't replicate the problem any more after the problem with TX:Z tag from https://github.com/alexdobin/STAR/issues/952 was fixed.

STARsolo reported correct tax with even with --soloCBwhitelist None.