Closed J-Moravec closed 4 years ago
Let's first figure out why there are only zeros in Barcodes.stats. The Log.out at the end shows that barcodes were detected. What is the content of Solo.out/Gene/Features.stats and Solo.out/Gene/Summary.csv?
Both seems to be well populated:
Features: https://termbin.com/76zg
Summary: https://termbin.com/42hb
In the Summary, the Reads With Valid Barcode is suspicious, but I presume this is 100% and not literally a single read with a valid barcode.
Right, this is 100% - it means that all the barcodes match the whitelist, i.e. all were error corrected, which should not happen. This explains why most lines in Barcode.stats are zero, but nExactMatch should not be zero - should contain total number of reads.
Can you try to cut ~100k lines from the original BAM, and go with the whole procedure again - convert bam2fastq and then run STARsolo. If it still has this problem, can you share this small BAM, so that I can reproduce the problem.
Thanks! Alex
Done the procedure again, only 100k reads:
samtools view data/HLR100k.bam | wc -l
# 100000
Log.final.out: https://termbin.com/riue
Number of input reads | 47080
Somehow I have lost 47080 reads?
I have checked the fastq files produced by 10Xs bamtofastq
and there truly only 47080 reads! So Unless I don't get something in bamfiles or fastq files or bamtofastq
just ate half of my reads! I almost just went around and used cellranger count
with the bamtofastq
, but that would be a grave error without knowing this!
Never mind. I checked the aligned file out of STAR and it had above 100k reads. So the "duplicated" reads are reads that were mapped on several places. I tried to grep a read from the original file in the new file and couldn't get a match, so possibly some read names (the last two numbers) can change? Or the bamtofastq
changes names?
Also apparently some reads do not have CB tag?
samtools view STARsolo/BAM/HLR100k.Aligned.sortedByCoord.out.bam | wc -l
#102281
but:
samtools view STARsolo/BAM/HLR100k.Aligned.sortedByCoord.out.bam | grep "CB" | wc -l
# 30190
So continuing exploration:
Barcodes.stats are exactly the same, all zeros.
Features: https://termbin.com/8wu4 Summary: https://termbin.com/tc9k
Right, this is 100% - it means that all the barcodes match the whitelist, i.e. all were error corrected, which should not happen.
I don't have whitelist. I presumed that since I am working on already preprocessed data, I don't need it?
Regarding bamtofastq
, here is info from header:
@CO 10x_bam_to_fastq:I1(BC:QT)
@CO 10x_bam_to_fastq:R1(CR:CY,UR:UY)
@CO 10x_bam_to_fastq:R2(SEQ:QUAL)
And first few lines in fastq
from bamtofastq
cat bamtofastq_S1_L001_I1_001.fastq.gz | gunzip | head
@NS500269:554:HHJGLBGX5:1:12305:24406:14142 2:N:0:0
CTAAGTTT
+
AAAAAE6E
@NS500269:554:HHJGLBGX5:1:13102:9918:2719 2:N:0:0
TACCACCA
+
AAAAAEEE
@NS500269:554:HHJGLBGX5:1:13304:15435:15006 2:N:0:0
TACCACCA
cat bamtofastq_S1_L001_R1_001.fastq.gz | gunzip | head
@NS500269:554:HHJGLBGX5:1:12305:24406:14142 1:N:0:0
GTTACAGAGTGACTCTGATCAGCGCC
+
AAAAAEE6EEE6EEE/A/EE/EE/E<
@NS500269:554:HHJGLBGX5:1:13102:9918:2719 1:N:0:0
GATGCTACAACTGGCCTCCCTACCGT
+
AAAAAEEEEEEEEEEEEEEEEEEEEE
@NS500269:554:HHJGLBGX5:1:13304:15435:15006 1:N:0:0
CATGGCGGTAATCACCAGTTTTTCAG
cat bamtofastq_S1_L001_R2_001.fastq.gz | gunzip | head
@NS500269:554:HHJGLBGX5:1:12305:24406:14142 3:N:0:0
ATAAAGAACTGAGCAGAAACCAACAGTGTGCTTTTAATAAAGGATCTCTAGCTCTG
+
AA6AA6EEEE6//EEE///EAEEEEEEAE6E//E/EAE/E///6EEEEEEEAE/EA
@NS500269:554:HHJGLBGX5:1:13102:9918:2719 3:N:0:0
GGCTTCCAGAGAAAATGGCACACCAATCAATAAAGAACTGAGCAGAAAAAAAAAAA
+
A6AAAEEEEAEEEEEEEEEAEEEAAEEEAEE66EE/EEEAAAE6E/A6AAA6A6E/
@NS500269:554:HHJGLBGX5:1:13304:15435:15006 3:N:0:0
GCGCAACTGTTTACTTAAGAGGCTTCCAGAGAAAAATGCACACAAATCAATAAAGA
Anything suspicious looking?
Btw. thanks again Alex for helping me here.
Aah, I missed that you are were running it without whitelist. I think this explains all the "weird" observations. You need to specify whitelist for proper operation (i.e. very similar to CellRanger), otherwise, the error correction of barcodes is not performed. This looks like 10X v2 data, so you need the 737K-august-2016.txt whitelist from 10X. I have it here http://labshare.cshl.edu/shares/dobin/dobin/STARsolo/10X/whitelists/737K-august-2016.txt.gz
I can confirm that this fixed the problem. Strange that the None value is allowed but stuff breaks with it.
Hi Jiří
does it break or does it run to completion? I think it just produces different results - as expected since without whitelist there is no error-correction of barcodes. This option is used for "classic" non-10X Drop-seq, which usually do not have barcode lists.
Cheers Alex
It runs into completion. What I meant by break is that (as title says) only a single read contained the filtered barcode, which was quite surprising. Even without barcode correction, I would expect more. Or less filtered barcodes. But the number of filtered barcodes remain relatively (if not completely) constant and similar (if not the same) as the number of filtered barcodes obtained from 10x cellranger count
.
You are right - for filtered barcodes, even without the whitelist, you should see many reads in the BAM file.
I do not see this issue in my test. Could you try to just grep this barcode from the BAM file (mapped with --soloCBwhitelist None
), i.e. something like
samtools view Aligned.out.bam | grep AAACCTGAGTGAAGAG
To speed it up, you can map only first 100000 reads with --readMapNumber 100000
Thanks! Alex
Can't replicate the problem any more after the problem with TX:Z
tag from https://github.com/alexdobin/STAR/issues/952 was fixed.
STARsolo
reported correct tax with even with --soloCBwhitelist None
.
I am remapping a BAM file created by 10x cell_ranger (which use STAR internally).
I have converted
BAM
intofastq
files with 10Xbamtofastq
. From these files, R1 seems to contain a barcode, R2 seems to contains a read and I1 contains god knows what.I have fed these into
STARsolo
(in the orderR2 R1
as described in manual) and gotAligned.sortedByCoords.out.BAM
as well as filtered barcodes inSolo.out/Gene/filtered/barcodes.tsv
which should, if I understand everything correctly, contain barcodes (CB) for well-represented cells with a lot of reads.I then used
pysam
to check this, but by running:which should search the bam file and found every single read contained particular barcode. Running this with the first barcode in
filtered/barcodes.tsv
I got... 1. A single read contained this barcode. I am completely baffled by this.Also, the number of barcodes changes from 160 000 (original BAM) to some 260 000 (new BAM). I am also confused by
Barcodes.stats
, I got all zeros: https://termbin.com/6gptsBoth of these would suggest that maybe I didn't do something right?
Log files: Log.out: https://termbin.com/xlgt Log.final.out: https://termbin.com/l1e9
Thank you.