Open Kaj-cyber opened 4 years ago
Hi Kaj,
please post the entire Log.final.out file. The mappability of 50% is not great - it depends on sequencing quality as well as genome assembly. It's a good idea to assess the "baseline" mappability using a bulk RNA-seq dataset from a similar tissue, and trimming its read length to 55b.
For Drop-seq, you need --soloType Droplet
with proper CB and UMI starts/lengths with:
--soloCBstart --soloCBlen --soloUMIstart --soloUMIlen
and no-whitelist operation with --soloCBwhitelist None
Cheers Alex
@alexdobin
Dear Alex,
I used to perform alignment with dog genome (transcript length=80-90 bp) and acquired 70-80%. After carefully reading your answers in several issues, I managed to acquired >70% alignment by modifying 3 parameters as follows:
--outFilterScoreMinOverLread 0.30 \ --outFilterMismatchNoverLmax 0.1 \ --outFilterMultimapNmax 25 \
And the log.out result was as follows: ` Started job on | Sep 03 09:23:53 Started mapping on | Sep 03 09:24:06 Finished on | Sep 03 09:30:49 Mapping speed, Million of reads per hour | 1538.19
Number of input reads | 172191416
Average input read length | 50
UNIQUE READS:
Uniquely mapped reads number | 127866532
Uniquely mapped reads % | 74.26%
Average mapped length | 51.94
Number of splices: Total | 5033819
Number of splices: Annotated (sjdb) | 4193485
Number of splices: GT/AG | 4523960
Number of splices: GC/AG | 53191
Number of splices: AT/AC | 7629
Number of splices: Non-canonical | 449039
Mismatch rate per base, % | 0.76%
Deletion rate per base | 0.03%
Deletion average length | 1.34
Insertion rate per base | 0.02%
Insertion average length | 1.28
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 18104863
% of reads mapped to multiple loci | 10.51%
Number of reads mapped to too many loci | 839570
% of reads mapped to too many loci | 0.49%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 119710 % of reads unmapped: too many mismatches | 0.07% Number of reads unmapped: too short | 17741427 % of reads unmapped: too short | 10.30% Number of reads unmapped: other | 7519314 % of reads unmapped: other | 4.37% CHIMERIC READS: Number of chimeric reads | 0 % of chimeric reads | 0.00%`
From FastQC result of the transcript read, I noticed poor Per base sequences of the first 5 bp of the transcript. I don't know exactly how to remove them using Drop-seq tool. I somehow highly suspected them as the source of poor alignment using default STAR parameters.
However, STARsolo rendered me a better result without any parameter modification!! It also allowed me to cut/trim the transcript reads with my normally used tools (flexbar, cutadapt). From your recommended setting, I got the alignment result as follows: ` Started job on | Sep 03 17:45:57 Started mapping on | Sep 03 17:46:12 Finished on | Sep 03 17:53:32 Mapping speed, Million of reads per hour | 1266.81
Number of input reads | 154831743
Average input read length | 47
UNIQUE READS:
Uniquely mapped reads number | 119135101
Uniquely mapped reads % | 76.94%
Average mapped length | 47.39
Number of splices: Total | 4187119
Number of splices: Annotated (sjdb) | 3550144
Number of splices: GT/AG | 3801461
Number of splices: GC/AG | 41066
Number of splices: AT/AC | 6309
Number of splices: Non-canonical | 338283
Mismatch rate per base, % | 0.80%
Deletion rate per base | 0.03%
Deletion average length | 1.30
Insertion rate per base | 0.02%
Insertion average length | 1.21
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 13894435
% of reads mapped to multiple loci | 8.97%
Number of reads mapped to too many loci | 1576860
% of reads mapped to too many loci | 1.02%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0 % of reads unmapped: too many mismatches | 0.00% Number of reads unmapped: too short | 17843723 % of reads unmapped: too short | 11.52% Number of reads unmapped: other | 2381624 % of reads unmapped: other | 1.54% CHIMERIC READS: Number of chimeric reads | 0 % of chimeric reads | 0.00%`
I do prefer STARsolo as the aligner! After the test running, I tried to extract the expression matrix of all cells in a file using --quantMode GeneCounts
option. However, it gave me like sum count overall values of the sam file.
#### Would you please guide me a bit further how to get such expression matrix (it's my last request for this issue, please feel free to close it after that)?
Best regards, Kaj
Hi Kaj,
--outFilterScoreMinOverLread 0.30
is somewhat dangerous, as you will be letting in some poor-quality (short) alignments.
I would suggest trimming the 5 poor-quality bases with --clip5pNbases 5
, while keeping --outFilterScoreMinOverLread
as large as possible, ideally at its default 0.66
To get the expression matrix, you need to use --soloType Droplet --soloCBstart <> --soloCBlen <> --soloUMIstart <> --soloUMIlen <> --soloCBwhitelist None
with with proper CB and UMI starts/lengths.
Cheers Alex
Dear Alex,
I would like to ask for your kind recommendation. I'm working with canine Drop-seq data by which each transcript read has 55 bp. The length of barcode+UMI is 20 bp. The example data was included as follows (https://sra-download.ncbi.nlm.nih.gov/traces/sra51/SRR/010756/SRR11014432) According to Drop-seq cookbook which recommened default setting of STAR during alignment, it renders alignment percentage around 50%
Would you please suggest me how to adjust parameters to obtain the better result? (I've never worked with aligment with barcode before and totally have no idea where should I start)
With all due respects, I'm also wondering if it is possible to apply STARsolo with the Drop-seq data? Due to my lack of knowledge and limited experience, I try so hard to find an example for such application without success.
Your kind suggestion would be sincerely appreciated.
Best regards, Kaj