Closed shokohirosue closed 5 years ago
can you provide an example of the two different UMI_duplication_count.csv
? I think just the first few lines would be enough. There might be some bugs in our summary output module but the gene quantification part should be fine.
The top of UMI_duplication_count.csv
produced by sc_exon_mapping
, sc_demultiplex
and sc_gene_counting
is
duplication number | count |
---|---|
1 | 0 |
2 | 7911131 |
3 | 6091489 |
4 | 4163607 |
5 | 2623180 |
6 | 1597392 |
7 | 987498 |
8 | 634213 |
9 | 433120 |
10 | 313844 |
The top of UMI_dedup_stat.csv
is
cell_id | number of filtered gene | number of corrected UMI | UMI A percentage | UMI T percentage | UMI G percentage | UMI C percentage |
---|---|---|---|---|---|---|
CELL_020000 | 0 | 220 | 0 | 0 | 0 | 0 |
CELL_019994 | 0 | 261 | 0 | 0 | 0 | 0 |
CELL_019990 | 0 | 170 | 0 | 0 | 0 | 0 |
The top of UMI_duplication_count.csv
produced by sc_count_aligned_bam
is
duplication number | count |
---|---|
1 | 5280116 |
2 | 6564473 |
3 | 6006580 |
4 | 4439189 |
5 | 2570016 |
6 | 1600715 |
7 | 1204597 |
8 | 801283 |
9 | 534746 |
10 | 440353 |
The top of UMI_dedup_stat.csv
is
cell_id | number of filtered gene | number of corrected UMI | UMI A percentage | UMI T percentage | UMI G percentage | UMI C percentage |
---|---|---|---|---|---|---|
CELL_020000 | 0 | 137 | 0 | 0 | 0 | 0 |
CELL_019994 | 0 | 158 | 0 | 0 | 0 | 0 |
CELL_019990 | 0 | 94 | 0 | 0 | 0 | 0 |
These are produced from the same inbam
, annofn
and bc_anno
.
Do you have any idea why the numbers in the tables are very different?
Thank you.
I will add the code structure I used:
For sc_exon_mapping
, sc_demultiplex
and sc_gene_counting
read_structure = list(
bs1 = -1, # barcode start position in fq_R1, -1 indicates no barcode
bl1 = 0, # barcode length in fq_R1, 0 since no barcode present
bs2 = 0, # barcode start position in fq_R2
bl2 = 16, # barcode length in fq_R2
us = 16, # UMI start position in fq_R2
ul = 10 # UMI length
)
sc_exon_mapping(inbam = "/inbam/Aligned.sortedByCoord.out.bam",
outbam = "/output1/sample1.mapped.bam",
annofn = "/gencode.vM19.primary_assembly.annotation.gtf",
bc_len = read_structure$bl2,
UMI_len = read_structure$ul)
sc_demultiplex(inbam = "/output1/sample1.mapped.bam",
outdir = "/output1/",
bc_anno = "/sample_index/sample1.csv",
mito="chrM",
nthreads = 10)
sc_gene_counting(outdir = "/output1/",
bc_anno = "/sample_index/sample1.csv"
)
For sc_count_aligned_bam
read_structure = list(
bs1 = -1, # barcode start position in fq_R1, -1 indicates no barcode
bl1 = 0, # barcode length in fq_R1, 0 since no barcode present
bs2 = 0, # barcode start position in fq_R2
bl2 = 16, # barcode length in fq_R2
us = 16, # UMI start position in fq_R2
ul = 10 # UMI length
)
sc_count_aligned_bam(
inbam = "/inbam/Aligned.sortedByCoord.out.bam",
outbam = "/output2/sample1.mapped.bam",
mito="chrM",
annofn = "/gencode.vM19.primary_assembly.annotation.gtf",
bc_len = read_structure$bl2,
UMI_len = read_structure$ul,
outdir = "/output2/",
bc_anno = "/sample_index/sample1.csv"
)
I just checked md5sum of gene_count.csv from both and they are different - which result should I use?
Thank you.
There might be some different default parameters in sc_count_aligned_bam
. The sc_count_aligned_bam
should just be a wrapper of the three functions. You can use the result generated by sc_count_aligned_bam
.
If I run sc_exon_mapping, sc_demultiplex and sc_gene_counting separately, in "UMI_duplication_count.csv" file, I get duplication number 1's count 0. This does not happen when I run sc_count_aligned_bam wrapper. Do you have any idea why this could be?
Also, in "UMI_dedup_stat.csv" file, I get 0 for number of filtered gene, UMI A percentage, UMI T percentage, UMI G percentage and UMI C percentage. Is this how it should be?
I would be grateful for any advice!
Thank you, Shoko