"UMI_duplication_count.csv" file and "UMI_dedup_stat.csv" file

shokohirosue commented 5 years ago

If I run sc_exon_mapping, sc_demultiplex and sc_gene_counting separately, in "UMI_duplication_count.csv" file, I get duplication number 1's count 0. This does not happen when I run sc_count_aligned_bam wrapper. Do you have any idea why this could be?

Also, in "UMI_dedup_stat.csv" file, I get 0 for number of filtered gene, UMI A percentage, UMI T percentage, UMI G percentage and UMI C percentage. Is this how it should be?

I would be grateful for any advice!

Thank you, Shoko

LuyiTian commented 5 years ago

can you provide an example of the two different UMI_duplication_count.csv? I think just the first few lines would be enough. There might be some bugs in our summary output module but the gene quantification part should be fine.

shokohirosue commented 5 years ago

The top of UMI_duplication_count.csv produced by sc_exon_mapping, sc_demultiplex and sc_gene_counting is

duplication number	count
1	0
2	7911131
3	6091489
4	4163607
5	2623180
6	1597392
7	987498
8	634213
9	433120
10	313844

The top of UMI_dedup_stat.csv is

cell_id	number of filtered gene	number of corrected UMI	UMI A percentage	UMI T percentage	UMI G percentage	UMI C percentage
CELL_020000	0	220	0	0	0	0
CELL_019994	0	261	0	0	0	0
CELL_019990	0	170	0	0	0	0

The top of UMI_duplication_count.csv produced by sc_count_aligned_bam is

duplication number	count
1	5280116
2	6564473
3	6006580
4	4439189
5	2570016
6	1600715
7	1204597
8	801283
9	534746
10	440353

The top of UMI_dedup_stat.csv is

cell_id	number of filtered gene	number of corrected UMI	UMI A percentage	UMI T percentage	UMI G percentage	UMI C percentage
CELL_020000	0	137	0	0	0	0
CELL_019994	0	158	0	0	0	0
CELL_019990	0	94	0	0	0	0

These are produced from the same inbam, annofn and bc_anno. Do you have any idea why the numbers in the tables are very different?

Thank you.

shokohirosue commented 5 years ago

I will add the code structure I used:

For sc_exon_mapping, sc_demultiplex and sc_gene_counting


read_structure = list(
    bs1 = -1,   # barcode start position in fq_R1, -1 indicates no barcode
    bl1 = 0,    # barcode length in fq_R1, 0 since no barcode present
    bs2 = 0,    # barcode start position in fq_R2
    bl2 = 16,   # barcode length in fq_R2
    us = 16,    # UMI start position in fq_R2
    ul = 10     # UMI length
)

sc_exon_mapping(inbam = "/inbam/Aligned.sortedByCoord.out.bam",
                outbam = "/output1/sample1.mapped.bam",
                annofn = "/gencode.vM19.primary_assembly.annotation.gtf",
    bc_len = read_structure$bl2,
    UMI_len = read_structure$ul)

sc_demultiplex(inbam = "/output1/sample1.mapped.bam",
               outdir = "/output1/",
               bc_anno = "/sample_index/sample1.csv",
               mito="chrM",
               nthreads = 10)

sc_gene_counting(outdir = "/output1/",
                 bc_anno = "/sample_index/sample1.csv"
                 )

For sc_count_aligned_bam

read_structure = list(
    bs1 = -1,   # barcode start position in fq_R1, -1 indicates no barcode
    bl1 = 0,    # barcode length in fq_R1, 0 since no barcode present
    bs2 = 0,    # barcode start position in fq_R2
    bl2 = 16,   # barcode length in fq_R2
    us = 16,    # UMI start position in fq_R2
    ul = 10     # UMI length
)

sc_count_aligned_bam(
    inbam = "/inbam/Aligned.sortedByCoord.out.bam",
    outbam = "/output2/sample1.mapped.bam",
    mito="chrM",
    annofn = "/gencode.vM19.primary_assembly.annotation.gtf",
    bc_len = read_structure$bl2,
    UMI_len = read_structure$ul,
    outdir = "/output2/",
    bc_anno = "/sample_index/sample1.csv"
)

shokohirosue commented 5 years ago

I just checked md5sum of gene_count.csv from both and they are different - which result should I use?

Thank you.

LuyiTian commented 5 years ago

There might be some different default parameters in sc_count_aligned_bam. The sc_count_aligned_bam should just be a wrapper of the three functions. You can use the result generated by sc_count_aligned_bam.

LuyiTian / scPipe

"UMI_duplication_count.csv" file and "UMI_dedup_stat.csv" file #113