LuyiTian / scPipe

a pipeline for single cell RNA-seq data analysis
68 stars 24 forks source link

"UMI_duplication_count.csv" file and "UMI_dedup_stat.csv" file #113

Closed shokohirosue closed 5 years ago

shokohirosue commented 5 years ago

If I run sc_exon_mapping, sc_demultiplex and sc_gene_counting separately, in "UMI_duplication_count.csv" file, I get duplication number 1's count 0. This does not happen when I run sc_count_aligned_bam wrapper. Do you have any idea why this could be?

Also, in "UMI_dedup_stat.csv" file, I get 0 for number of filtered gene, UMI A percentage, UMI T percentage, UMI G percentage and UMI C percentage. Is this how it should be?

I would be grateful for any advice!

Thank you, Shoko

LuyiTian commented 5 years ago

can you provide an example of the two different UMI_duplication_count.csv? I think just the first few lines would be enough. There might be some bugs in our summary output module but the gene quantification part should be fine.

shokohirosue commented 5 years ago

The top of UMI_duplication_count.csv produced by sc_exon_mapping, sc_demultiplex and sc_gene_counting is

duplication number count
1 0
2 7911131
3 6091489
4 4163607
5 2623180
6 1597392
7 987498
8 634213
9 433120
10 313844

The top of UMI_dedup_stat.csv is

cell_id number of filtered gene number of corrected UMI UMI A percentage UMI T percentage UMI G percentage UMI C percentage
CELL_020000 0 220 0 0 0 0
CELL_019994 0 261 0 0 0 0
CELL_019990 0 170 0 0 0 0

The top of UMI_duplication_count.csv produced by sc_count_aligned_bam is

duplication number count
1 5280116
2 6564473
3 6006580
4 4439189
5 2570016
6 1600715
7 1204597
8 801283
9 534746
10 440353

The top of UMI_dedup_stat.csv is

cell_id number of filtered gene number of corrected UMI UMI A percentage UMI T percentage UMI G percentage UMI C percentage
CELL_020000 0 137 0 0 0 0
CELL_019994 0 158 0 0 0 0
CELL_019990 0 94 0 0 0 0

These are produced from the same inbam, annofn and bc_anno. Do you have any idea why the numbers in the tables are very different?

Thank you.

shokohirosue commented 5 years ago

I will add the code structure I used:

For sc_exon_mapping, sc_demultiplex and sc_gene_counting


read_structure = list(
    bs1 = -1,   # barcode start position in fq_R1, -1 indicates no barcode
    bl1 = 0,    # barcode length in fq_R1, 0 since no barcode present
    bs2 = 0,    # barcode start position in fq_R2
    bl2 = 16,   # barcode length in fq_R2
    us = 16,    # UMI start position in fq_R2
    ul = 10     # UMI length
)

sc_exon_mapping(inbam = "/inbam/Aligned.sortedByCoord.out.bam",
                outbam = "/output1/sample1.mapped.bam",
                annofn = "/gencode.vM19.primary_assembly.annotation.gtf",
    bc_len = read_structure$bl2,
    UMI_len = read_structure$ul)

sc_demultiplex(inbam = "/output1/sample1.mapped.bam",
               outdir = "/output1/",
               bc_anno = "/sample_index/sample1.csv",
               mito="chrM",
               nthreads = 10)

sc_gene_counting(outdir = "/output1/",
                 bc_anno = "/sample_index/sample1.csv"
                 )

For sc_count_aligned_bam

read_structure = list(
    bs1 = -1,   # barcode start position in fq_R1, -1 indicates no barcode
    bl1 = 0,    # barcode length in fq_R1, 0 since no barcode present
    bs2 = 0,    # barcode start position in fq_R2
    bl2 = 16,   # barcode length in fq_R2
    us = 16,    # UMI start position in fq_R2
    ul = 10     # UMI length
)

sc_count_aligned_bam(
    inbam = "/inbam/Aligned.sortedByCoord.out.bam",
    outbam = "/output2/sample1.mapped.bam",
    mito="chrM",
    annofn = "/gencode.vM19.primary_assembly.annotation.gtf",
    bc_len = read_structure$bl2,
    UMI_len = read_structure$ul,
    outdir = "/output2/",
    bc_anno = "/sample_index/sample1.csv"
)
shokohirosue commented 5 years ago

I just checked md5sum of gene_count.csv from both and they are different - which result should I use?

Thank you.

LuyiTian commented 5 years ago

There might be some different default parameters in sc_count_aligned_bam. The sc_count_aligned_bam should just be a wrapper of the three functions. You can use the result generated by sc_count_aligned_bam.