Hoohm / dropSeqPipe

A SingleCell RNASeq pre-processing snakemake workflow
Creative Commons Attribution Share Alike 4.0 International
147 stars 47 forks source link

Changing UMI-edit-distance has seemingly no effect #53

Closed seb-mueller closed 5 years ago

seb-mueller commented 6 years ago

Hi Patrick,

Trying to play around the impact of varying the Edit (Hamming) distance I've run the same data set with different UMI-edit-distance settings in the config.yaml. Not sure what to expect really I was still surprised that non of the cell-barcode counts have changed in the *.dge.summary.txt files. Since the Edit distance is passed on to drop-seq tools and to exclude any problems with dropseqPipe, I've tried to run DigitalExpression from version 1.13 directly with varying edit distances as follows (using the macosko dataset, but results were same for other sets as well):

sample=mac_1000_SRR1748411
for EDIT in $(seq 0 3); do
   echo $EDIT
   ~/software/Drop-seq_tools-1.13/DigitalExpression \
     SUMMARY=summary/${sample}_dge.summary.txt.edit${EDIT} \
     OUTPUT=summary/${sample}_umi_expression_matrix.tsv.edit${EDIT} \
     INPUT=data/${sample}_final.bam \
     EDIT_DISTANCE=$EDIT \
     MIN_BC_READ_THRESHOLD=1 \
     NUM_CORE_BARCODES=87 \
     OUTPUT_READS_INSTEAD=false \
     CELL_BARCODE_TAG=XC \
     MOLECULAR_BARCODE_TAG=XM \
     GENE_EXON_TAG=GE \
     STRAND_TAG=GS \
     READ_MQ=10 \
     USE_STRAND_INFO=true \
     RARE_UMI_FILTER_THRESHOLD=0.0 \
     VERBOSITY=INFO \
     QUIET=false \
     VALIDATION_STRINGENCY=STRICT \
     COMPRESSION_LEVEL=5 \
     MAX_RECORDS_IN_RAM=500000 \
     CREATE_INDEX=false \
     CREATE_MD5_FILE=false \
     GA4GH_CLIENT_SECRETS=client_secrets.json
done

The results are showing the first barcodes for brevity:

mac_1000_SRR1748411_dge.summary.txt.edit3

## htsjdk.samtools.metrics.StringHeader
# org.broadinstitute.dropseqrna.barnyard.DigitalExpression SUMMARY=summary/mac_1000_SRR1748411_dge.summary.txt.edit3 OUTPUT_READS_INSTEAD=false OUTPUT=summary/mac_1000_SRR1748411_umi_expression_matrix.tsv.edit3 INPUT=data/mac_1000_SRR1748411_final.bam CELL_BARCODE_TAG=XC MOLECULAR_BARCODE_TAG=XM GENE_EXON_TAG=GE STRAND_TAG=GS EDIT_DISTANCE=3 READ_MQ=10 MIN_BC_READ_THRESHOLD=1 NUM_CORE_BARCODES=87 USE_STRAND_INFO=true RARE_UMI_FILTER_THRESHOLD=0.0 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json   
## htsjdk.samtools.metrics.StringHeader
# Started on: Wed Oct 10 17:31:43 BST 2018

## METRICS CLASS    org.broadinstitute.dropseqrna.barnyard.DigitalExpression$DESummary
CELL_BARCODE    NUM_GENIC_READS NUM_TRANSCRIPTS NUM_GENES
CGTTCTCTCCCC    418305  78517   11974
GTTTTGAGCGAT    339447  73489   16604
TTGCCGTGGAGT    262520  56767   10218
GCGACGACTGCC    274855  55699   10219
CAACGCATCTGA    288624  55612   10559
GCGTTGTCTTTC    254620  54360   9817
...

mac_1000_SRR1748411_dge.summary.txt.edit0

## htsjdk.samtools.metrics.StringHeader
# org.broadinstitute.dropseqrna.barnyard.DigitalExpression SUMMARY=summary/mac_1000_SRR1748411_dge.summary.txt.edit0 OUTPUT_READS_INSTEAD=false OUTPUT=summary/mac_1000_SRR1748411_umi_expression_matrix.tsv.edit0 INPUT=data/mac_1000_SRR1748411_final.bam CELL_BARCODE_TAG=XC MOLECULAR_BARCODE_TAG=XM GENE_EXON_TAG=GE STRAND_TAG=GS EDIT_DISTANCE=0 READ_MQ=10 MIN_BC_READ_THRESHOLD=1 NUM_CORE_BARCODES=87 USE_STRAND_INFO=true RARE_UMI_FILTER_THRESHOLD=0.0 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json   
## htsjdk.samtools.metrics.StringHeader
# Started on: Wed Oct 10 15:23:00 BST 2018

## METRICS CLASS    org.broadinstitute.dropseqrna.barnyard.DigitalExpression$DESummary
CELL_BARCODE    NUM_GENIC_READS NUM_TRANSCRIPTS NUM_GENES
CGTTCTCTCCCC    418305  174290  11974
GTTTTGAGCGAT    339447  126206  16604
TTGCCGTGGAGT    262520  109009  10218
GCGACGACTGCC    274855  108420  10219
CAACGCATCTGA    288624  107455  10559
GACTACCAGAGT    295351  107401  10132
GCGTTGTCTTTC    254620  102031  9817

As of the excerpts above, an edit distance of 0 and 3 gave the same counts per barcode (in fact for the complete list also for distance 1 and 2, which I could send). I find that a rather unlikely result and was wondering if you had similar experiences, maybe something is wrong with the drop-seq-tools or I'm missing something obvious?

Also, the NUM_GENIC_READS and NUM_GENES don't change, however the NUM_TRANSCRIPT does which is odd.

Also note, the knee_plots (which I thought to have changed in the first place) are based on logs/{sample}_hist_out_cell.txt which I thought should also be affected by the Edit-distance, but they don't seem to have it as parameter (only the dge.summary.txt generation rule has it as input parameter):

https://github.com/Hoohm/dropSeqPipe/blob/b91fdce7f84ff941f5894c611d7e32a9772f0b70/rules/map.smk#L135

seb-mueller commented 6 years ago

Just realized that my whole assumption was based on the Hamming distance of barcodes, which I subconsciously assumed, however clearly UMI-edit-distance refers to UMIs obviously. My bad, sorry for the confustion.

However, having done all this test, it's still strange that the NUM_GENES didn't change wheres NUM_TRANSCRIPT more than halved going from 0 to 3 (see above).

I'll still this issue open for a little just in case someone has an explanation.