Hoohm / dropSeqPipe

A SingleCell RNASeq pre-processing snakemake workflow
Creative Commons Attribution Share Alike 4.0 International
147 stars 47 forks source link

Number of Genes/Transcripts wrong in species_plots #37

Closed seb-mueller closed 6 years ago

seb-mueller commented 6 years ago

I have noticed unusual high numbers of genes in the *species_plot_genes.pdf files and wondering if others have the same behaviour? After some debugging I suspect this line to be the culprit:

https://github.com/Hoohm/dropSeqPipe/blob/0fbca71d17458d87026d21b2f11ab071073ff0d8/scripts/plot_species_plot.R#L82

It reads in for example ths dge file: summary/HUMAN/mysample_dge.summary.txt" which is stuctured like this:

...
 ## METRICS CLASS▸·org.broadinstitute.dropseqrna.barnyard.DigitalExpression$DESummary¬
 CELL_BARCODE▸·NUM_GENIC_READS▸NUM_TRANSCRIPTS▸NUM_GENES¬
 GTTAAGCTCAACTCTT▸·90448▸71198▸7749¬
 GTGCTTCTCGGGAGTA▸·86394▸69074▸7319¬
 ACGAGGAGTAGGGACT▸·81724▸64794▸7215¬
...

i.e. having 4 columns, where as only 3 column names are assigned, which then shifts the column names as shown here:

 > head(a)¬
       CELL_BARCODE NUM_GENIC_READS NUM_TRANSCRIPTS NUM_GENES¬
 1 GTTAAGCTCAACTCTT           90448           71198      7749¬
 2 GTGCTTCTCGGGAGTA           86394           69074      7319¬
 3 ACGAGGAGTAGGGACT           81724           64794      7215¬
 4 GACTAACAGGCGCTCT           72773           57258      6978¬
 5 CGGGTCATCGGCGCTA           64090           51103      7039¬
 6 ACGATGTAGCTAGGCA           62087           48764      6677¬
 >   colnames(a)=c("cellBC", "numGenes", "numTranscripts")¬
 > head(a)¬
             cellBC numGenes numTranscripts   NA¬
 1 GTTAAGCTCAACTCTT    90448          71198 7749¬
 2 GTGCTTCTCGGGAGTA    86394          69074 7319¬
 3 ACGAGGAGTAGGGACT    81724          64794 7215¬
 4 GACTAACAGGCGCTCT    72773          57258 6978¬
 5 CGGGTCATCGGCGCTA    64090          51103 7039¬
 6 ACGATGTAGCTAGGCA    62087          48764 6677¬

Unless I'm not mistaken, replacing the line in question to this should fix it:

 colnames(a)=c("cellBC", "numGenicReads", "numGenes", "numTranscripts")

If this bug is confirmed, I'm happy to do the corrections!