VCCRI / Sierra

Discover differential transcript usage from polyA-captured single cell RNA-seq data
GNU General Public License v3.0

Problems reading genes with a quote (') in the name #37

Closed: GeertvanGeest closed this issue 3 years ago

GeertvanGeest commented 3 years ago

Hi,

I'm using Sierra on single-cell data from Drosophila, and the GTF contains gene symbols with a single quote in the name (e.g. beta'COP).

This causes problems in the function FindPeaks. As a last step, the output table is read back in for filtering, and that produces these warnings:

Warning messages:
1: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  EOF within quoted string
2: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  number of items read is not a multiple of the number of columns

More importantly, the table is truncated: only the first ~4k of ~50k lines are read in.
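For anyone who wants to reproduce this outside Sierra, here is a minimal sketch in R. The file contents and column names are made up for illustration and are not Sierra's actual peak table layout. It shows that with the default `quote = "\"'"` a single apostrophe swallows the rest of the file, while `quote = ""` reads every row.

```r
# Toy tab-separated table (made-up columns) with one apostrophe in a gene symbol.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "Gene\tChr\tStart",
  "Act5C\tX\t100",
  "beta'COP\tX\t200",
  "CG1234\t2L\t300",
  "CG5678\t3R\t400"
), tmp)

# Default quoting: the apostrophe opens a quoted string that never closes,
# so the remaining lines are absorbed into a single field.
default_read <- suppressWarnings(
  read.table(tmp, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
)

# Quoting disabled: ' and " are treated as ordinary characters.
literal_read <- read.table(tmp, header = TRUE, sep = "\t", quote = "",
                           stringsAsFactors = FALSE)

# Compare how many rows each call recovered.
cat("default quote:", nrow(default_read), "rows\n")
cat("quote = \"\":  ", nrow(literal_read), "rows\n")
```

The second call should recover all four data rows, whereas the first one returns fewer.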

I don't think quoted values are ever expected in this table, so I think it is safe to disable quote handling by changing this call (line 724; count_polyA.R)

peak.sites <- read.table(peak.sites.file, header = T, sep = "\t",
                         stringsAsFactors = FALSE)

Into:

peak.sites <- read.table(peak.sites.file, header = T, sep = "\t", quote = '',
                         stringsAsFactors = FALSE)

I've sent a pull request with this change.

GeertvanGeest commented 3 years ago

This obviously affects every function that reads in a peak.sites file, not just FindPeaks.
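One way to cover all of those call sites at once would be a small shared reader. This is only a sketch: `read_peak_sites` is a hypothetical helper name, not an existing Sierra function, and the example table uses made-up columns.

```r
# Hypothetical helper (not part of Sierra): a single place that reads a
# peak.sites table with quote handling switched off.
read_peak_sites <- function(peak.sites.file) {
  read.table(peak.sites.file, header = TRUE, sep = "\t",
             quote = "",               # treat ' and " as ordinary characters
             stringsAsFactors = FALSE)
}

# Usage on a toy file with an apostrophe in a gene symbol.
toy_file <- tempfile(fileext = ".txt")
writeLines(c(
  "Gene\tChr\tStart",
  "beta'COP\tX\t200",
  "CG1234\t2L\t300"
), toy_file)

peak.sites <- read_peak_sites(toy_file)
print(peak.sites$Gene)
```

Routing every read of the file through one function like this would mean the `quote` setting only has to be right in one place.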