Closed mjsteinbaugh closed 10 months ago
Thanks for the report.
Looks like read.table()
chokes on tabulated file ftp://ftp.ensembl.org/pub/release-98/mysql/mus_musculus_core_98_38/external_db.txt.gz
because the file contains inline \r
characters (carriage returns):
curl ftp://ftp.ensembl.org/pub/release-99/mysql/mus_musculus_core_99_38/external_db.txt.gz >external_db.99.txt.gz
gunzip external_db.99.txt.gz
grep -P '\r' external_db.99.txt # no output
curl ftp://ftp.ensembl.org/pub/release-98/mysql/mus_musculus_core_98_38/external_db.txt.gz >external_db.98.txt.gz
gunzip external_db.98.txt.gz
grep -P '\r' external_db.98.txt # 2 lines of output!
Then from R:
extdbs99 <- read.table("external_db.99.txt", header=FALSE, sep="\t", quote="", comment.char="")
dim(extdbs99)
# [1] 446 10
extdbs98 <- read.table("external_db.98.txt", header=FALSE, sep="\t", quote="", comment.char="")
# Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
# line 311 did not have 10 elements
FWIW I can see the \r
characters in the vim
editor: they are displayed as blue ^M
's.
Now I'm not sure what the best course of action is. Notifying Ensembl that they produce db dumps that are hard to parse wouldn't hurt. However, I wouldn't be surprised if they knew that already because they've fixed the problem starting with release 99. It's just that they probably don't bother fixing past releases, which is understandable.
Interestingly it seems that data.table::fread()
can swallow these \r
characters without problem:
library(data.table)
extdbs98 <- fread("external_db.98.txt")
dim(extdbs98)
# [1] 446 10
but I'm hesitant to introduce a dependency on the data.table package. At least I would like to avoid making this a hard dep so I'm going to explore the possibility of making this a soft dep instead (i.e. listed in Suggests
only).
Another possibility would be to clean the tab file before calling read.table()
on it but I'm not sure how to do this in R. Also I don't want to add C code to GenomeInfoDb just for that.
:thinking:
Suggestions welcome.
H.
That's a great find with grep -P '\r' external_db.98.txt
. I figured it was something like that but couldn't figure it out. I dug into how to clean malformed files and pass it in to base R read.table
recently. You can do this by creating a textConnection
. Let me see if I can make a minimal example of this with the Ensembl file mentioned above.
Update: textConnection
approach is no good. We hit the issue attempting to read any source code lines first. data.table does some internal C jiu jitsu before it reads the lines to handle the \r
character.
download.file(
url = "ftp://ftp.ensembl.org/pub/release-98/mysql/mus_musculus_core_98_38/external_db.txt.gz",
destfile = "external_db.98.txt.gz"
)
## This step messes up on the `\r` characters, and returns length 449 instead
## of expected 446. So this won't work.
lines <- readLines("external_db.98.txt.gz")
print(length(lines))
## 449
## Normally you can create a `textConnection` to pass in corrected source lines.
## But the `\r` carriage character is a tricky edge case that messes up the
## number of source lines.
con <- textConnection(lines)
df <- read.table(
file = con,
header = FALSE,
sep = "\t",
quote = "",
comment.char = ""
)
## Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
## line 311 did not have 10 elements
## Calls: read.table -> scan
print(lines[[1L]])
## "211\tVB_Community_Symbol\t\\N\tKNOWN\t50\tVB Community Symbol\tMISC\t\\N\t\\N\t\\N"
print(lines[[311L]])
## [1] "\\nMore information in there:"
Addressed in GeomeInfoDb 1.38.3 (release) and 1.39.3 (devel). See commit fe81f0d423f07efabb6e61d6c57620171a6b75b7.
Please let me know how that works for you.
Hi Bioconductor team,
I noticed that
getChromInfoFromEnsembl
doesn't correctly support a number of previous Ensembl releases for Mus musculus:It seems like the oldest supported release is 99.
Releases 95 and older hit this error currently:
And releases 96-98 hit this error due to malformed files on the Ensembl FTP server:
Is some of this fixable in GenomeInfoDb, or do we need to file a bug report with Ensembl instead?
Best, Mike