Bioconductor / GenomeInfoDb

Utilities for manipulating chromosome names, including modifying them to follow a particular naming style
https://bioconductor.org/packages/GenomeInfoDb
30 stars 13 forks source link

`getChromInfoFromEnsembl` is erroring for some previous Mus musculus releases #98

Closed mjsteinbaugh closed 10 months ago

mjsteinbaugh commented 11 months ago

Hi Bioconductor team,

I noticed that getChromInfoFromEnsembl doesn't correctly support a number of previous Ensembl releases for Mus musculus:

It seems like the oldest supported release is 99.

GenomeInfoDb::getChromInfoFromEnsembl(species = "Mus musculus", release = 99L, as.Seqinfo = TRUE)
Seqinfo object with 258 sequences (1 circular) from GRCm38.p6 genome:
  seqnames     seqlengths isCircular    genome
  GL456174.2     65061235       <NA> GRCm38.p6
  GL456393.1        55711       <NA> GRCm38.p6
  GL456083.2     19433268       <NA> GRCm38.p6
  GL456190.1       293915       <NA> GRCm38.p6
  GL456368.1        20208       <NA> GRCm38.p6
  ...                 ...        ...       ...
  MG3656_PATCH     245130       <NA> GRCm38.p6
  MG4214_PATCH     203377       <NA> GRCm38.p6
  MG3683_PATCH     155470       <NA> GRCm38.p6
  MG3686_PATCH     305464       <NA> GRCm38.p6
  MG104_PATCH      460895       <NA> GRCm38.p6

Releases 95 and older hit this error currently:

GenomeInfoDb::getChromInfoFromEnsembl(species = "Mus musculus", release = 95L, as.Seqinfo = TRUE)
Error in .find_core_db_in_Ensembl_FTP_core_dbs(mysql_url, species, release = release) :
  found 0 or more than 1 subdir for "Mus musculus" species at
  ftp://ftp.ensembl.org/pub/release-95/mysql/
Calls: <Anonymous> ... get_Ensembl_FTP_core_db_url -> .find_core_db_in_Ensembl_FTP_core_dbs
Backtrace:
    ▆
 1. └─GenomeInfoDb::getChromInfoFromEnsembl(...)
 2.   └─GenomeInfoDb:::get_Ensembl_FTP_core_db_url(...)
 3.     └─GenomeInfoDb:::.find_core_db_in_Ensembl_FTP_core_dbs(...)

And releases 96-98 hit this error due to malformed files on the Ensembl FTP server:

GenomeInfoDb::getChromInfoFromEnsembl(species = "Mus musculus", release = 96L, as.Seqinfo = TRUE)
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  line 311 did not have 10 elements
Calls: <Anonymous> ... simple_read_table -> do.call -> do.call -> <Anonymous> -> scan
Backtrace:
     ▆
  1. └─GenomeInfoDb::getChromInfoFromEnsembl(...)
  2.   └─GenomeInfoDb:::.get_chrom_info_from_Ensembl_FTP(...)
  3.     └─GenomeInfoDb:::fetch_seq_regions_from_Ensembl_FTP(...)
  4.       └─GenomeInfoDb:::.fetch_synonyms_from_Ensembl_FTP(...)
  5.         └─GenomeInfoDb:::fetch_table_from_Ensembl_FTP(core_db_url, "external_db")
  6.           └─GenomeInfoDb:::fetch_table_from_url(...)
  7.             └─GenomeInfoDb:::simple_read_table(destfile, ...)
  8.               ├─BiocGenerics::do.call(read.table, args)
  9.               ├─base::do.call(read.table, args)
 10.               └─utils (local) `<fn>`(...)
 11.                 └─base::scan(...)

Is some of this fixable in GenomeInfoDb, or do we need to file a bug report with Ensembl instead?

Best, Mike

hpages commented 10 months ago

Thanks for the report.

Looks like read.table() chokes on tabulated file ftp://ftp.ensembl.org/pub/release-98/mysql/mus_musculus_core_98_38/external_db.txt.gz because the file contains inline \r characters (carriage returns):

curl ftp://ftp.ensembl.org/pub/release-99/mysql/mus_musculus_core_99_38/external_db.txt.gz >external_db.99.txt.gz
gunzip external_db.99.txt.gz
grep -P '\r' external_db.99.txt  # no output

curl ftp://ftp.ensembl.org/pub/release-98/mysql/mus_musculus_core_98_38/external_db.txt.gz >external_db.98.txt.gz
gunzip external_db.98.txt.gz
grep -P '\r' external_db.98.txt  # 2 lines of output!

Then from R:

extdbs99 <- read.table("external_db.99.txt", header=FALSE, sep="\t", quote="", comment.char="")
dim(extdbs99)
# [1] 446  10

extdbs98 <- read.table("external_db.98.txt", header=FALSE, sep="\t", quote="", comment.char="")
# Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
#   line 311 did not have 10 elements

FWIW I can see the \r characters in the vim editor: they are displayed as blue ^M's.

Now I'm not sure what the best course of action is. Notifying Ensembl that they produce db dumps that are hard to parse wouldn't hurt. However, I wouldn't be surprised if they knew that already because they've fixed the problem starting with release 99. It's just that they probably don't bother fixing past releases, which is understandable.

Interestingly it seems that data.table::fread() can swallow these \r characters without problem:

library(data.table)
extdbs98 <- fread("external_db.98.txt")
dim(extdbs98)
# [1] 446  10

but I'm hesitant to introduce a dependency on the data.table package. At least I would like to avoid making this a hard dep so I'm going to explore the possibility of making this a soft dep instead (i.e. listed in Suggests only).

Another possibility would be to clean the tab file before calling read.table() on it but I'm not sure how to do this in R. Also I don't want to add C code to GenomeInfoDb just for that.

:thinking:

Suggestions welcome.

H.

mjsteinbaugh commented 10 months ago

That's a great find with grep -P '\r' external_db.98.txt. I figured it was something like that but couldn't figure it out. I dug into how to clean malformed files and pass it in to base R read.table recently. You can do this by creating a textConnection. Let me see if I can make a minimal example of this with the Ensembl file mentioned above.

Update: textConnection approach is no good. We hit the issue attempting to read any source code lines first. data.table does some internal C jiu jitsu before it reads the lines to handle the \r character.

download.file(
    url = "ftp://ftp.ensembl.org/pub/release-98/mysql/mus_musculus_core_98_38/external_db.txt.gz",
    destfile = "external_db.98.txt.gz"
)

## This step messes up on the `\r` characters, and returns length 449 instead
## of expected 446. So this won't work.
lines <- readLines("external_db.98.txt.gz")
print(length(lines))
## 449

## Normally you can create a `textConnection` to pass in corrected source lines.
## But the `\r` carriage character is a tricky edge case that messes up the
## number of source lines.
con <- textConnection(lines)
df <- read.table(
    file = con,
    header = FALSE,
    sep = "\t",
    quote = "",
    comment.char = ""
)
## Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
##   line 311 did not have 10 elements
## Calls: read.table -> scan

print(lines[[1L]])
## "211\tVB_Community_Symbol\t\\N\tKNOWN\t50\tVB Community Symbol\tMISC\t\\N\t\\N\t\\N"

print(lines[[311L]])
## [1] "\\nMore information in there:"
hpages commented 10 months ago

Addressed in GeomeInfoDb 1.38.3 (release) and 1.39.3 (devel). See commit fe81f0d423f07efabb6e61d6c57620171a6b75b7.

Please let me know how that works for you.