`getChromInfoFromEnsembl()` fails with Ensembl 103 for GenomeInfoDb 1.26.2 (Bioc 3.12)

mjsteinbaugh commented 3 years ago

Hi, I'm seeing an error pop up with getChromInfoFromEnsembl() specifically with the new Ensembl 103 release. Here's a minimal reprex:

## Ensembl 102 works as expected.
seqinfo_hs_102 <-
    GenomeInfoDb::getChromInfoFromEnsembl(
        species = "Homo sapiens",
        release = 102L,
        as.Seqinfo = TRUE
    )

## Seqinfo object with 944 sequences (1 circular) from GRCh38.p13 genome:
##   seqnames     seqlengths isCircular     genome
##   KI270510.1         2415       <NA> GRCh38.p13
##   KI270539.1          993       <NA> GRCh38.p13
##   KI270395.1         1143       <NA> GRCh38.p13
##   KI270752.1        27745       <NA> GRCh38.p13
##   KI270388.1         1216       <NA> GRCh38.p13
##   ...                 ...        ...        ...
##   HG1466_PATCH      17435       <NA> GRCh38.p13
##   HG1506_PATCH      28824       <NA> GRCh38.p13
##   HG1507_PATCH      68192       <NA> GRCh38.p13
##   HG439_PATCH      403128       <NA> GRCh38.p13
##   HG1509_PATCH      14678       <NA> GRCh38.p13
## Seqinfo object with 944 sequences (1 circular) from GRCh38.p13 genome:
##   seqnames     seqlengths isCircular     genome
##   KI270510.1         2415       <NA> GRCh38.p13
##   KI270539.1          993       <NA> GRCh38.p13
##   KI270395.1         1143       <NA> GRCh38.p13
##   KI270752.1        27745       <NA> GRCh38.p13
##   KI270388.1         1216       <NA> GRCh38.p13
##   ...                 ...        ...        ...
##   HG1466_PATCH      17435       <NA> GRCh38.p13
##   HG1506_PATCH      28824       <NA> GRCh38.p13
##   HG1507_PATCH      68192       <NA> GRCh38.p13
##   HG439_PATCH      403128       <NA> GRCh38.p13
##   HG1509_PATCH      14678       <NA> GRCh38.p13

## Ensembl 103 is failing due to column parsing.
seqinfo_hs_103 <-
    GenomeInfoDb::getChromInfoFromEnsembl(
        species = "Homo sapiens",
        release = 103L,
        as.Seqinfo = TRUE
    )

Error in (function (file, header = FALSE, sep = "", quote = "\"'", dec = ".",  : 
  more columns than column names
Calls: <Anonymous> ... .simple_read_table -> do.call -> do.call -> <Anonymous>
Backtrace:
    █
 1. └─GenomeInfoDb::getChromInfoFromEnsembl(...)
 2.   └─GenomeInfoDb:::get_Ensembl_FTP_core_db_url(...)
 3.     └─GenomeInfoDb:::.find_core_db_in_Ensembl_FTP_species_index(...)
 4.       └─GenomeInfoDb:::.fetch_species_index_from_url(url)
 5.         └─GenomeInfoDb:::fetch_table_from_url(url, header = TRUE)
 6.           └─GenomeInfoDb:::.simple_read_table(destfile, ...)
 7.             ├─BiocGenerics::do.call(read.table, args)
 8.             ├─base::do.call(read.table, args)
 9.             └─(function (file, header = FALSE, sep = "", quote = "\"'", dec = ".", ...

Is this supported in Bioc Devel? I'm taking a look in the v1.27.7 source code, but wanted to post in case others hit this error with Bioconductor 3.12.

Best, Mike

mjsteinbaugh commented 3 years ago

Relevant code is in Ensembl-utils.R

mjsteinbaugh commented 3 years ago

The issue is Ensembl's species_EnsemblVertebrates.txt file (see ftp://ftp.ensembl.org/pub/release-103/species_EnsemblVertebrates.txt for example), which has wonky formatting.

mjsteinbaugh commented 3 years ago

url <- "ftp://ftp.ensembl.org/pub/release-103/species_EnsemblVertebrates.txt"

## Base R parser chokes on the file.
x <- read.table(
    file = url,
    header = FALSE,
    sep = "\t",
    comment.char = "#"
)

Warning in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  EOF within quoted string
Calls: read.table -> scan
Warning in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  number of items read is not a multiple of the number of columns
Calls: read.table -> scan

## readr may be a good alternative.
x <- readr::read_delim(
    file = url,
    delim = "\t",
    col_names = FALSE,
    comment = "#"
)

##                     X1                          X2                 X3     X4
## 1        Spiny chromis acanthochromis_polyacanthus EnsemblVertebrates  80966
## 2 Eurasian sparrowhawk             accipiter_nisus EnsemblVertebrates 211598
## 3          Giant panda      ailuropoda_melanoleuca EnsemblVertebrates   9646
## 4 Yellow-billed parrot            amazona_collaria EnsemblVertebrates 241587
## 5        Midas cichlid     amphilophus_citrinellus EnsemblVertebrates  61819
## 6    Clown anemonefish        amphiprion_ocellaris EnsemblVertebrates  80972
##                       X5              X6                      X7 X8 X9 X10
## 1            ASM210954v1 GCA_002109545.1 2018-05-Ensembl/2020-03  N  N   N
## 2 Accipiter_nisus_ver1.0 GCA_004320145.1 2019-07-Ensembl/2019-09  N  N   N
## 3            ASM200744v2 GCA_002007445.2 2020-05-Ensembl/2020-06  N  N   N
## 4            ASM394721v1 GCA_003947215.1 2019-07-Ensembl/2019-09  N  N   N
## 5               Midas_v5 GCA_000751415.1 2018-05-Ensembl/2018-07  N  N   N
## 6              AmpOce1.0 GCA_002776465.1 2018-05-Ensembl/2018-07  N  N   N
##   X11 X12 X13                                    X14 X15 X16
## 1   Y   Y   Y acanthochromis_polyacanthus_core_103_1   1  NA
## 2   N   Y   Y             accipiter_nisus_core_103_1   1  NA
## 3   Y   Y   Y      ailuropoda_melanoleuca_core_103_2   1  NA
## 4   N   Y   Y            amazona_collaria_core_103_1   1  NA
## 5   Y   Y   Y     amphilophus_citrinellus_core_103_5   1  NA
## 6   Y   Y   Y        amphiprion_ocellaris_core_103_1   1  NA

## vroom also works.
x <- vroom::vroom(
    file = url,
    delim = "\t",
    col_names = FALSE,
    col_types = vroom::cols(),
    comment = "#"
)

hpages commented 3 years ago

Hi @mjsteinbaugh ,

You only managed to read the file with readr::read_delim() or vroom::vroom() because you skipped the header. Note that skipping the header also "works" with base::read.table():

read.table("species103.txt", sep="\t", quote="", comment.char="#")

(This didn't work for you because you didn't specify quote="".)

Anyways, the file is broken because the data lines have one more column than the header so can no longer be mapped to the header. Ignoring the header doesn't solve anything.

This will need to be reported to the Ensembl folks.

mjsteinbaugh commented 3 years ago

Ah gotcha thanks @hpages . Want me to email Ensembl?

hpages commented 3 years ago

Sure. Would be awesome if you could do that. Thanks!

hpages commented 3 years ago

Hi @mjsteinbaugh, I was wondering if you heard back from the Ensembl folks. Thanks.

mjsteinbaugh commented 3 years ago

@hpages I haven't emailed yet. Funny enough I was just about to do that. I'll CC you on the email once I draft that up.

hpages commented 3 years ago

Sounds good. Thanks for keeping me in the loop. Cheers

lshep commented 3 years ago

@mjsteinbaugh Did you ever get a chance to send the email to Ensembl? This is affecting a few different areas of the project. If you haven't could you considering doing it soon and still CC @hpages and myself on the email it would be greatly appreciated.

mjsteinbaugh commented 3 years ago

Sorry for the delay, I’ve been slammed! I will seriously do that this morning and will CC you as well.

mjsteinbaugh commented 3 years ago

OK I just emailed the Ensembl Help Desk through their web portal.

mjsteinbaugh commented 3 years ago

Just heard back from Ensembl that they're in the process of correcting the issue!

lshep commented 3 years ago

Excellent! thank you

hpages commented 3 years ago

Thanks @mjsteinbaugh

I confirm that the species_EnsemblVertebrates.txt file in Ensembl 103 has been repaired. (The file can be found here and its timestamp is 08-Mar-2021.)

It also introduces a new column so I've modified internal helper fetch_species_index_from_Ensembl_FTP() to handle the new column.

This is in GenomeInfoDb 1.26.4 (release) and 1.27.8 (devel).

Cheers

lshep commented 3 years ago

Thank you @hpages and @mjsteinbaugh !

Bioconductor / GenomeInfoDb

`getChromInfoFromEnsembl()` fails with Ensembl 103 for GenomeInfoDb 1.26.2 (Bioc 3.12) #21