Closed mjsteinbaugh closed 3 years ago
Relevant code is in Ensembl-utils.R
The issue is Ensembl's species_EnsemblVertebrates.txt
file (see ftp://ftp.ensembl.org/pub/release-103/species_EnsemblVertebrates.txt
for example), which has wonky formatting.
url <- "ftp://ftp.ensembl.org/pub/release-103/species_EnsemblVertebrates.txt"
## Base R parser chokes on the file.
x <- read.table(
file = url,
header = FALSE,
sep = "\t",
comment.char = "#"
)
Warning in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
EOF within quoted string
Calls: read.table -> scan
Warning in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
number of items read is not a multiple of the number of columns
Calls: read.table -> scan
## readr may be a good alternative.
x <- readr::read_delim(
file = url,
delim = "\t",
col_names = FALSE,
comment = "#"
)
## X1 X2 X3 X4
## 1 Spiny chromis acanthochromis_polyacanthus EnsemblVertebrates 80966
## 2 Eurasian sparrowhawk accipiter_nisus EnsemblVertebrates 211598
## 3 Giant panda ailuropoda_melanoleuca EnsemblVertebrates 9646
## 4 Yellow-billed parrot amazona_collaria EnsemblVertebrates 241587
## 5 Midas cichlid amphilophus_citrinellus EnsemblVertebrates 61819
## 6 Clown anemonefish amphiprion_ocellaris EnsemblVertebrates 80972
## X5 X6 X7 X8 X9 X10
## 1 ASM210954v1 GCA_002109545.1 2018-05-Ensembl/2020-03 N N N
## 2 Accipiter_nisus_ver1.0 GCA_004320145.1 2019-07-Ensembl/2019-09 N N N
## 3 ASM200744v2 GCA_002007445.2 2020-05-Ensembl/2020-06 N N N
## 4 ASM394721v1 GCA_003947215.1 2019-07-Ensembl/2019-09 N N N
## 5 Midas_v5 GCA_000751415.1 2018-05-Ensembl/2018-07 N N N
## 6 AmpOce1.0 GCA_002776465.1 2018-05-Ensembl/2018-07 N N N
## X11 X12 X13 X14 X15 X16
## 1 Y Y Y acanthochromis_polyacanthus_core_103_1 1 NA
## 2 N Y Y accipiter_nisus_core_103_1 1 NA
## 3 Y Y Y ailuropoda_melanoleuca_core_103_2 1 NA
## 4 N Y Y amazona_collaria_core_103_1 1 NA
## 5 Y Y Y amphilophus_citrinellus_core_103_5 1 NA
## 6 Y Y Y amphiprion_ocellaris_core_103_1 1 NA
## vroom also works.
x <- vroom::vroom(
file = url,
delim = "\t",
col_names = FALSE,
col_types = vroom::cols(),
comment = "#"
)
Hi @mjsteinbaugh ,
You only managed to read the file with readr::read_delim()
or vroom::vroom()
because you skipped the header. Note that skipping the header also "works" with base::read.table()
:
read.table("species103.txt", sep="\t", quote="", comment.char="#")
(This didn't work for you because you didn't specify quote=""
.)
Anyways, the file is broken because the data lines have one more column than the header so can no longer be mapped to the header. Ignoring the header doesn't solve anything.
This will need to be reported to the Ensembl folks.
Ah gotcha thanks @hpages . Want me to email Ensembl?
Sure. Would be awesome if you could do that. Thanks!
Hi @mjsteinbaugh, I was wondering if you heard back from the Ensembl folks. Thanks.
@hpages I haven't emailed yet. Funny enough I was just about to do that. I'll CC you on the email once I draft that up.
Sounds good. Thanks for keeping me in the loop. Cheers
@mjsteinbaugh Did you ever get a chance to send the email to Ensembl? This is affecting a few different areas of the project. If you haven't could you considering doing it soon and still CC @hpages and myself on the email it would be greatly appreciated.
Sorry for the delay, I’ve been slammed! I will seriously do that this morning and will CC you as well.
OK I just emailed the Ensembl Help Desk through their web portal.
Just heard back from Ensembl that they're in the process of correcting the issue!
Excellent! thank you
Thanks @mjsteinbaugh
I confirm that the species_EnsemblVertebrates.txt
file in Ensembl 103 has been repaired. (The file can be found here and its timestamp is 08-Mar-2021.)
It also introduces a new column so I've modified internal helper fetch_species_index_from_Ensembl_FTP()
to handle the new column.
This is in GenomeInfoDb 1.26.4 (release) and 1.27.8 (devel).
Cheers
Thank you @hpages and @mjsteinbaugh !
Hi, I'm seeing an error pop up with
getChromInfoFromEnsembl()
specifically with the new Ensembl 103 release. Here's a minimal reprex:Is this supported in Bioc Devel? I'm taking a look in the v1.27.7 source code, but wanted to post in case others hit this error with Bioconductor 3.12.
Best, Mike