Bioconductor / txdbmaker

A set of tools for making TxDb objects from genomic annotations from various sources (e.g. UCSC, Ensembl, and GFF files)

Apply filter with use.grch37 for Ensembl URLs #6

Closed LiNk-NY closed 2 months ago

LiNk-NY commented 2 months ago

It seems like some of the resources (homo_sapiens) from ensembl.org have a build number appended to the end of them:

https://ftp.ensembl.org/pub/release-112/mysql/homo_sapiens_core_112_38/
https://ftp.ensembl.org/pub/release-112/mysql/homo_sapiens_core_112_37/

This is causing an error in OrganismDbi: https://bioconductor.org/checkResults/devel/bioc-LATEST/OrganismDbi/nebbiolo2-checksrc.html

odb <- makeOrganismDbFromBiomart(transcript_ids=transcript_ids)
#' Download and preprocess the 'transcripts' data frame ... OK
#' Download and preprocess the 'chrominfo' data frame ... FAILED! (=> skipped)
#' Error in S4Vectors:::extract_data_frame_rows(chrominfo, keep_idx) : 
#'   is.data.frame(x) is not TRUE
#' Calls: makeOrganismDbFromBiomart -> makeTxDbFromBiomart -> <Anonymous> -> stopifnot
#' Execution halted

This PR is a patch, but the underlying problem may be an error on the ensembl.org side. I would expect GRCh37 files to show up only under https://ftp.ensembl.org/pub/grch37/ and not in the latest release (112).

Best, Marcel

hpages commented 2 months ago

Thanks Marcel.

Don't know why they did that, since ftp.ensembl.org/pub/release-112/mysql/homo_sapiens_core_112_37/ and ftp.ensembl.org/pub/grch37/release-112/mysql/homo_sapiens_core_112_37/ have identical content, and the latter has always been the canonical location for the GRCh37 data.

Doing core_dir <- grep("37$", core_dir, invert=!use.grch37, value=TRUE) assumes that no other organism will ever have the _37 suffix, which I don't think we can safely assume. Also, I think we shouldn't do anything when use.grch37 is TRUE, because in that case there should never be a need to filter anything out.

So I'd be more comfortable with something like:

## Starting with Ensembl 112, ftp.ensembl.org/pub/release-<version>/mysql/
## contains two core subdirs for homo_sapiens: homo_sapiens_core_<version>_38/
## and homo_sapiens_core_<version>_37/. We filter out the latter.
if (!use.grch37)
    core_dir <- grep("^homo_sapiens_core_.*_37$", core_dir, invert=TRUE, value=TRUE)
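
To make the difference between the two patterns concrete, here is a minimal sketch run on a made-up directory listing. The entry some_species_core_112_37 and the helper drop_grch37_core_dir() are hypothetical, invented for illustration only; they are not part of txdbmaker or of the Ensembl FTP layout.

```r
## Made-up listing of core subdirs. "some_species_core_112_37" is an
## invented name, used only to show the over-matching risk.
core_dirs <- c("homo_sapiens_core_112_38",
               "homo_sapiens_core_112_37",
               "some_species_core_112_37")

## Hypothetical helper: drop the GRCh37 human core dir unless the caller
## asked for GRCh37, in which case nothing is filtered.
drop_grch37_core_dir <- function(core_dir, use.grch37=FALSE)
{
    if (use.grch37)
        return(core_dir)
    grep("^homo_sapiens_core_.*_37$", core_dir, invert=TRUE, value=TRUE)
}

## The unanchored pattern also drops the unrelated organism.
broad <- grep("37$", core_dirs, invert=TRUE, value=TRUE)

## The anchored pattern only drops homo_sapiens_core_<version>_37.
anchored <- drop_grch37_core_dir(core_dirs)

print(broad)
print(anchored)
```

With this input, the unanchored "37$" pattern keeps only the _38 human directory, while the anchored pattern also keeps the unrelated organism, which seems to be the safer behavior.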

Thanks again, H.

LiNk-NY commented 2 months ago

Thanks for taking a look, Hervé (@hpages). I've updated the code based on your comment. -Marcel