dieterich-lab / scimodom

Sci-ModoM: A quantitative database of transcriptome-wide high-throughput RNA modification sites
https://dieterich-lab.github.io/scimodom/
GNU Affero General Public License v3.0

How to handle inconsistent Ensembl sources/format #113

Closed eboileau closed 2 months ago

eboileau commented 3 months ago


I assumed that for most species, if not all, there would be a single source, and a more or less standard format...

What is the difference between https://ftp.ensembl.org/pub/release-110/gtf/ and https://ftp.ensemblgenomes.ebi.ac.uk/pub/ ? Some species appear in both, but not all, and the version numbering differs, e.g.

https://ftp.ensembl.org/pub/release-110/gtf/saccharomyces_cerevisiae/ vs. https://ftp.ensemblgenomes.ebi.ac.uk/pub/fungi/release-59/gtf/saccharomyces_cerevisiae/ ?

cf. https://www.ensembl.org/index.html, http://bacteria.ensembl.org/index.html, http://fungi.ensembl.org/info/data/ftp/index.html and http://plants.ensembl.org/info/data/ftp/index.html.

Output or error messages.

For now, these were omitted:

562 Escherichia coli E. coli a5b3e1ab
3702 Arabidopsis thaliana A. thaliana aba90094

because I don't know which source and which format to use...

... and we have assemblies for

|  4 | R64-1-1   |    4932 | K9FeTPiZ4abQ | sacCer3  |
|  5 | WBcel235  |    6239 | K9FeTPiZ4abQ | ce11     |

but no annotation, due to

requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://ftp.ensembl.org/pub/release-110/gtf/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.110.chr.gtf.gz

requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://ftp.ensembl.org/pub/release-110/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.110.chr.gtf.gz
eboileau commented 3 months ago

Are biotype definitions consistent across species, or do they also vary?

CDieterich commented 2 months ago

Hi @eboileau - please do not look at ftp.ensemblgenomes... this is a different type of project. We only support what is on main EnsEMBL i.e. ftp.ensembl.org

EnsEMBL supports vertebrates mostly plus model organisms such as yeast, fly and nematode (C. elegans).

Plants and bacteria are not supported. We need to defer this to later.

For C. elegans, only https://ftp.ensembl.org/pub/release-110/gtf/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.110.gtf.gz works (note: no .chr in the file name). These also work: https://ftp.ensembl.org/pub/release-110/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.46.110.chr.gtf.gz and https://ftp.ensembl.org/pub/release-110/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.110.gtf.gz
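The file names above suggest that the `.chr` infix is present for some organisms but not for others. A minimal sketch of one way to cope with that, assuming the naming pattern `<Organism>.<assembly>.<release>[.chr].gtf.gz` seen in these URLs: build both candidates and take the first one that exists. The helper names (`gtf_candidates`, `first_available`, `url_exists`) are hypothetical and not part of scimodom, and `url_exists` assumes the server answers HEAD requests.

```python
from typing import Callable, Iterable, Optional
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

BASE = "https://ftp.ensembl.org/pub"


def gtf_candidates(organism: str, assembly: str, release: int) -> list[str]:
    """Return candidate GTF URLs, the '.chr' variant first."""
    directory = f"{BASE}/release-{release}/gtf/{organism.lower()}"
    # e.g. caenorhabditis_elegans -> Caenorhabditis_elegans
    stem = f"{organism.capitalize()}.{assembly}.{release}"
    return [f"{directory}/{stem}.chr.gtf.gz", f"{directory}/{stem}.gtf.gz"]


def url_exists(url: str) -> bool:
    """Probe a URL with a HEAD request (assumption: the server supports HEAD)."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=30):
            return True
    except (HTTPError, URLError):
        return False


def first_available(
    urls: Iterable[str], exists: Callable[[str], bool] = url_exists
) -> Optional[str]:
    """Return the first URL for which `exists` is true, or None."""
    for url in urls:
        if exists(url):
            return url
    return None
```

For example, `first_available(gtf_candidates("caenorhabditis_elegans", "WBcel235", 110))` would try the `.chr` name first and fall back to the plain name. Passing `exists` explicitly keeps the lookup testable without network access.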

eboileau commented 2 months ago

Ok, for now this is what we do:

  1. Pin the Ensembl REST server to release 110, to match our annotation release.
  2. Also use the chain files for this release, i.e. release-110/assembly_chain instead of current_assembly_chain.
  3. Hard-code a check for yeast and worm to build the GTF file name correctly (but see 4). https://github.com/dieterich-lab/scimodom/blob/3c9e10062f5edbde9ee0aa2770c80f80e56af304/server/src/scimodom/services/annotation/ensembl.py#L139
  4. Temporarily remove yeast, as it has no 3'UTR. We need a more general solution to handle such cases. For now, remove:
# ncbi_taxa.csv
4932    Saccharomyces cerevisiae    S. cerevisiae   fa5d5e2b
# annotation.csv 
4   110 4932    ensembl cp6qKL4t4Wws
# assembly.csv 
4   R64-1-1 sacCer3 4932    K9FeTPiZ4abQ

and patch the database (this should be unproblematic for now, as we have no yeast data so far):

delete from assembly where id = 4;
delete from annotation where id = 4;
delete from ncbi_taxa where id = 4932;
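The three deletes above remove the assembly and annotation rows before the taxonomy row. A minimal sketch of why that order can matter, assuming (the thread does not confirm this) that assembly and annotation reference ncbi_taxa through foreign keys. The schema below is a hypothetical, stripped-down stand-in, and SQLite is used only to keep the example self-contained.

```python
import sqlite3

# Hypothetical minimal schema: only the columns needed for the illustration.
SCHEMA = """
CREATE TABLE ncbi_taxa (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE assembly (
    id INTEGER PRIMARY KEY, name TEXT,
    taxa_id INTEGER NOT NULL REFERENCES ncbi_taxa(id)
);
CREATE TABLE annotation (
    id INTEGER PRIMARY KEY, release INTEGER,
    taxa_id INTEGER NOT NULL REFERENCES ncbi_taxa(id)
);
INSERT INTO ncbi_taxa VALUES (4932, 'Saccharomyces cerevisiae');
INSERT INTO assembly VALUES (4, 'R64-1-1', 4932);
INSERT INTO annotation VALUES (4, 110, 4932);
"""


def demo_connection() -> sqlite3.Connection:
    """Create an in-memory database with foreign-key enforcement enabled."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def remove_yeast(conn: sqlite3.Connection) -> None:
    """Delete referencing rows first, then the taxon, in one transaction."""
    with conn:  # commits on success, rolls back if any statement fails
        conn.execute("DELETE FROM assembly WHERE id = ?", (4,))
        conn.execute("DELETE FROM annotation WHERE id = ?", (4,))
        conn.execute("DELETE FROM ncbi_taxa WHERE id = ?", (4932,))
```

Under this assumed schema, deleting from ncbi_taxa first would fail with an integrity error, while the order used in the patch succeeds.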

We need to come back to this issue at a later time point. We will most likely need to add yeast and bacteria sooner or later.


Another general problem is chain files. The directory https://ftp.ensembl.org/pub/release-110/assembly_chain/ is limited to

caenorhabditis_elegans/
danio_rerio/
homo_sapiens/
mus_musculus/
saccharomyces_cerevisiae/
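Given that listing, a lookup for chain files has to be able to answer "none available". A minimal sketch, assuming the five directories above are the complete release-110 listing; the constant and function names are hypothetical, and the set would need to be refreshed for any other release.

```python
from typing import Optional

CHAIN_BASE = "https://ftp.ensembl.org/pub/release-110/assembly_chain"

# Organisms with a chain-file directory in release 110, as listed above.
CHAIN_ORGANISMS = frozenset(
    {
        "caenorhabditis_elegans",
        "danio_rerio",
        "homo_sapiens",
        "mus_musculus",
        "saccharomyces_cerevisiae",
    }
)


def chain_dir_url(organism: str) -> Optional[str]:
    """Return the chain-file directory URL, or None if none is published."""
    key = organism.strip().lower().replace(" ", "_")
    if key not in CHAIN_ORGANISMS:
        return None
    return f"{CHAIN_BASE}/{key}/"
```

Returning `None` rather than raising lets the caller decide whether a missing chain file is an error or simply means that no liftover is offered for that organism.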

As for biotypes, we need to check in detail how definitions vary between organisms, e.g. using GET info/biotypes/..., and how they differ from this or from our definitions in BIOTYPES (specifications.py).
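A minimal sketch of such a check, assuming the Ensembl REST endpoint `GET info/biotypes/:species` returns a JSON list of records that each carry a `biotype` key. It queries the default server (rest.ensembl.org), which is not pinned to release 110. The function names and the shape of the report are hypothetical, and `reference` stands in for whatever set is derived from BIOTYPES in specifications.py.

```python
import json
from urllib.request import Request, urlopen

REST = "https://rest.ensembl.org"


def fetch_biotypes(species: str) -> set[str]:
    """Fetch the biotype names Ensembl reports for one species."""
    request = Request(
        f"{REST}/info/biotypes/{species}",
        headers={"Content-Type": "application/json"},
    )
    with urlopen(request, timeout=30) as response:
        records = json.load(response)
    # Assumption: each record has a "biotype" key holding the name.
    return {record["biotype"] for record in records}


def compare_biotypes(
    by_species: dict[str, set[str]], reference: set[str]
) -> dict[str, dict[str, set[str]]]:
    """Report, per species, biotypes not shared by all species and
    biotypes absent from the reference set."""
    shared = set.intersection(*by_species.values()) if by_species else set()
    return {
        species: {
            "not_shared": biotypes - shared,
            "missing_from_reference": biotypes - reference,
        }
        for species, biotypes in by_species.items()
    }
```

A call could look like `compare_biotypes({s: fetch_biotypes(s) for s in ("homo_sapiens", "caenorhabditis_elegans")}, reference)`. Keeping the comparison separate from the network call makes it easy to run against cached responses.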