Bioconductor / AnnotationForge

Tools for building SQLite-based annotation data packages
https://bioconductor.org/packages/AnnotationForge

How to add description with makeOrgPackage #29

Closed phoebee-h closed 2 years ago

phoebee-h commented 2 years ago

Hi,

Thank you for your work. I have successfully created the OrgDb package. [image]

I would like to add a description like the one shown for "org.Mm.eg.db", which was downloaded directly from Bioconductor, with information such as this: [image]

Is there a way for the user to supply those records? In the function documentation, I only see these arguments: [image]

My script looked like this:

AnnotationForge::makeOrgPackage(...,
    version="1.0",
    maintainer="PHOEBE <xxx@xxx.com>",
    author="PHOEBE <xxx@xxx.com>",
    outputDir = ".",
    tax_id="7668",
    genus="Strongylocentrotus",
    species="purpuratus",
    goTable="go"
    )

Thank you. Best regards. Phoebe.

hpages commented 2 years ago

@jmacdon Do you think you can help with this? Thx

jmacdon commented 2 years ago

@phoebee-h, the makeOrgPackage function is meant to allow people to put an arbitrary set of data into a package, and the ... argument is meant to be a set of data.frames that contain the data you want to include. So you wouldn't ever really call the function using ... as an actual argument. This function is really designed for organisms that are not well annotated, and for which people might have their own data they could use.

It appears that sea urchin is pretty well annotated, so you could instead use makeOrgPackageFromNCBI, using the correct taxon ID, which is 7668. I just checked one of the files that will be used (the gene2accession.gz file from NCBI), and there are almost 112,000 annotated transcripts, so you should be OK. Please note that this function takes a long time to run because it downloads quite a bit of data, so please run the function in a directory that has the space (and for which you have write privileges).
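Before starting a long download, it can help to confirm that the target directory is writable and to see how much space is free. The following is a minimal shell sketch; `check_workdir` is a made-up helper name, and no particular space requirement is implied.

```shell
#!/bin/sh
# check_workdir is a hypothetical helper, not part of AnnotationForge.
check_workdir() {
    dir=${1:-.}
    # Fail early if the current user cannot write to the directory.
    if [ ! -w "$dir" ]; then
        echo "not writable: $dir" >&2
        return 1
    fi
    # df -P prints one line per filesystem; column 4 is the free space in KB.
    free_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    echo "writable: $dir (${free_kb} KB free)"
}

# Check the directory the R session will be started from.
check_workdir .
```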

As to your original question, it's not so simple. These packages are simply wrappers around a SQLite database, and what is being queried is the 'metadata' table. If I query the table directly, using sqlite3, here is what we get:

$ sqlite3 org.Hs.eg.sqlite
SQLite version 3.33.0 2020-08-14 13:23:32
Enter ".help" for usage hints.

sqlite> select * from metadata;
DBSCHEMAVERSION|2.1
Db type|OrgDb
Supporting package|AnnotationDbi
DBSCHEMA|HUMAN_DB
ORGANISM|Homo sapiens
SPECIES|Human
EGSOURCEDATE|2021-Sep13
EGSOURCENAME|Entrez Gene
EGSOURCEURL|ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
CENTRALID|EG
TAXID|9606
GOSOURCENAME|Gene Ontology
GOSOURCEURL|http://current.geneontology.org/ontology/go-basic.obo
GOSOURCEDATE|2021-09-01
GOEGSOURCEDATE|2021-Sep13
GOEGSOURCENAME|Entrez Gene
GOEGSOURCEURL|ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
KEGGSOURCENAME|KEGG GENOME
KEGGSOURCEURL|ftp://ftp.genome.jp/pub/kegg/genomes
KEGGSOURCEDATE|2011-Mar15
GPSOURCENAME|UCSC Genome Bioinformatics (Homo sapiens)
GPSOURCEURL|
GPSOURCEDATE|2021-Jul20
ENSOURCEDATE|2021-Apr13
ENSOURCENAME|Ensembl
ENSOURCEURL|ftp://ftp.ensembl.org/pub/current_fasta
UPSOURCENAME|Uniprot
UPSOURCEURL|http://www.UniProt.org/
UPSOURCEDATE|Wed Sep 15 18:21:59 2021

And you can see that's the same data you get when you type the package name at the R prompt. You could hypothetically add to that metadata table using sqlite3 (you could even do so from within R using the RSQLite package).

But that seems like a lot of work, and I am not sure it's worth it?
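For anyone who does want to try it, here is a minimal sketch using the sqlite3 command-line tool. It works on a scratch database rather than a real package, and it assumes the metadata table has two text columns called name and value (run `.schema metadata` against a real file to confirm). The NOTE key and its text are made up for illustration.

```shell
#!/bin/sh
# Sketch only: uses a scratch database so no real package file is modified.
# Assumption: metadata has two text columns, name and value.
set -eu

db=$(mktemp)   # stands in for a real org.*.sqlite file
trap 'rm -f "$db"' EXIT

# Build a stand-in metadata table modelled on the listing above.
sqlite3 "$db" <<'SQL'
CREATE TABLE metadata (name TEXT PRIMARY KEY, value TEXT);
INSERT INTO metadata VALUES ('Db type', 'OrgDb');
INSERT INTO metadata VALUES ('TAXID', '7668');
SQL

# Add a custom record; the key and value here are hypothetical.
sqlite3 "$db" "INSERT INTO metadata (name, value) VALUES ('NOTE', 'built locally');"

# Print the table in the same name|value layout as the listing above.
sqlite3 "$db" "SELECT * FROM metadata;"
```

Trying the statements on a copy first keeps the original database intact if something goes wrong.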

phoebee-h commented 2 years ago

@jmacdon I am new to making my own OrgDb. Thank you for your kind reply and suggestions!

  1. How do you decide whether a given species is suitable for "makeOrgPackageFromNCBI"? What do you usually check? I was testing makeOrgPackage so that other information (e.g. KO) could be integrated if the species is not well annotated, though organizing the input metadata is indeed more complicated.

  2. The process shows a timeout message (I have tried several times); there seems to be trouble downloading "idmapping_selected.tab.gz". I can think of two solutions: (1) Run wget https://ftp.expasy.org/databases/uniprot/current_release/knowledgebase/idmapping/idmapping_selected.tab.gz --no-check-certificate before calling makeOrgPackageFromNCBI. However, the function does not recognize the existing file and overwrites it instead, so it times out again. (2) Set options(timeout=10000), as mentioned in https://github.com/Bioconductor/AnnotationForge/issues/17, which works fine.

Regarding (1), is there a parameter that lets the function use files downloaded beforehand with wget? Also, can I reuse these files (gene2go.gz, gene2accession.gz...) if I want to generate another OrgDb?


Thank you again. Best regards. Phoebe

jmacdon commented 2 years ago

@phoebee-h

1.) You can download the gene2accession.gz file and do something like

$ zcat gene2accession.gz | awk '$1 == 7668 {print}' | wc -l
119579

Do this after checking the NCBI Taxonomy site for the correct TaxID.

2.) You can reuse them, so long as you include rebuildCache = FALSE as an argument, or if you build the second package within a day.
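The count from point 1 can also be wrapped in a small function. This sketch sets the field separator to a tab and passes the TaxID in as a variable. Like the command above, it assumes the TaxID sits in the first column. `count_taxid` and the sample rows are made up so the example runs without the real download.

```shell
#!/bin/sh
# Count rows for one taxon in a gzipped, tab-delimited file.
# Assumption: the taxonomy ID is in the first column.
set -eu

# count_taxid is a hypothetical helper: $1 = path to .gz file, $2 = TaxID.
count_taxid() {
    gzip -dc "$1" | awk -F '\t' -v id="$2" '$1 == id { n++ } END { print n + 0 }'
}

# Tiny made-up sample standing in for gene2accession.gz.
sample=$(mktemp)
printf '#tax_id\tGeneID\tstatus\n7668\t100\tMODEL\n7668\t101\tMODEL\n9606\t1\tREVIEWED\n' \
    | gzip -c > "$sample"

count_taxid "$sample" 7668   # prints 2 for the sample above
rm -f "$sample"
```

A count of 0 for a TaxID would suggest falling back to makeOrgPackage with your own data.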

phoebee-h commented 2 years ago

OK. I got it. Thank you so much for your help.