phoebee-h closed this issue 2 years ago
@jmacdon Do you think you can help with this? Thx
@phoebee-h, the `makeOrgPackage` function is meant to allow people to put an arbitrary set of data into a package, and the `...` argument is meant to be a set of `data.frame`s that contain the data you want to include. So you wouldn't ever really call the function using `...` as an actual argument. This function is really designed for organisms that are not well annotated, and for which people might have their own data they could use.
It appears that sea urchin is pretty well annotated, so you could instead use `makeOrgPackageFromNCBI`, using the correct taxon ID, which is 7668. I just checked one of the files that will be used (the gene2accession.gz file from NCBI), and there are almost 112,000 annotated transcripts, so you should be OK. Please note that this function takes a long time to run because it downloads quite a bit of data, so please run the function in a directory that has the space (and for which you have write privileges).
As to your original question, it's not so simple. These packages are simply wrappers around a SQLite database, and what is being queried is the 'metadata' table. If I query the table directly, using sqlite3, here is what we get:
$ sqlite3 org.Hs.eg.sqlite
SQLite version 3.33.0 2020-08-14 13:23:32
Enter ".help" for usage hints.
sqlite> select * from metadata;
DBSCHEMAVERSION|2.1
Db type|OrgDb
Supporting package|AnnotationDbi
DBSCHEMA|HUMAN_DB
ORGANISM|Homo sapiens
SPECIES|Human
EGSOURCEDATE|2021-Sep13
EGSOURCENAME|Entrez Gene
EGSOURCEURL|ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
CENTRALID|EG
TAXID|9606
GOSOURCENAME|Gene Ontology
GOSOURCEURL|http://current.geneontology.org/ontology/go-basic.obo
GOSOURCEDATE|2021-09-01
GOEGSOURCEDATE|2021-Sep13
GOEGSOURCENAME|Entrez Gene
GOEGSOURCEURL|ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
KEGGSOURCENAME|KEGG GENOME
KEGGSOURCEURL|ftp://ftp.genome.jp/pub/kegg/genomes
KEGGSOURCEDATE|2011-Mar15
GPSOURCENAME|UCSC Genome Bioinformatics (Homo sapiens)
GPSOURCEURL|
GPSOURCEDATE|2021-Jul20
ENSOURCEDATE|2021-Apr13
ENSOURCENAME|Ensembl
ENSOURCEURL|ftp://ftp.ensembl.org/pub/current_fasta
UPSOURCENAME|Uniprot
UPSOURCEURL|http://www.UniProt.org/
UPSOURCEDATE|Wed Sep 15 18:21:59 2021
And you can see that's the same data you get when you type the package name at the R prompt. You could hypothetically add to that metadata table using sqlite (you could even do so from within R using the `RSQLite` package).
But that seems like a lot of work, and I am not sure it's worth it?
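For anyone who does want to try it, a minimal sketch with the sqlite3 command-line tool might look like the following. It assumes the metadata table has two columns called `name` and `value` (worth confirming with `.schema metadata` first), and it works on a throwaway database rather than a real package file. The `KOSOURCENAME` key and its value are made up for illustration.

```shell
# Throwaway database standing in for an org package's .sqlite file.
db="$(mktemp -d)/toy_org.sqlite"

# Assumed layout of the metadata table: two text columns, name and value.
sqlite3 "$db" "CREATE TABLE metadata (name VARCHAR(80) PRIMARY KEY, value VARCHAR(255));"
sqlite3 "$db" "INSERT INTO metadata (name, value) VALUES ('TAXID', '9606');"

# Add a custom row; KOSOURCENAME is a hypothetical key, not a standard one.
sqlite3 "$db" "INSERT INTO metadata (name, value) VALUES ('KOSOURCENAME', 'KEGG Orthology');"

# Read the table back in the same name|value form as the listing above.
metadata_dump="$(sqlite3 "$db" "SELECT * FROM metadata ORDER BY name;")"
printf '%s\n' "$metadata_dump"
```

On a real package you would point `db` at a copy of the package's .sqlite file and skip the `CREATE TABLE` step.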
@jmacdon I am a newbie at making my own orgDB. Thank you for your kind reply and suggestion!
How do you make sure that a certain species is OK to build with `makeOrgPackageFromNCBI`? What do you usually check? I was testing with `makeOrgPackage` so that other info (e.g. KO) could be integrated if the species is not well annotated. (Though it is indeed more complicated to organize the input metadata.)
A timeout message shows up during the process (I've tried several times...). There seems to be some trouble downloading "idmapping_selected.tab.gz", so I think there are two solutions:
(1) Run `wget https://ftp.expasy.org/databases/uniprot/current_release/knowledgebase/idmapping/idmapping_selected.tab.gz --no-check-certificate` before calling `makeOrgPackageFromNCBI` again. However, the function did not recognize the existing file and overwrote it instead, so it timed out again.
(2) Set `options(timeout=10000)`, as mentioned in https://github.com/Bioconductor/AnnotationForge/issues/17, which works fine.
Regarding (1), is there any parameter that makes the function use files downloaded beforehand with wget? Also, can I reuse these files (gene2go.gz, gene2accession.gz...) if I want to generate another orgdb?
Thank you again. Best regards. Phoebe
@phoebee-h
1.) After checking the NCBI taxonomy site for the correct TaxID, you can download the gene2accession.gz file and do something like the following to count the records for that taxon:
$ zcat gene2accession.gz | awk '$1 == 7668 {print}' | wc -l
119579
2.) You can reuse them, so long as you include `rebuildCache = FALSE` as an argument, or if you do the second package within a day.
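If you want to repeat the check in 1.) for several species, the one-liner can be wrapped in a small function. This is only a sketch: `count_taxid_rows` is a name made up here, and the demo runs on a tiny invented file rather than the real gene2accession.gz, so the rows and the count it prints are not real data.

```shell
# Count the rows for one taxon in a gzipped, tab-delimited file
# whose first column is the tax ID (as the awk call above assumes).
count_taxid_rows() {
  taxid="$1"
  file="$2"
  gzip -dc "$file" | awk -F'\t' -v id="$taxid" '$1 == id { n++ } END { print n + 0 }'
}

# Toy stand-in for gene2accession.gz; header and rows are made up for illustration.
toy="$(mktemp -d)/toy_gene2accession.gz"
printf '#tax_id\tGeneID\n7668\t1\n7668\t2\n9606\t3\n' | gzip -c > "$toy"

toy_count="$(count_taxid_rows 7668 "$toy")"
echo "$toy_count"
```

On the real file this would be `count_taxid_rows 7668 gene2accession.gz`. A count of zero suggests the taxon is not covered, or that the TaxID is wrong.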
OK. I got it. Thank you so much for your help.
Hi,
Thank you for your work. I have successfully created the orgdb. ![image](https://user-images.githubusercontent.com/28743573/154026096-122fc75e-8f5b-46a9-864b-c211487cc731.png)
I would like to add a description like the one in "org.Mm.eg.db", which was downloaded directly from Bioconductor, such as this info: ![image](https://user-images.githubusercontent.com/28743573/154026897-dd463149-5359-4afd-ba10-463a5481936a.png)
Is there an alternative way to include those records for the user to key in? In the function documentation, I only see these arguments: ![image](https://user-images.githubusercontent.com/28743573/154027328-b9813fad-34bb-4dab-bbf3-5f1e209f25b5.png)
So my script was like:
Thank you. Best regards. Phoebe.