Closed arcolombo closed 8 years ago
Successful rebuild from a FASTA file given with a path name, testing the basename addition to the TxDbLite calls. N.B. look at the source_file field.
EnsDbLite(ens)
EnsDbLite :
|package_name: EnsDbLite.Hsapiens.81
|db_type: EnsDbLite
|type_of_gene_id: Ensembl Gene ID
|created_by: TxDbLite 1.9.113
|creation_time: Fri Aug 19 14:26:41 2016
|organism: Homo sapiens
|genome_build: GRCh38
|source_file: ~/Documents/github_repos/arkasData/inst/extdata/fasta/Homo_sapiens.GRCh38.81.cdna.all.fa.gz
| 175372 transcripts from 38530 bundles (genes).
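For anyone following along, here is a minimal base-R sketch of what taking the basename of that FASTA path gives. It only uses `basename()` and `strsplit()`; the variable names are made up for illustration, and this is not the actual TxDbLite code, just a guess at why stripping the directory matters before the file name is parsed.

```r
# Illustration only: base R, not the TxDbLite internals.
# Path copied from the source_file field above.
fasta <- "~/Documents/github_repos/arkasData/inst/extdata/fasta/Homo_sapiens.GRCh38.81.cdna.all.fa.gz"

# basename() drops the directory part and keeps only the file name.
stub <- basename(fasta)
print(stub)

# Splitting on literal dots exposes the organism, build and release tokens.
# (`tokens` is a hypothetical helper variable for this sketch.)
tokens <- strsplit(stub, ".", fixed = TRUE)[[1]]
print(tokens[1:3])
```

With the directory removed, the first three tokens are the organism, genome build and release, which match the organism, genome_build and package_name fields printed above.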
Successful rebuild of the H. sapiens RepBase 20_05 package.
rep
[1] "RepDbLite.Hsapiens.2005.sqlite"
RepDbLite(rep)
RepDbLite :
|package_name: RepDbLite.Hsapiens.2005
|db_type: RepDbLite
|type_of_gene_id: RepBase identifiers
|created_by: TxDbLite 1.9.113
|creation_time: Fri Aug 19 14:30:55 2016
|organism: Homo sapiens
|genome_build: RepBase20_05
|source_file: Homo_sapiens.RepBase.20_05.merged.fa.gz
| 1116 repeat exemplars from 68 repeat families (no known genes).
makeRepDbLitePkg(rep)
Creating package in ./RepDbLite.Hsapiens.2005
[1] "RepDbLite.Hsapiens.2005"
library(RepDbLite.Hsapiens.2005)
RepDbLite.Hsapiens.2005
RepDbLite :
|package_name: RepDbLite.Hsapiens.2005
|db_type: RepDbLite
|type_of_gene_id: RepBase identifiers
|created_by: TxDbLite 1.9.113
|creation_time: Fri Aug 19 14:30:55 2016
|organism: Homo sapiens
|genome_build: RepBase20_05
|source_file: Homo_sapiens.RepBase.20_05.merged.fa.gz
| 1116 repeat exemplars from 68 repeat families (no known genes).
package loading is working fine (I am an idiot)
Need to check mouse, and perhaps Drosophila melanogaster.
Successfully rebuilt a mouse TxDbLite ENSEMBL package.
mus <- ensDbLiteFromFasta("Mus_musculus.GRCm38.cdna.all.fa.gz")
Loading required package: org.Mm.eg.db
Extracting transcript lengths...done.
Extracting transcript descriptions...done.
Extracting genomic coordinates...done.
Extracting gene and biotype associations...done.
Tabulating GC content...done.
Tabulating transcript biotypes...done.
Tabulating genes......done.
Creating the database...done.
Writing the gene table...done.
Tabulating gene biotypes...done.
Writing the gene_biotype table...done.
Writing the tx table...done.
Tabulating transcript biotypes...done.
Writing the tx_biotype table...done.
Writing the biotype_class table...done.
mus
[1] "EnsDbLite.Mmusculus.cdna.sqlite"
EnsDbLite(mus)
EnsDbLite :
|package_name: EnsDbLite.Mmusculus.cdna
|db_type: EnsDbLite
|type_of_gene_id: Ensembl Gene ID
|created_by: TxDbLite 1.9.113
|creation_time: Sun Aug 21 09:16:21 2016
|organism: Mus musculus
|genome_build: GRCm38
|source_file: Mus_musculus.GRCm38.cdna.all.fa.gz
| 98492 transcripts from 32737 bundles (genes).
makeDbLitePkg(mus)
Error: could not find function "makeDbLitePkg"
makeEnsDbLitePkg(mus)
Creating package in ./EnsDbLite.Mmusculus.cdna
[1] "EnsDbLite.Mmusculus.cdna"
library(EnsDbLite.Mmusculus.cdna)
EnsDbLite.Mmusculus.cdna
EnsDbLite :
|package_name: EnsDbLite.Mmusculus.cdna
|db_type: EnsDbLite
|type_of_gene_id: Ensembl Gene ID
|created_by: TxDbLite 1.9.113
|creation_time: Sun Aug 21 09:16:21 2016
|organism: Mus musculus
|genome_build: GRCm38
|source_file: Mus_musculus.GRCm38.cdna.all.fa.gz
| 98492 transcripts from 32737 bundles (genes).
Mouse RepBase Library creation and loading works
rr <- repDbLiteFromMouseFasta("Mus_musculus.RepBase.mousub_merged_rodrep.fa")
Extracting repeat lengths...done.
Extracting repeat descriptions...1020 uncataloged repeat biotypes, fix case...
1020 uncataloged repeat biotypes, fix Tiggers...
1019 uncataloged mouse repeat biotypes, fix Alus...
Alus were not found in uncataloged mouse repeats, skipping ...
1019 uncataloged mouse repeat biotypes, fix LINE1...
937 uncataloged mouse repeat biotypes, fix MERs...
931 uncataloged mouse repeat biotypes, fix LTRs...
788 uncataloged mouse repeat biotypes, fix SVAs...
SVAs were not found in the uncataloged mouse ...
788 uncataloged mouse repeat biotypes, fix SINEs...
SINEs were not found in the repeat fasta for mouse ...
788 uncataloged mouse repeat biotypes, fix Mariners...
788 uncataloged mouse repeat biotypes... hinting...
0 uncataloged repeat biotypes after hinting.
done.
Creating the database...done.
Warning message:
In .Call2("fasta_index", filexp_list, nrec, skip, seek.first.rec, :
  reading FASTA file Mus_musculus.RepBase.mousub_merged_rodrep.fa: ignored 327 invalid one-letter sequence codes
RepDbLite(rr)
RepDbLite :
|package_name: RepDbLite.Mmusculus.RepBase
|db_type: RepDbLite
|type_of_gene_id: RepBase identifiers
|created_by: TxDbLite 1.9.113
|creation_time: Sun Aug 21 09:34:28 2016
|organism: Mus musculus
|genome_build: RepBasemousub_merged_rodrep
|source_file: Mus_musculus.RepBase.mousub_merged_rodrep.fa
| 1563 repeat exemplars from 72 repeat families (no known genes).
makeRepDbLitePkg(rr)
Creating package in ./RepDbLite.Mmusculus.RepBase
[1] "RepDbLite.Mmusculus.RepBase"
library(RepDbLite.Mmusculus.RepBase)
RepDbLite.Mmusculus.RepBase
RepDbLite :
|package_name: RepDbLite.Mmusculus.RepBase
|db_type: RepDbLite
|type_of_gene_id: RepBase identifiers
|created_by: TxDbLite 1.9.113
|creation_time: Sun Aug 21 09:34:28 2016
|organism: Mus musculus
|genome_build: RepBasemousub_merged_rodrep
|source_file: Mus_musculus.RepBase.mousub_merged_rodrep.fa
| 1563 repeat exemplars from 72 repeat families (no known genes).
Closing. I'm okay with my master branch. I am going to merge my master (PR ) into head. The mouse / human ENSEMBL and RepBase stuff works.
I could test Drosophila melanogaster, but the Bioc package is only going to support Mus and Homo initially.
So I've successfully rebuilt the H. sapiens ENSEMBL 81 packages; the only fault is that the DB is not loaded on package loading (to be fixed soon). Need to check mouse, and the human repeat package builds.
[1] "EnsDbLite.Hsapiens.81.sqlite"