Bioconductor / Organism.dplyr

https://bioconductor.org/packages/Organism.dplyr

atlantic salmon annotation #18

Closed marwa38 closed 11 months ago

marwa38 commented 1 year ago

Could you please add Atlantic salmon as an annotation file under Organism in Bioconductor? https://bioconductor.org/packages/release/BiocViews.html#___Organism Many thanks in advance.

lshep commented 1 year ago

There are some resources in AnnotationHub.

You can search for them:

> library(AnnotationHub)
> ah <- AnnotationHub()
> query(ah, c("salmo", "salar"))
AnnotationHub with 82 records
# snapshotDate(): 2023-06-23
# $dataprovider: Ensembl, FANTOM5,DLRP,IUPHAR,HPRD,STRING,SWISSPROT,TREMBL,E...
# $species: salmo salar, Salmo salar
# $rdataclass: GRanges, TwoBitFile, EnsDb, SQLiteFile, OrgDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH78891"]]' 

             title                                           
  AH78891  | Ensembl 99 EnsDb for Salmo salar                
  AH79444  | Salmo_salar.ICSASG_v2.99.abinitio.gtf           
  AH79445  | Salmo_salar.ICSASG_v2.99.chr.gtf                
  AH79446  | Salmo_salar.ICSASG_v2.99.gtf                    
  AH79796  | Ensembl 100 EnsDb for Salmo salar               
  ...        ...                                             
  AH111196 | Salmo_salar.Ssal_v3.1.109.abinitio.gtf          
  AH111197 | Salmo_salar.Ssal_v3.1.109.chr.gtf               
  AH111198 | Salmo_salar.Ssal_v3.1.109.gtf                   
  AH111452 | LRBaseDb for Salmo salar (Atlantic salmon, v005)
  AH111638 | org.Salmo_salar.eg.sqlite                       

and more

> query(ah, c("salmon", "atlantic"))
AnnotationHub with 5 records
# snapshotDate(): 2023-06-23
# $dataprovider: FANTOM5,DLRP,IUPHAR,HPRD,STRING,SWISSPROT,TREMBL,ENSEMBL,CE...
# $species: Salmo salar
# $rdataclass: SQLiteFile
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH91764"]]' 

             title                                           
  AH91764  | LRBaseDb for Salmo salar (Atlantic salmon, v001)
  AH97832  | LRBaseDb for Salmo salar (Atlantic salmon, v002)
  AH100540 | LRBaseDb for Salmo salar (Atlantic salmon, v003)
  AH107261 | LRBaseDb for Salmo salar (Atlantic salmon, v004)
  AH111452 | LRBaseDb for Salmo salar (Atlantic salmon, v005)
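
Once a record of interest is identified, it can be pulled from the hub with `[[`. The following is a minimal sketch, not a definitive recipe: `salmon_gene_symbols()` is a hypothetical helper (not part of AnnotationHub), and it assumes that the OrgDb record AH111638 listed above exposes `ENTREZID` and `SYMBOL` columns, which is worth confirming with `columns()` on the retrieved object.

```r
# Hypothetical helper: fetch an OrgDb record from AnnotationHub by its ID and
# return the first `n` Entrez gene IDs together with their gene symbols.
# Requires the AnnotationHub and AnnotationDbi packages and network access.
salmon_gene_symbols <- function(hub_id = "AH111638", n = 5) {
  ah <- AnnotationHub::AnnotationHub()
  orgdb <- ah[[hub_id]]  # downloads and caches the record on first use

  # Assumption: the record has ENTREZID and SYMBOL columns;
  # AnnotationDbi::columns(orgdb) lists what is actually available.
  ids <- head(AnnotationDbi::keys(orgdb, keytype = "ENTREZID"), n)
  AnnotationDbi::select(orgdb, keys = ids,
                        columns = "SYMBOL", keytype = "ENTREZID")
}
```

Calling `salmon_gene_symbols()` in a session with network access should return a small data.frame; the first call may take a while because the SQLite file has to be downloaded.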

hpages commented 1 year ago

@marwa38

FWIW I added the Salmo_salar term to the biocViews vocabulary: https://github.com/Bioconductor/biocViews/blob/cbf0ec7d111b5f244e51ff2a95b48068b6e86ed8/inst/dot/biocViewsVocab.dot#L256. Note that the Salmo_salar view won't show up here until at least one package adds the Salmo_salar term to its biocViews field.

I also registered a few Salmo salar NCBI assemblies in the GenomeInfoDb package:

library(GenomeInfoDb)
registered_NCBI_assemblies("salmo")[ , c(1:3, 5)]
#      organism        assembly       date assembly_accession
# 1 Salmo salar       Ssal_v3.1 2021/04/21    GCF_905237065.1
# 2 Salmo salar USDA_NASsal_1.1 2022/01/12    GCA_021399835.1
# 3 Salmo salar Ssal_Brian_v1.0 2022/04/01    GCA_923944775.1
# 4 Salmo salar       Ssal_ALTA 2022/05/11    GCA_931346935.2

This allows you to easily retrieve chromosome/scaffold names and attributes for a given assembly:

ssal_chrom_info <- getChromInfoFromNCBI("Ssal_v3.1")

dim(ssal_chrom_info)
# [1] 4011   10

ssal_chrom_info[1:10, c(1:2, 8, 10)]
#    SequenceName       SequenceRole SequenceLength circular
# 1         ssa01 assembled-molecule      174498729    FALSE
# 2         ssa02 assembled-molecule       95481959    FALSE
# 3         ssa03 assembled-molecule      105780080    FALSE
# 4         ssa04 assembled-molecule       90536438    FALSE
# 5         ssa05 assembled-molecule       92788608    FALSE
# 6         ssa06 assembled-molecule       96060288    FALSE
# 7         ssa07 assembled-molecule       68862998    FALSE
# 8         ssa08 assembled-molecule       28860523    FALSE
# 9         ssa09 assembled-molecule      161282225    FALSE
# 10        ssa10 assembled-molecule      125877811    FALSE
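
Since the table mixes chromosomes with scaffolds, a common follow-up is to keep only the assembled molecules. Here is a small sketch of that step, assuming the column names shown in the output above; `assembled_molecules()` is a hypothetical helper, and the toy input includes one made-up scaffold row purely for illustration.

```r
# Hypothetical helper: keep only the rows whose SequenceRole is
# "assembled-molecule" and return a few columns of interest.
assembled_molecules <- function(chrom_info) {
  keep <- chrom_info$SequenceRole == "assembled-molecule"
  chrom_info[keep, c("SequenceName", "SequenceLength", "circular"),
             drop = FALSE]
}

# Toy input: the first two rows are copied from the output above; the third
# row is a made-up scaffold, included only to show the filtering.
toy <- data.frame(
  SequenceName   = c("ssa01", "ssa02", "scaffold_x"),
  SequenceRole   = c("assembled-molecule", "assembled-molecule",
                     "unplaced-scaffold"),
  SequenceLength = c(174498729, 95481959, 1000),
  circular       = c(FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

chromosomes <- assembled_molecules(toy)
print(chromosomes)
```

On real data the same helper would be applied to the object created earlier, i.e. `assembled_molecules(ssal_chrom_info)`.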

Finally, it also makes it super easy to forge a BSgenome package for a given assembly, using the BSgenomeForge package:

library(BSgenomeForge)
forgeBSgenomeDataPkgFromNCBI(assembly_accession="GCF_905237065.1",
                             pkg_maintainer="Jane Doe <janedoe@gmail.com>")
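
The forged package is a source directory that still has to be installed before it can be loaded. The sketch below wraps both steps. `forge_and_install()` is a hypothetical helper, and it assumes that `forgeBSgenomeDataPkgFromNCBI()` invisibly returns the path of the directory it creates and accepts a `destdir` argument; please check `?forgeBSgenomeDataPkgFromNCBI` before relying on either.

```r
# Hypothetical helper: forge a BSgenome data package for an NCBI assembly and
# install it from the resulting source directory.
# Requires the BSgenomeForge package and network access; the download can be
# large for a genome of this size.
forge_and_install <- function(accession, maintainer, destdir = tempdir()) {
  # Assumption: the return value is the path to the created package directory.
  pkg_dir <- BSgenomeForge::forgeBSgenomeDataPkgFromNCBI(
    assembly_accession = accession,
    pkg_maintainer     = maintainer,
    destdir            = destdir
  )
  # Install straight from the source directory.
  install.packages(pkg_dir, repos = NULL, type = "source")
  invisible(pkg_dir)
}
```

For example, `forge_and_install("GCF_905237065.1", "Jane Doe <janedoe@gmail.com>")` would forge and install a package for Ssal_v3.1; the name of the resulting package is chosen by BSgenomeForge and can be read from the returned path.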

Let us know if that addresses your issue so we can close. Thanks!

marwa38 commented 11 months ago

Thanks so much :)