aifimmunology / MOCHA

R package for single-cell Open Chromatin Identification & Downstream Analysis
https://aifimmunology.github.io/MOCHA/
GNU General Public License v3.0

Genome fix #111

Closed imran-aifi closed 1 year ago

imran-aifi commented 1 year ago

A MOCHA object created on one filesystem stores a genome in its metadata. This is not the full BSgenome database but an in-memory object that points to it. A problem arises when the MOCHA object is shared with another computer or filesystem where the BSgenome database is missing, or is not at the location the genome object points to.

We can work around this by catching the error and exposing genome as a parameter, so that the user can supply the correct genome, which they are then required to have installed.
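A minimal sketch of that idea, not the code in this PR: `resolveGenome` and its `validate` argument are hypothetical names. `validate` stands in for whatever call first touches the genome's on-disk data, and the user-supplied `genome` takes priority over the stored one.

```r
# Hypothetical helper (not part of MOCHA): prefer a user-supplied genome,
# otherwise try the stored one and turn a low-level failure into an
# actionable error message.
resolveGenome <- function(storedGenome, genome = NULL,
                          validate = function(g) g) {
  # An explicitly supplied genome always wins.
  if (!is.null(genome)) {
    return(genome)
  }
  tryCatch(
    # `validate` is an assumed placeholder for the first access to the
    # genome's backing files; it should error if they cannot be reached.
    validate(storedGenome),
    error = function(e) {
      stop(
        "Could not use the genome stored in the object metadata (",
        conditionMessage(e), "). ",
        "Install the matching BSgenome package and pass it via `genome`.",
        call. = FALSE
      )
    }
  )
}
```

With this shape, callers on the original machine need no extra argument, while callers on a new machine get an error that tells them which parameter to set.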

markphillippebworth commented 1 year ago

Extracting the genome from the MOCHA object is non-trivial. Could we set it up so that, if genome is null, it checks the directory for a saved genome file and returns a useful error if none is found?
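The fallback suggested here could look roughly like the following. This is only a sketch under assumptions: `loadSavedGenome` is a hypothetical function, and the file name `genome.rds` is invented for illustration and is not a MOCHA convention.

```r
# Hypothetical helper (not part of MOCHA). "genome.rds" is an assumed
# file name chosen for this example only.
loadSavedGenome <- function(genome = NULL, directory,
                            fileName = "genome.rds") {
  # Use the caller's genome when one is given.
  if (!is.null(genome)) {
    return(genome)
  }
  path <- file.path(directory, fileName)
  # Fail early with a message that names the path that was checked.
  if (!file.exists(path)) {
    stop(
      "No genome was supplied and no saved genome file was found at '",
      path, "'. Provide `genome` explicitly.",
      call. = FALSE
    )
  }
  readRDS(path)
}
```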

imran-aifi commented 1 year ago

Changes:

$Genome

$TxDb & $Org

All:

imran-aifi commented 1 year ago

We now save the package names of the genome, TxDb, and OrgDb. We use MOCHA:::getAnnotationDbFromInstalledPkgname and BSgenome::getBSgenome to check that these packages are installed and to load them by name. Saved metadata now looks like this:

> metadata(SampleTileMatrix)
$CellCounts
           cellTypeLabelList
             C2  C5
  PBMCSmall 152 201

$FragmentCounts
              C2     C5
PBMCSmall 117146 171033

$Genome
[1] "hg19"

$TxDb
$TxDb$pkgname
[1] "TxDb.Hsapiens.UCSC.hg38.refGene"

$TxDb$metadata
                                       name                                        value
1                                   Db type                                         TxDb
2                        Supporting package                              GenomicFeatures
3                               Data source                                         UCSC
4                                    Genome                                         hg38
5                                  Organism                                 Homo sapiens
6                               Taxonomy ID                                         9606
7                                UCSC Table                                      refGene
8                                UCSC Track                                  NCBI RefSeq
9                              Resource URL                      http://genome.ucsc.edu/
10                          Type of Gene ID                               Entrez Gene ID
11                             Full dataset                                          yes
12                         miRBase build ID                                         <NA>
13                        Nb of transcripts                                        88816
14                            Db created by    GenomicFeatures package from Bioconductor
15                            Creation time 2021-04-28 16:30:46 +0000 (Wed, 28 Apr 2021)
16 GenomicFeatures version at creation time                                       1.41.3
17         RSQLite version at creation time                                        2.2.6
18                          DBSCHEMAVERSION                                          1.2

$OrgDb
$OrgDb$pkgname
[1] "org.Hs.eg.db"

$OrgDb$metadata
                 name                                                 value
1     DBSCHEMAVERSION                                                   2.1
2             Db type                                                 OrgDb
3  Supporting package                                         AnnotationDbi
4            DBSCHEMA                                              HUMAN_DB
5            ORGANISM                                          Homo sapiens
6             SPECIES                                                 Human
7        EGSOURCEDATE                                            2021-Sep13
8        EGSOURCENAME                                           Entrez Gene
9         EGSOURCEURL                  ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
10          CENTRALID                                                    EG
11              TAXID                                                  9606
12       GOSOURCENAME                                         Gene Ontology
13        GOSOURCEURL http://current.geneontology.org/ontology/go-basic.obo
14       GOSOURCEDATE                                            2021-09-01
15     GOEGSOURCEDATE                                            2021-Sep13
16     GOEGSOURCENAME                                           Entrez Gene
17      GOEGSOURCEURL                  ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
18     KEGGSOURCENAME                                           KEGG GENOME
19      KEGGSOURCEURL                  ftp://ftp.genome.jp/pub/kegg/genomes
20     KEGGSOURCEDATE                                            2011-Mar15
21       GPSOURCENAME             UCSC Genome Bioinformatics (Homo sapiens)
22        GPSOURCEURL                                                      
23       GPSOURCEDATE                                            2021-Jul20
24       ENSOURCEDATE                                            2021-Apr13
25       ENSOURCENAME                                               Ensembl
26        ENSOURCEURL               ftp://ftp.ensembl.org/pub/current_fasta
27       UPSOURCENAME                                               Uniprot
28        UPSOURCEURL                               http://www.UniProt.org/
29       UPSOURCEDATE                              Wed Sep 15 18:21:59 2021

$Directory
[1] "/Users/imran.mcgrath/Documents/projects/PBMCSmall/MOCHA"
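As a rough illustration of loading from stored names, the sketch below uses only base R plus `BSgenome::getBSgenome`. It does not reproduce the internal `MOCHA:::getAnnotationDbFromInstalledPkgname`; `loadObjectFromPkgname` and `loadGenomeByName` are hypothetical names. It relies on the fact that TxDb and OrgDb annotation packages export an object named after the package itself.

```r
# Hypothetical helpers (not part of MOCHA).

# Check that a package is installed, then return an object it exports.
# For TxDb/OrgDb packages the exported object shares the package's name,
# so `objName` defaults to `pkgname`.
loadObjectFromPkgname <- function(pkgname, objName = pkgname) {
  if (!requireNamespace(pkgname, quietly = TRUE)) {
    stop(
      "Package '", pkgname, "' is not installed. Install it, then retry.",
      call. = FALSE
    )
  }
  getExportedValue(pkgname, objName)
}

# Resolve a genome name such as "hg19" to an installed BSgenome object.
loadGenomeByName <- function(genomeName) {
  if (!requireNamespace("BSgenome", quietly = TRUE)) {
    stop("Package 'BSgenome' is not installed.", call. = FALSE)
  }
  BSgenome::getBSgenome(genomeName)
}

# Example calls, assuming the relevant Bioconductor packages are installed:
# meta   <- metadata(SampleTileMatrix)
# txdb   <- loadObjectFromPkgname(meta$TxDb$pkgname)
# orgdb  <- loadObjectFromPkgname(meta$OrgDb$pkgname)
# genome <- loadGenomeByName(meta$Genome)
```

Storing only names keeps the saved object free of machine-specific paths, at the cost of requiring the same annotation packages on every machine that opens it.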

imran-aifi commented 1 year ago

Merging this since the changes are tested and do not affect parallelization strategies.