ben-silke / biol3209

This repo contains the code for the undertaking of the biol3209 subject.
BSD 3-Clause "New" or "Revised" License

DATA: summary statistics of database #18

Open ben-silke opened 2 years ago

ben-silke commented 2 years ago

Collect summary statistics from the dataset.

This can be done easily with Django methods.

ben-silke commented 2 years ago

    Running get_statistics.py
    Getting reference data for db_name='InterPro'
    len(high_counts)=1696
    Getting reference data for db_name='UniProtKB/Swiss-Prot'
    len(high_counts)=3

Code Used

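    # Assumes db_name is set by the caller and the models are already imported.
    database = Database.objects.get(name=db_name)
    references = DatabaseFeatureReference.objects.filter(database=database)

    # Distinct db_xref values referenced for this database.
    unique_db_xrefs = {reference.db_xref for reference in references}
    # Number of references per db_xref (one COUNT query per value).
    counts = {
        db_xref: references.filter(db_xref=db_xref).count()
        for db_xref in unique_db_xrefs
    }
    # Keep only the db_xrefs that occur more than once.
    high_counts = {
        key: value
        for key, value in counts.items() if value > 1
    }
    print(f'{len(high_counts)=}')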
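
The snippet above runs one count query per distinct db_xref. A minimal sketch of the same tally done in a single pass, assuming the db_xref values have already been pulled into a plain list. `count_repeated` is a hypothetical helper name, not something from the repo, and the sample values are made up for illustration:

```python
from collections import Counter


def count_repeated(db_xrefs, threshold=1):
    """Return {db_xref: count} for values seen more than `threshold` times."""
    counts = Counter(db_xrefs)  # one pass over the data, no per-value query
    return {key: value for key, value in counts.items() if value > threshold}


# Made-up sample values, standing in for something like
# references.values_list("db_xref", flat=True) in the Django code.
sample = [
    "IPR000001", "IPR000002", "IPR000001",
    "IPR000003", "IPR000002", "IPR000001",
]
high_counts = count_repeated(sample)
print(f"{len(high_counts)=}")
```

With the ORM, the grouping can usually be pushed into the database instead, along the lines of `references.values("db_xref").annotate(n=Count("id")).filter(n__gt=1)`, where `Count` is imported from `django.db.models`. That assumes the model has the default `id` primary key.
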
ben-silke commented 2 years ago

Going to use Swiss-Prot for now.

ben-silke commented 2 years ago

I've been looking through this data some more. The Swiss-Prot db_xrefs all come from two files (NC_000913.3 and NC_000964.3), and the InterPro references all come from one file (NC_000964.3). So I don't think these databases are very good if we are looking for a more holistic view. I'll start querying the files to look at db_xrefs shared across different files. Otherwise, we might need to use multiple databases to determine the orthology.
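
Finding the db_xrefs shared across files could be sketched as below. This assumes each file's db_xrefs have already been collected into a set. `shared_db_xrefs` is a hypothetical helper name, and the db_xref values are invented for illustration:

```python
from collections import defaultdict


def shared_db_xrefs(xrefs_by_file, min_files=2):
    """Map each db_xref to the sorted list of files containing it,
    keeping only db_xrefs found in at least `min_files` files."""
    files_by_xref = defaultdict(set)
    for filename, xrefs in xrefs_by_file.items():
        for xref in xrefs:
            files_by_xref[xref].add(filename)
    return {
        xref: sorted(files)
        for xref, files in files_by_xref.items()
        if len(files) >= min_files
    }


# Illustrative data only: the db_xref values are placeholders.
xrefs_by_file = {
    "NC_000913.3": {"RFAM:RF00001", "GeneID:1", "taxon:1"},
    "NC_000964.3": {"RFAM:RF00001", "GeneID:2", "taxon:2"},
}
shared = shared_db_xrefs(xrefs_by_file)
print(shared)
```

Inverting the mapping (db_xref to files, rather than file to db_xrefs) keeps this to one pass over the data regardless of how many files are compared.
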

ben-silke commented 2 years ago

The numbers are not meant to be exact; I just used Ctrl+F to get a sense.

db_xref       occurs in no. of files
Swiss-Prot    2
InterPro      1
GeneID        7
RFAM          550
taxon         556
ASAP          1
EcoCyc        1
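
The manual tally could be automated roughly as follows. This sketch assumes the files are GenBank flat files in which cross-references appear as `/db_xref="Database:identifier"` qualifiers. `files_per_database` is a hypothetical helper, and the two file snippets are invented:

```python
import re
from collections import Counter

# Matches qualifiers such as /db_xref="GeneID:10" and captures the database name.
DB_XREF_PATTERN = re.compile(r'/db_xref="([^":]+):[^"]*"')


def files_per_database(text_by_file):
    """Count, for each database name, how many files mention it at least once."""
    tally = Counter()
    for text in text_by_file.values():
        # Use a set so a database counts once per file, however often it occurs.
        tally.update(set(DB_XREF_PATTERN.findall(text)))
    return tally


# Invented snippets standing in for the contents of two files.
text_by_file = {
    "file_a.gb": '/db_xref="taxon:1"\n/db_xref="GeneID:10"\n/db_xref="GeneID:11"',
    "file_b.gb": '/db_xref="taxon:2"\n/db_xref="RFAM:RF00001"',
}
result = files_per_database(text_by_file)
for database, n_files in sorted(result.items()):
    print(database, n_files)
```
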
ben-silke commented 2 years ago

taxon = NCBI Taxonomy Browser

ASAP, EcoCyc, PDB, Ensembl

ben-silke commented 2 years ago

RFAM is generally used within rRNA features.