"GeneName Missing" BioMart exception handling (resend request without external_gene_name property)
BioMart attributes mapping logic (reduces redundancy, more intuitive to read)
More detailed logging
Progress tracking
INFO -- profile:: 1/32 alaibachii_eg 5530 records
Exception handling
WARNING -- xref_entrezgene:: lmajor_eg: Skipping species without entrez gene id
WARNING -- gene_main:: pvexans_eg: Retried to request species without external gene name
Overall success reports to track successful downloads for each species
Total: interpro:: 32/32 successes 1223059 records
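The retry and reporting behaviour listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `GeneNameMissing`, `fetch_with_fallback`, `download_all`, and the `query` callable are hypothetical names, and the log formats only approximate the sample lines shown above.

```python
import logging

logger = logging.getLogger("gene_main")


class GeneNameMissing(Exception):
    """Hypothetical error: BioMart rejected the external_gene_name attribute."""


def fetch_with_fallback(species, attributes, query):
    """Run the query; on GeneNameMissing, resend it without external_gene_name.

    `query` is a hypothetical callable (species, attributes) -> list of records.
    """
    try:
        return query(species, attributes)
    except GeneNameMissing:
        reduced = [a for a in attributes if a != "external_gene_name"]
        logger.warning(
            "%s: Retried to request species without external gene name", species
        )
        return query(species, reduced)


def download_all(species_list, attributes, query):
    """Download every species, logging progress and an overall success report."""
    successes, total_records = 0, 0
    for index, species in enumerate(species_list, start=1):
        try:
            records = fetch_with_fallback(species, attributes, query)
        except Exception:
            # Any other failure is logged and the species is skipped.
            logger.exception("%s: download failed", species)
            continue
        successes += 1
        total_records += len(records)
        logger.info(
            "%d/%d %s %d records", index, len(species_list), species, len(records)
        )
    logger.info(
        "Total: %d/%d successes %d records",
        successes, len(species_list), total_records,
    )
    return successes, total_records
```

Passing the query function in as a parameter keeps the sketch independent of any particular BioMart client and makes the fallback easy to exercise with a stub.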
Retrieve the Ensembl (main) species list file from /pub/release-%s/mysql/ensemblmart%s/dataset_names.txt.gz instead of /pub/release-%s/mysql/ensemblproduction%s/species.txt.gz
More on the species list file:
Inspection shows that the old species.txt has 163 records, while the BioMart dropdown list has 138 options and dataset_names.txt has 138 records (matching the dropdown list). There are 25 records (species) in species.txt that are not in dataset_names.txt. Of these, 15 are musmusculus* and 1 is a "Test" record; these are safe to exclude. The remaining 9 records include 4 that are variations of another record with the same taxid:
9595
gorilla_gorilla
gorilla_gorilla_gsmrt3 (only in species.txt.gz)
10029
cricetulus_griseus
cricetulus_griseus_chok1gshd
cricetulus_griseus_crigri (only in species.txt.gz)
10181
heterocephalus_glaber_male
heterocephalus_glaber_female
heterocephalus_glaber (only in species.txt.gz)
349432
saimiri_boliviensis_boliviensis
saimiri_boliviensis (only in species.txt.gz)
The other 5 records are species that have no other record with the same taxid in species.txt:
chrysemys_picta_bellii Western painted turtle
physeter_macrocephalus Sperm whale
melopsittacus_undulatus Budgie
ceratotherium_simum_simum Southern white rhinoceros
orycteropus_afer_afer Aardvark
These species are not present in the BioMart dropdown list.
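The comparison described above could be reproduced with a sketch along these lines. It assumes both files have already been parsed: `species_records` as `(taxid, name)` pairs from species.txt and `dataset_names` as plain names from dataset_names.txt. The function name and the input shapes are assumptions made for illustration, not the actual file layouts.

```python
from collections import defaultdict


def compare_species_lists(species_records, dataset_names):
    """Classify records that appear only in species.txt.

    Returns (variants, unmatched):
      variants  - records sharing a taxid with a name kept in dataset_names.txt
      unmatched - records whose taxid has no name in dataset_names.txt
    """
    in_mart = set(dataset_names)

    # Group all species.txt names by taxid.
    by_taxid = defaultdict(list)
    for taxid, name in species_records:
        by_taxid[taxid].append(name)

    variants, unmatched = [], []
    for taxid, names in by_taxid.items():
        has_counterpart = any(name in in_mart for name in names)
        for name in names:
            if name in in_mart:
                continue  # present in both files, nothing to report
            target = variants if has_counterpart else unmatched
            target.append((taxid, name))
    return variants, unmatched
```

With the gorilla records listed above as input, `gorilla_gorilla_gsmrt3` would land in `variants`, because `gorilla_gorilla` with the same taxid is present in dataset_names.txt.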
dataset_names.txt does not exist for Ensembl databases other than the main one.
species.txt exists for all databases and, for the non-main ones, matches the BioMart dropdown list.
The findings above should justify this change.
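One way to express the resulting file selection is sketched below. The main template is copied verbatim from the description above. The assumptions here are that both `%s` placeholders take the release number, that the main database is identified by the string `"main"`, and that the species.txt template for other databases is supplied by the caller. `species_list_path` is a hypothetical helper, not a function from this PR.

```python
# Template for the main Ensembl database, as written in this description.
MAIN_TEMPLATE = "/pub/release-%s/mysql/ensemblmart%s/dataset_names.txt.gz"


def species_list_path(database, release, species_txt_template):
    """Return the species list path for a database and release.

    Assumption: both %s placeholders in each template are the release number.
    Only the main database uses dataset_names.txt.gz; every other database
    falls back to the caller-supplied species.txt.gz template.
    """
    if database == "main":
        return MAIN_TEMPLATE % (release, release)
    return species_txt_template % (release, release)
```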