NCBI-Hackathons / GeneHummus

An Automated Pipeline to Classify Gene Families based on Protein Domain Organization using Auxin Response Factors in Legumes as an Example

Loading taxonomic ids #9

Closed godwinjames closed 4 years ago

godwinjames commented 4 years ago

Hi,

I am new to R programming, so sorry if this is a basic question.

I installed geneHummus, but it does not seem to come with ids for other families, such as Brassicaceae. How can I pull taxonomic ids for a given family from NCBI and load them into R?

Thanks

jdieramon commented 4 years ago

Hi @godwinjames. This is a very good question. Ids for other families beyond legumes are: brassicaceaeIds, cucurbitaceaeIds, rosaceaeIds, and solanaceaeIds. They come with the geneHummus installation. They are just not exported from the NAMESPACE, so you need to access them with the ::: operator. For example, if you are interested in the Auxin Response Factor members from species within the Brassicaceae, try this:

# Load library
library(geneHummus)

# Define Conserved Domains
arf <- c("pfam02362", "pfam06507", "pfam02309")
archids <- getArch_ids(arf)

# Set filter: CD name
my_filter <- c("B3_DNA", "Auxin_resp")
# Set filter: gene family name 
family_name = "auxin response factor" 

# Filter architectures
filtered_archids <- filterArch_ids(archids, my_filter, family_name)

# Get protein (electronic) ids
arf_brassica = getProteins_from_tax_ids(filtered_archids, geneHummus:::brassicaceaeIds)

# Extract protein accessions
arf_accs_brass   = getAccessions(arf_brassica)

# Number of proteins per species
accessions_by_spp(arf_accs_brass)

# Results 
arf_accs_brass
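
As a side note on the ::: operator used above, here is a minimal sketch of what "not exported from the NAMESPACE" means. It uses the base 'stats' package as a stand-in so that it runs without geneHummus; the variable names (ns, exported, internal, obj_name) are only illustrative. The commented lines at the end assume geneHummus is installed.

```r
# Illustration with the base 'stats' package (stand-in for geneHummus)
ns <- asNamespace("stats")
exported <- getNamespaceExports("stats")

# Objects that live in the namespace but are not exported
internal <- setdiff(ls(ns), exported)
obj_name <- internal[1]

# Not part of the exported interface, so 'stats::<name>' would fail ...
obj_name %in% exported                          # FALSE

# ... but it exists inside the namespace, which is where ':::' looks
exists(obj_name, envir = ns, inherits = FALSE)  # TRUE

# The same idea for geneHummus (assumes the package is installed):
# "brassicaceaeIds" %in% getNamespaceExports("geneHummus")
# head(geneHummus:::brassicaceaeIds)
```

The ::: operator is meant for this kind of occasional access to internal objects; the objects themselves are the same ones the package uses internally.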

Anyway, you may want to work with other species not installed with geneHummus. If you need to pull the taxonomy ids, check this out.

godwinjames commented 4 years ago

Thanks for the code, @jdieramon, and for clarifying my doubt. I am trying to study the conservation of UBIQUITIN SPECIFIC PROTEASES family proteins in Brassicaceae.

I used the following code and it says the NCBI servers are busy. What does this mean?

library(geneHummus)
ubp <- c("PF06337", "PF00443")
archids <- getArch_ids(ubp)
my_filter <- c("DUSP", "UCH")
family_name = "ubiquitin specific proteases"
filtered_archids <- filterArch_ids(archids, my_filter, family_name)
ubp_brassica = getProteins_from_tax_ids(filtered_archids, geneHummus:::brassicaceaeIds)
ubp_accs_brass = getAccessions(ubp_brassica)

NCBI servers are busy. Please try again a bit later.

jdieramon commented 4 years ago

Hi @godwinjames. There are some important issues here. Gene families usually have highly conserved domain classes and labels, but the UBP family is a bit special. For example, in Arabidopsis, UBP2 has 4 conserved domains: cd02667, cl02553, cl35019, pfam02148, whereas UBP14 has only two (and different CDs!): cd02658, cl34941. This is a potential problem because you need all possible conserved domains to initialize your first vector.

The second problem concerns the labels, as they are also different. In the same links, you'll find the labels for:
UBP2: zf-UBP and Peptidase_C19K domain-containing protein
UBP14: protein containing domains zf-UBP, Peptidase_C19B, UBA1_UBP13, and UBA2_UBP5

You cannot use the family name 'ubiquitin specific proteases' as a filter because it does not appear in the labels. As the labels are not the same, just pick one of them to act as the filter. It does not matter which one. The filter will be complemented by your vector 'my_filter', which is a character vector. The function keeps the protein architectures whose labels show all the elements of that character vector. As the labels are different, it is again a problem to select a good filter. However, if you are familiar with this gene family, you'll know that a common element in each label is "C19". So, I would use that as 'my_filter'.
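
To make the "all the elements" rule concrete, here is a minimal sketch in base R applied to the two labels quoted above. The helpers matches_all and keep_labels are hypothetical and are not part of geneHummus. They only mimic the behaviour described in this comment and assume plain, case-sensitive substring matching.

```r
# Labels quoted in this thread for two Arabidopsis UBP proteins
labels <- c(
  UBP2  = "zf-UBP and Peptidase_C19K domain-containing protein",
  UBP14 = "protein containing domains zf-UBP, Peptidase_C19B, UBA1_UBP13, and UBA2_UBP5"
)

# Hypothetical helper: TRUE when a label contains every term of the filter
matches_all <- function(label, terms) {
  all(vapply(terms, function(t) grepl(t, label, fixed = TRUE), logical(1)))
}

# Hypothetical helper: keep only the labels that pass the filter
keep_labels <- function(labels, terms) {
  labels[vapply(labels, matches_all, logical(1), terms = terms)]
}

names(keep_labels(labels, c("DUSP", "UCH")))  # none: neither term is in the labels
names(keep_labels(labels, "C19"))             # both UBP2 and UBP14 are kept
names(keep_labels(labels, c("C19", "UBA1")))  # only UBP14 has both terms
```

Under these assumptions, the original filter c("DUSP", "UCH") discards both proteins, while "C19" keeps them, which is why a short term shared by every label is the safer choice.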

Considering those two issues, we have obtained the object ubp_brassica.rda. Then, you can run the following on this object:

ubp_accs_brass = getAccessions(ubp_brassica)
accessions_by_spp(ubp_accs_brass)
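
Since you are new to R: an .rda file has to be loaded into the session before its objects can be used. Below is a minimal sketch using base R's load(). The helper read_rda_object is hypothetical, and the commented lines assume that the file sits in your working directory and that the object stored inside it is named ubp_brassica. The round trip with a temporary file is only there so the sketch runs without the real file.

```r
# Hypothetical helper: load an .rda file and return one named object from it
read_rda_object <- function(path, name) {
  env <- new.env()
  loaded <- load(path, envir = env)  # load() returns the names it restored
  if (!name %in% loaded) {
    stop("Object '", name, "' not found in ", path)
  }
  get(name, envir = env)
}

# Round trip with a stand-in object (placeholder values, not real protein ids)
tmp <- tempfile(fileext = ".rda")
ubp_brassica_demo <- c("id1", "id2")
save(ubp_brassica_demo, file = tmp)
demo_ok <- identical(read_rda_object(tmp, "ubp_brassica_demo"), ubp_brassica_demo)

# With the real file (assumed location and object name):
# library(geneHummus)
# ubp_brassica <- read_rda_object("ubp_brassica.rda", "ubp_brassica")
# ubp_accs_brass <- getAccessions(ubp_brassica)
# accessions_by_spp(ubp_accs_brass)
```

Calling load("ubp_brassica.rda") directly also works. The helper just avoids silently overwriting objects that already exist in your workspace.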