using query_missingGenes_from_module on multiple modules

arianccbasile commented 5 years ago

Dear Andreas, I am trying to use your script query_missingGenes_from_module to spot missing genes from module but I need to do it recursively on multiple modules (a list provided) and then write the result in a file. When I do it on one module only I have no problem but when I put my list in a vector and call with a loop the elements of the vector to be my "ID_MODULE" I don't manage to format the output correctly.

Any hint? Thank you in advantage and for having provided this amazing resource. Sincerely, Arianna

asmvernon commented 5 years ago

Hi Arianna, I'm glad you're finding it useful!

The query_missingGenes_from_module returns a data frame where MISSING_GENES contains a list of the genes. There are a few options on how you can then retrieve these data and aggregate it for multiple modules depending on what you want to do with it afterward.

Here's some code with some options. Hope one of them help guide you to best achieve your goal:

library(Metqy)
# install.packages('reshape2')

data("data_example_KOnumbers_vector")
modules <- c('M00001', 'M00002')

module_missing_genes_str <- data.frame('MODULE' = modules,
                                   'MISSING_GENES' = '',
                                   stringsAsFactors = FALSE)

# Collect pairs of modules and each missing gene by constructnig the following structure:
#
#   MODULE  MISSING_GENE
#   M00001 K00001
#   M00001 K00002
#   M00001 K00003
#   M00002 K00004
module_missing_genes_pairs <- NULL

for(M in 1:length(modules)){
  OUT <- query_missingGenes_from_module(data_example_KOnumbers_vector,modules[M])
  these_genes_list <- OUT$MISSING_GENES
  # collapse list into string
  these_genes_str <- paste(these_genes_list, collapse=',')
  module_missing_genes_str$MISSING_GENES[M] <- these_genes_str

  # Append the combination of the module ID repeted as many times as there are missing genes 
  #                             combine rows together
  module_missing_genes_pairs <- rbind(module_missing_genes_pairs,
                                      # combine two list into pair-wise columns
                                      cbind(rep(modules, length(these_genes_list)), these_genes_list))

}
# Tiddy module_missing_genes_pairs up 
module_missing_genes_pairs <- data.frame(module_missing_genes_pairs, stringsAsFactors = FALSE)
names(module_missing_genes_pairs) <- c('MODULE', 'MISSING_GENES')

# GENERATE A MATRIX OF THE MODULES AND THE MISSING GENES WHERE '1' INDICATES THE GENE IS MISSING FOR THAT MODULE
module_missing_genes_pairs$Value <- 1
module_missing_genes_matrix <- reshape2::dcast(module_missing_genes_pairs,formula = MODULE~MISSING_GENES,value.var = "Value",drop = F,fill = 0)
# Use the module ID to name the rows
rownames(module_missing_genes_matrix) <- module_missing_genes_matrix[,1]
# Remove the colums with the module IDS to have a numeric matrix

module_missing_genes_matrix           <- module_missing_genes_matrix[,2:ncol(module_missing_genes_matrix)]

You can find the script here

Let me know if you need any further help!

Also, I'd be interested to know what you are using it for :)

All the best, Andrea

arianccbasile commented 4 years ago

It sounds super good! I will try it asap. I was using it for gapfilling of metabolic models but actually I abandoned this way and used an other tool at the very ending. Now this pipeline came to my mind because I am doing metagenome assembled genomes annotation and I am back to this script.

Random question, do you plan to update the db?

All the best, Arianna

asmvernon commented 4 years ago

Hi Arianna, Hope you find it useful!

I don't plan to update the db, but you can pass updated data if you have access to it :) Please look at the documentation on how to do this by using the use_module_reference_table and use_genome_reference_table argument fields

Best, A

OSS-Lab / MetQy

using query_missingGenes_from_module on multiple modules #3