fmalmeida / gff-toolbox

Gff-toolbox is a toolbox of commands that lets you get the gist of your GFF annotation files and analyse them in different ways.
https://github.com/fmalmeida/gff-toolbox/wiki
GNU General Public License v3.0

Add new attributes (annotations) to a mongodb collection #4

Closed rodtheo closed 3 years ago

rodtheo commented 3 years ago

Motivation

The gff-toolbox convert module is capable of converting a GFF into a MongoDB database; however, it seems that we cannot manipulate the GFF information stored in the database without relying on raw mongo commands. A common routine task in many analyses is the insertion of new information into a GFF, i.e. annotating a gene/transcript. This task could be done with many tools, such as gffutils, BCBio, or even a bash/other-language script, by including new attributes in the raw GFF column 9, using as input a file telling which set of annotations (e.g. GO, PFAM, EC number) correlates with each gene. However, the same annotation task can also be done in a different way: converting the GFF to MongoDB, including the annotations in the corresponding MongoDB collection and, further, if desired, converting back to GFF. Although this may seem a more involved procedure than annotating a raw GFF, and may spend more computer resources, it has some advantages:

Proposed solution

Probably the list of advantages and disadvantages of using MongoDB as an intermediate to accommodate annotations is longer than I can think of, but I see this approach as a facilitator. Hence, I propose a new gff-toolbox module to perform this task, i.e. annotate a MongoDB collection created by gff-toolbox convert. In the following I will try to explain the main architecture of this module, which for now I have named ingest.

We would like the ingest module to receive a set of annotations and include them in the corresponding gene/transcript entry in MongoDB. Thus, assume that the MongoDB was created by the gff-toolbox convert module - parameters XXX; XXX; - and that we also have a tab-separated txt/tsv file with annotations such as the following:

##GeneID    Id  IdType  Description
gene-KPHS_00170 PTHR30520:SF0   PANTHER TRANSPORTER-RELATED
gene-KPHS_00170 GO:0006810  GO  transport
gene-KPHS_00170 3.4.16.2    EC  Lysosomal Pro-Xaa carboxypeptidase
gene-KPHS_00170 GO:0005215  GO  transporter activity
gene-KPHS_02590 GO:0003735  GO  structural constituent of ribosome
gene-KPHS_02590 PTHR36029   PANTHER 

Inspecting the MongoDB collection entry that corresponds to gene-KPHS_00170, we can retrieve the JSON listing its information:

{'_id': ObjectId('612e788a94ee11baab643fb0'),
  'recid': 'NC_016845.1',
  'source': 'RefSeq',
  'type': 'gene',
  'start': '22533',
  'end': '22802',
  'score': '.',
  'strand': '+',
  'phase': '.',
  'attributes': {'ID': 'gene-KPHS_00170',
   'Dbxref': 'GeneID:11844995',
   'Name': 'KPHS_00170',
   'gbkey': 'Gene',
   'gene_biotype': 'protein_coding',
   'locus_tag': 'KPHS_00170'}}
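For illustration, a document like this could be fetched with pymongo. This is a minimal sketch; the database and collection names ("gff", "features") are assumptions, not necessarily what gff-toolbox convert creates:

```python
# Hypothetical lookup sketch. The database/collection names below are
# placeholders, not the names gff-toolbox actually uses.

def gene_filter(gene_id):
    """Build the Mongo filter matching a feature document by its GFF ID attribute."""
    return {"attributes.ID": gene_id}

if __name__ == "__main__":
    from pymongo import MongoClient  # third-party; only needed to actually query
    collection = MongoClient()["gff"]["features"]  # assumed names
    print(collection.find_one(gene_filter("gene-KPHS_00170")))
```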

The aim of the proposed gff-toolbox ingest module is to insert the annotations into corresponding gene in mongodb. After this procedure, we would like to have mongodb entry for gene-KPHS_00170 stored as:

{'_id': ObjectId('612e788a94ee11baab643fb0'),
  'recid': 'NC_016845.1',
  'source': 'RefSeq',
  'type': 'gene',
  'start': '22533',
  'end': '22802',
  'score': '.',
  'strand': '+',
  'phase': '.',
  'attributes': {'ID': 'gene-KPHS_00170',
   'Dbxref': [  'GeneID:11844995' ,
                    {'DBTAG': 'PANTHER', 'ID': 'PTHR30520:SF0', 'Description': 'FORMATE TRANSPORTER-RELATED'},
                    {'DBTAG': 'PANTHER', 'ID': 'PTHR30520', 'Description': 'FORMATE TRANSPORTER-RELATED'},
                    {'DBTAG': 'PFAM', 'ID': 'PF01226', 'Description': 'Formate/nitrite transporter'}
                    ],
   'Ontology_term': [ {'DBTAG': 'GO', 'ID': 'GO:0006810', 'Description': 'transport'}, 
                    {'DBTAG': 'GO', 'ID': 'GO:0016020', 'Description': 'membrane'},
                    {'DBTAG': 'GO', 'ID': 'GO:0005215', 'Description': 'transporter activity'}
                    ],

   'Name': 'KPHS_00170',
   'gbkey': 'Gene',
   'gene_biotype': 'protein_coding',
   'locus_tag': 'KPHS_00170'}}
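A minimal sketch of how ingest could work, assuming the tab-separated format above. The function names are hypothetical, not existing gff-toolbox code; GO rows are routed to Ontology_term and everything else to Dbxref, mirroring the desired document above:

```python
import csv

# Hypothetical ingest sketch: read the tab-separated annotation file and
# turn each gene's rows into one MongoDB update document per gene.

def read_annotations(tsv_path):
    """Yield (gene_id, ann_id, id_type, description) rows, skipping header lines."""
    with open(tsv_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if row and not row[0].startswith("#"):
                yield tuple((row + [""] * 4)[:4])

def build_updates(rows):
    """Group annotation rows by gene and build a $push update per gene."""
    per_gene = {}
    for gene_id, ann_id, id_type, description in rows:
        entry = {"DBTAG": id_type, "ID": ann_id, "Description": description}
        key = "Ontology_term" if id_type == "GO" else "Dbxref"
        per_gene.setdefault(gene_id, {}).setdefault(key, []).append(entry)
    return {
        gene: {"$push": {"attributes." + key: {"$each": entries}
                         for key, entries in anns.items()}}
        for gene, anns in per_gene.items()
    }
```

Each update could then be applied with something like collection.update_one({"attributes.ID": gene_id}, update); $push with $each appends to the attribute arrays without clobbering existing entries, assuming those attributes are already stored as arrays (the schema change discussed in this issue).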

According to the GFF3 spec, "two reserved attributes, Ontology_term and Dbxref, can be used to establish links between a GFF3 feature and a data record contained in another database" (i.e. annotations). Also, "the value of both Ontology_term and Dbxref is the ID of the cross referenced object in the form "DBTAG:ID". The DBTAG indicates which database the referenced object can be found in, and ID indicates the identifier of the object within that database". Therefore, in the MongoDB schema we include an object for each annotation declaring DBTAG, ID and optional fields such as Description. Unfortunately, this is not the JSON schema declared in the gff-toolbox convert module: the Dbxref entries generated after parsing a GFF into MongoDB do not separate the DBTAG and ID fields. We can fix this by simply adjusting the code to separate those fields before inserting the JSON into the MongoDB collection. I propose to fix this, but I need to know whether it could cause problems in other gff-toolbox modules. @fmalmeida, can it?
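As a sketch of that fix, splitting a raw Dbxref value on its first colon separates the two reserved parts (a hypothetical helper, not existing gff-toolbox code; identifiers that themselves contain colons, like PANTHER's PTHR30520:SF0, keep everything after the first colon as the ID):

```python
def split_dbxref(value):
    """Split a raw 'DBTAG:ID' cross-reference into its two reserved parts.

    Only the first ':' separates the database tag from the identifier,
    so IDs that contain colons themselves are preserved intact.
    """
    dbtag, _, obj_id = value.partition(":")
    return {"DBTAG": dbtag, "ID": obj_id}
```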

Another suggestion, on which I would like your opinion: should we decouple the "ingestion" of annotations into MongoDB - the solution proposed in this issue - from the "digestion" of a MongoDB collection into a GFF/other file format? I think another gff-toolbox module, or even gff-toolbox convert, could be the answer to this question.

@fmalmeida, let me know what you think about it and whether I can submit a pull request - I have some code that can be adjusted to become the aforementioned gff-toolbox ingest module.

fmalmeida commented 3 years ago

Based on the advantages you've described in the first part of the issue, I think it is a nice option for the package to maintain the "portability" and the ability to work with MongoDB databases built from GFF.

I like the idea you've proposed and I believe it would be a good addition to the package.

About the concern: We can fix this by simply adjusting the code to separate those fields before inserting the JSON into the MongoDB collection. I propose to fix this, but I need to know whether it could cause problems in other gff-toolbox modules. @fmalmeida, can it?

-> You can do it without worries. None of the other modules would be affected. All the other modules use the GFF files from scratch, creating a biopython GFF DB each time. The JSON is used exclusively by the MongoDB conversion function.

And about: Another suggestion, on which I would like your opinion: should we decouple the "ingestion" of annotations into MongoDB - the solution proposed in this issue - from the "digestion" of a MongoDB collection into a GFF/other file format? I think another gff-toolbox module, or even gff-toolbox convert, could be the answer to this question.

-> I think having it decoupled is the best way. Have a function that is called once to "ingest" the annotations into an existing MongoDB and save it (rewrite it), and then another function that is called at another time to convert this MongoDB back to GFF or other formats.

The way the gff-toolbox convert module is designed, it is meant to go from GFF to other files. Thus, I believe that instead of adding these functions to that module, we should create another module (with a name yet to be chosen).

Maybe something like: gff-toolbox mongo-ingest and gff-toolbox mongo-digest (as two modules). Or gff-toolbox mongo --ingest and gff-toolbox mongo --digest (as one module with two functions).
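To illustrate the "digestion" direction, a rough sketch of serializing one Mongo feature document back into a GFF3 line (the field names follow the documents shown earlier in this issue; the flattening of annotation objects back to DBTAG:ID form is an assumption, not existing code):

```python
# Hypothetical "digest" sketch: one feature document -> one GFF3 line.

def attr_value(value):
    """Flatten an attribute value; lists of {DBTAG, ID, ...} objects
    become comma-separated DBTAG:ID strings, as the GFF3 spec expects."""
    if isinstance(value, list):
        parts = []
        for item in value:
            if isinstance(item, dict):
                ident = item["ID"]
                # GO IDs already carry their tag; avoid emitting GO:GO:0006810
                parts.append(ident if ident.startswith(item["DBTAG"] + ":")
                             else item["DBTAG"] + ":" + ident)
            else:
                parts.append(str(item))
        return ",".join(parts)
    return str(value)

def doc_to_gff(doc):
    """Serialize a feature document back into a tab-separated GFF3 line."""
    cols = [doc[k] for k in ("recid", "source", "type", "start", "end",
                             "score", "strand", "phase")]
    attrs = ";".join(f"{k}={attr_value(v)}" for k, v in doc["attributes"].items())
    return "\t".join(cols + [attrs])
```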

fmalmeida commented 3 years ago

@rodtheo, does this module use a different package or add a new one? If so, we need to properly set it up in the conda package and in the README.

rodtheo commented 3 years ago

No, it uses the same Python libraries, such as pymongo, that you've been using in the other gff-toolbox modules.

fmalmeida commented 3 years ago

Perfect