Incorporate range-based and genomics data into bio4j

seandavi commented 10 years ago

Bio4j is an awesome platform with huge potential for data integration. Incorporating range-based data, for example through a tree-based indexing system, would expand its applications. This would allow Bio4j to be used more readily in genomics by enabling range-based queries (overlaps with genomic features) and rich annotation queries in the same highly flexible graph database framework. Particularly for work involving genomic variants (SNPs, small indels), determining the impact of a variant on a biological system requires a mix of range-based queries and annotations such as gene-drug interactions, variant frequencies, and gene function (GO, Pfam, etc.).
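As an illustration of the kind of tree-based range query described above, here is a minimal sketch of a static augmented interval tree. It is not Bio4j code; the `Feature`, `Node`, `build`, and `overlaps` names are hypothetical, and coordinates are assumed to be 0-based and half-open.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Feature:
    start: int  # 0-based, inclusive (assumed convention)
    end: int    # exclusive
    label: str


@dataclass
class Node:
    feature: Feature
    max_end: int  # largest end coordinate in this subtree
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(features: List[Feature]) -> Optional[Node]:
    """Build a balanced tree keyed on start, annotated with subtree max_end."""
    def rec(items: List[Feature]) -> Optional[Node]:
        if not items:
            return None
        mid = len(items) // 2
        left, right = rec(items[:mid]), rec(items[mid + 1:])
        max_end = max([items[mid].end]
                      + [child.max_end for child in (left, right) if child])
        return Node(items[mid], max_end, left, right)
    return rec(sorted(features, key=lambda f: f.start))


def overlaps(node: Optional[Node], start: int, end: int,
             out: Optional[List[Feature]] = None) -> List[Feature]:
    """Collect features overlapping [start, end), in order of start."""
    if out is None:
        out = []
    # Nothing in this subtree ends after the query start: prune it.
    if node is None or node.max_end <= start:
        return out
    overlaps(node.left, start, end, out)
    f = node.feature
    if f.start < end and start < f.end:
        out.append(f)
    # Features in the right subtree start at or after f.start.
    if f.start < end:
        overlaps(node.right, start, end, out)
    return out


# Hypothetical features; a SNP at position 180 is the query [180, 181).
tree = build([Feature(100, 500, "geneA"),
              Feature(150, 200, "exon1"),
              Feature(1000, 2000, "geneB")])
print([f.label for f in overlaps(tree, 180, 181)])
```

The features returned by such a query could then serve as entry points for the annotation queries (gene function, variant frequencies, and so on) mentioned above.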

eparejatobes commented 10 years ago

@seandavi thanks a lot!

This is an idea we have been toying with for years (see this thread for example: https://groups.google.com/d/msg/neo4j/oyOGrwO9i2g/CR5WSf7wcMkJ), but we thought it might be too much for GSoC. It is definitely interesting, though: from the CS point of view, the domain is discrete and the problem has a different flavor (http://www.mpi-inf.mpg.de/%7Ejeschmid/public/Schmidt2009a.pdf), and from the user perspective there are certainly a lot of uses for range/stabbing queries like those you mention.

I'm adding this to the ideas list; just tell me if you want to mentor or co-mentor this. cc'ing @rtobes @laughedelic @marina-manrique

eparejatobes commented 10 years ago

done: https://github.com/bio4j/gsoc14/wiki/incorporate-range-based-data-into-bio4j

feel free to edit that!

seandavi commented 10 years ago

Thanks, Eduardo. I knew it would be a lot to bite off, but thanks for thinking about it. There are a number of interval indexing schemes that might be applied in a hierarchical (graph) manner. One of the simplest to implement is:

http://genomewiki.ucsc.edu/index.php/Bin_indexing_system
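For readers unfamiliar with the scheme, the following is a rough sketch of the standard UCSC binning calculation as it is commonly described: five levels of bins, the smallest covering 128 kb, each level eight times larger than the one below, for coordinates up to 512 Mb. The function names are hypothetical, coordinates are assumed 0-based and half-open, and the linked page remains the authoritative description.

```python
# Bin number offsets per level, from the smallest (128 kb) bins upwards.
BIN_OFFSETS = [512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0]
BIN_FIRST_SHIFT = 17  # 2**17 = 128 kb, size of the smallest bins
BIN_NEXT_SHIFT = 3    # each level is 2**3 = 8 times larger


def bin_from_range(start: int, end: int) -> int:
    """Return the smallest bin that fully contains [start, end)."""
    start_bin = start >> BIN_FIRST_SHIFT
    end_bin = (end - 1) >> BIN_FIRST_SHIFT
    for offset in BIN_OFFSETS:
        if start_bin == end_bin:
            return offset + start_bin
        start_bin >>= BIN_NEXT_SHIFT
        end_bin >>= BIN_NEXT_SHIFT
    raise ValueError(f"range {start}-{end} is outside the 512 Mb limit")


def overlapping_bins(start: int, end: int) -> list:
    """Return every bin, at every level, that can hold a feature
    overlapping [start, end)."""
    start_bin = start >> BIN_FIRST_SHIFT
    end_bin = (end - 1) >> BIN_FIRST_SHIFT
    bins = []
    for offset in BIN_OFFSETS:
        bins.extend(range(offset + start_bin, offset + end_bin + 1))
        start_bin >>= BIN_NEXT_SHIFT
        end_bin >>= BIN_NEXT_SHIFT
    return bins


# A feature is stored under bin_from_range(...); a query looks up the
# candidate bins and then filters the candidates by exact coordinates.
print(bin_from_range(0, 1000))
print(overlapping_bins(131000, 131100))
```

In a graph setting, one possible mapping (an assumption, not something the scheme prescribes) is to make each bin a node, link features to their bin, and link bins to their parent bin, so that a range query becomes a lookup of a handful of bin nodes followed by a coordinate filter.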

If there is interest from your side and an interested participant, we could discuss how best to do the mentoring if it comes to that. I would be much more focused on a working implementation than on a theoretical best approach; that may or may not fit with your interests or goals, though.


eparejatobes commented 10 years ago

Sure, the emphasis here should be on getting something working with clear applications like those you mention. We have sorely missed these kinds of features ourselves, when doing comparative genomics for example (@rtobes can correct me or expand on this, I think).

It's just that I also like the more theoretical part :)

rtobes commented 10 years ago

@seandavi thanks for your interest in Bio4j

Many important annotations and features are linked to specific sequence intervals in protein sequences and in nucleotide sequences. Some examples in proteins:

Motifs (InterPro, Pfam, ...)
Sequence features annotated in UniProt (transmembrane regions, interaction regions, repetitive regions, ...)
Protein interaction regions extracted from interaction databases such as IntAct
Regions of similarity with other proteins. The similarity between two proteins is often highly significant but covers only a specific region of the protein.

Some interval-related annotations in nucleotide sequences:

genic regions, intergenic regions
intron and exon regions
operons
promoter regions
genomic context (pathogenicity islands, gene clusters, transposons)
repeats
regions upstream or downstream of genes

Many annotations worth managing are linked to specific sequence intervals. Even global functional annotations are tied to stretches of sequence rather than to the whole sequence. From our work in bacterial genomics we know that the majority of GO functional annotations on UniProt proteins come from InterPro motifs, and practically all InterPro motifs are GO-annotated. This has greatly enriched the GO annotations for bacteria, but it is important to keep in mind that all of these annotations are linked to an interval of sequence.

In many proteins with several sequence motifs, the global name and functional annotation correspond to only one of the motifs. This can confuse similarity-based function inference, since two proteins may be similar only in a region corresponding to a functional motif other than the one that gives the protein its name.

Being able to work with functional annotations associated with stretches of proteins adds a level of granularity that allows more exact inference of function and more precise comparison between proteins. Probably @epareja could add some comments about this point.
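To make the granularity point concrete, here is a small sketch of interval-level annotation transfer: given the region of a protein that is similar to a query, only the GO terms of motifs largely covered by that region are returned. Everything here is hypothetical, including the `Motif` type, the `go_terms_for_region` function, the 50% coverage threshold, and the placeholder GO identifiers; coordinates are assumed 0-based and half-open.

```python
from typing import FrozenSet, List, NamedTuple, Set


class Motif(NamedTuple):
    name: str
    start: int  # 0-based, inclusive (assumed convention)
    end: int    # exclusive
    go_terms: FrozenSet[str]


def go_terms_for_region(motifs: List[Motif], start: int, end: int,
                        min_cover: float = 0.5) -> Set[str]:
    """Return GO terms of motifs with at least `min_cover` of their
    length inside the region [start, end)."""
    terms: Set[str] = set()
    for m in motifs:
        overlap = max(0, min(end, m.end) - max(start, m.start))
        if overlap / (m.end - m.start) >= min_cover:
            terms |= m.go_terms
    return terms


# Hypothetical two-motif protein with placeholder GO identifiers.
protein_motifs = [
    Motif("motif_A", 10, 260, frozenset({"GO:A"})),
    Motif("motif_B", 300, 390, frozenset({"GO:B"})),
]

# A similarity hit covering only residues 280-400 supports transferring
# the annotations of motif_B, not those of motif_A.
print(go_terms_for_region(protein_motifs, 280, 400))
```

Whole-protein annotation would have transferred both sets of terms in this case, which is the kind of over-inference described above.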
