Integrate sequence data into bio4j

eparejatobes commented 10 years ago

https://github.com/bio4j/gsoc14/wiki/integrate-sequence-data-into-bio4j

nilesh-c commented 10 years ago

@eparejatobes Hey Eduardo,

I'm a senior year Computer Science major. I've got extensive experience with Hadoop (currently I'm working on graph ranking and kemeny rank aggregation stuff with Hadoop and Spark among other things) and a few noSQL databases. Machine learning and data science excite me.

I'd love to work on this as a part of GSoC.

I've worked on ec2 in the past, but not much with DynamoDB. I've been reading a few HOW-TOs on setup and usage - seems fairly straightforward, should be easy to pick up. Also, looks like there already is an ElasticSearch river for dynamodb (https://github.com/kzwang/elasticsearch-river-dynamodb), so that'd probably make our work easier. Question is, has DynamoDB+S3 been fixed or is this flexible and you may look at other options too?

Also, I've been messing around with Titan since the past couple of weeks, and it seems like it's pretty easy to integrate it with ElasticSearch, which means that we can store sequences directly in the graph. Thoughts?

Looking forward to collaborating on this. Should be fun. :)

eparejatobes commented 10 years ago

Hi @nilesh-c

about the technology, it all depends on what kind of functionality is going to be implemented. Just being able to get the set of sequences corresponding to the result of a query would be incredibly useful; and in this case, it is the graph that acts as the index so just storing them in DynamoDB (+S3 depending on sequence size -> DynamoDB attribute limits) will be a good solution.

Of course, if what we want is something that actually offers some degree of support for sequence-level searches, we will need to look at ElasticSearch etc. There are some issues with this approach in that what ElasticSearch (and related tools) were built for does not match exactly with what we need: think of the case where you just want to determine a huge result set for further processing.

This is a really wide topic, and what exactly is appropriate will heavily depend on the specific area/need that is going to be covered (protein sequences vs genomes, for example)

nilesh-c commented 10 years ago

Hey @eparejatobes,

Thanks for the quick follow up. I see. So if we're limiting ourselves to just querying for sequences and the like and not go into sequence-level searches, a straight-forward pointer-to-DynamoDB-keys approach should work. I can follow till there.

But what kind of issues are you talking about in case of ElasticSearch? "determine a huge result set for further processing" - what do you mean by that? Could you walk me through a short example? I'm trying to understand this better so that I can grasp the issues and think about what approaches can suit this better.

Also, I'm wondering about what my next step should be. I'm cloning the repo right now; thinking about building some sort of an uber-simple prototype for this. So, the clearer I understand this the better. :)

eparejatobes commented 10 years ago

@nilesh-c sure I can expand a bit. One simple example could be searching for bacterial genomes (around 5MB each) containing a particular sequence of some functional relevance; the result could then be literally thousands of them (and this is a really small case so to say). In this case you just want to get access to a reference to these genomes, with which you are normally going to operate in parallel. I don't know a lot about ElasticSearch but this doesn't sound like the standard use case.

An interesting point here is to think about how the (static) graph topology could help in determining an efficient partitioning of indexes etc.

About where to start, I'd maybe first think about the possible options.

HTH

nilesh-c commented 10 years ago

@eparejatobes hmm, this is interesting. A bit clearer now.

So in essence I'm searching among a set of documents (around 5MB each - this is pretty big, but ES can handle this), returning lots of results and storing just the IDs of those results so that I can later fetch them and do some parallel operations on that. Doesn't seem like something ES won't be able to handle (this will be running on servers/clouds and not weak laptops, so..). ES uses Lucene which has been used in pretty heavy stuff. And ElasticSearch's parent-child functionality (storing in JSON format and all) can be used to break down the genomes into chunks maybe (guessing, not sure until I see examples of the structures).

And I'd say that if ElasticSearch fails, a fairly good option for real-time queries would be Spark and/or Shark. Disk-backed in-memory MapReduce - magnitudes faster than Hadoop.

how the (static) graph topology could help in determining an efficient partitioning of indexes

Are you talking about intelligently sharding the data, or something else?

Thanks.

eparejatobes commented 10 years ago

hey @nilesh-c

Yes, Lucene is a solid piece of technology; but there are other factors here such as the pretty specific (and wildly different depending on the case/type) needs of string matching in biology. Probably ElasticSearch could make sense for certain really specific use cases related with protein sequences. I will try to find a comprehensive survey/review about string matching/searching in biology, and link it to the idea as a reference; this one is not ideal, but you can get some insights from it:

exact search algorithms for biological sequences

About the graph topology; yes, sharding but not only that; things like taxonomy can tell you a lot about sequence similarity for example.

About the bacterial genome example I'd just say don't take it too far, it was just an example :) anyway great that you have interest; this idea was a bit open-ended on purpose, but in the end it is probably better to have at least one more well-defined option for it. I will update it tomorrow, after discussing this with the @bio4j/bio-people

It is anyway one of our top priorities.

sorry for not replying before, busy day here

eparejatobes commented 10 years ago

hey @nilesh-c we reviewed this a bit and I've just updated the idea

https://github.com/bio4j/gsoc14/wiki/integrate-sequence-data-into-bio4j

Let me know if you have any questions or something that you'd like me to expand

nilesh-c commented 10 years ago

Hi @eparejatobes,

Thanks a lot for the updates! I think I see what you mean by "things like taxonomy can tell you a lot about sequence similarity for example". I briefly read the exact search algorithms paper you linked to and now I'm getting the picture. Though ElasticSearch has provisions for custom tokenizers and analyzers, I now have my doubts about using it for this use-case; we'd have a hard time doing many kinds of stuff (like different domain specific techniques) with it and it'll probably work against us.

I'm easing into these tools and concepts, and I feel that the problem with BLAST is: it can't be used for queries over a distributed noSQL database can it? Seems like one needs to format it in a specific way to make BLAST work as fast as it does. Do you have any ideas on how to integrate it with noSQL databases (or say, the DynamoDB-S3 design)?

Also, a few months back I had studied a few papers about locality sensitive hashing (in the context of entity resolution, string similarity etc.) - LSH is easy to implement with MapReduce and works well too, so I tried searching for literature using LSH for gene sequence matching. I came across these:

Efficient Large-Scale Sequence Comparison by Locality-Sensitive Hashing (Page 1 is enough to get an idea of their motivation)
Going Organic - Genomic sequence alignment in Elasticsearch

Ignore the ElasticSearch part, but they seem to be loving LSH for this. I think it might be a great idea to do something like this with Spark for doing the sequence searches. Again, I might be completely off-track here, so do point me to the right direction in that case. :)

Here just exact matches could be of use in some contexts, even more if it can be combined with graph traversals.

Could you elaborate a bit on this? Do you mean graph traversal during searching? Could you give me an example?

Also, a broad question - roughly how much time would the users expect it to take for such sequence matching queries? < 1 sec/1-2 sec/50-60 sec/more?

eparejatobes commented 10 years ago

Hi @nilesh-c

good to see that your thinking about it.

I'm easing into these tools and concepts, and I feel that the problem with BLAST is: it can't be used for queries over a distributed noSQL database can it? Seems like one needs to format it in a specific way to make BLAST work as fast as it does. Do you have any ideas on how to integrate it with noSQL databases (or say, the DynamoDB-S3 design)?

of course it cannot be directly integrated. BLAST (or LAST) requires a previous step of reference preprocessing. What I have in mind is more an integration at the API level. Imagine a traversal which would yield as output a set of proteins ref; and you have a protein sequence q which you want to use as query for that reference. Then, you could distribute the execution of BLAST queries against (already prepared) databases containing the sequences in ref (AFAIK you can filter the reference through blastp -gilist for example). This could return different protein nodes (and their sequences if needed) giving you the option of chaining it with other traversals etc. We also have the added advantage here given by UniRef which is basically a precomputed clustering of protein sequences at different levels of similarity, which could be used as an index or as a hash-like structure (first match only representatives then go further etc).

But anyway this won't use directly the DynamoDB+S3 component; it could use it for generating the BLAST dbs though.

Ignore the ElasticSearch part, but they seem to be loving LSH for this. I think it might be a great idea to do something like this with Spark for doing the sequence searches. Again, I might be completely off-track here, so do point me to the right direction in that case. :)

I like the basic idea behind LHS but it's not going to work here. The main issue is that gaps are key, and a model based only in substitutions is not going to work. The results here for example in comparison to BLAST are certainly not promising:

Scalable Protein Sequence Similarity Search using Locality-Sensitive Hashing and MapReduce

I could see something like this working for matching against a family of sequences with a low level of local entropy and pretty homogeneous length; but then in that case it would only be a preliminary step because after a first cluster-assignment phase so to say you'd need to actually look at finer differences; normally is there where the interesting biological aspects lie.

I don't know maybe @rtobes can say something here but protein similarity is kind of hard and there's the social factor: people already know what BLAST-like results mean, and how to tune it so that they can assign meaning to the results.

Could you elaborate a bit on this? Do you mean graph traversal during searching? Could you give me an example?

I mean for example

search for QCDN in all proteins
starting from those results, execute a graph traversal giving you proteins again as a result
with the output of that, search for other string GGSP in those results

ElasticSearch-like integration could work here.

Also, a broad question - roughly how much time would the users expect it to take for such sequence matching queries? < 1 sec/1-2 sec/50-60 sec/more?

Well the users would be basically APIs so not an issue; of course reasonable performance is a goal. Anyway in the bioinformatics space people are used to wait :)

HTH, all this is a pretty wide (but interesting) space. I'm really busy with grant proposals and stuff but we could have a skype call or something if you're interested, it could be more efficient

nilesh-c commented 10 years ago

Hi @eparejatobes

Sorry about not being able to respond sooner, got caught up with assignments.

Then, you could distribute the execution of BLAST queries against (already prepared) databases containing the sequences in ref

So I searched around about this, and seems like a few people have already used Hadoop to distribute BLAST over multiple nodes. Looks like they're simply storing the generated BLAST DBs on S3 (HDFS). What BLAST does: "Given a set of k sequences S={s1, ..., sk}, the program compares each sequence si to a database of n sequences D={ds1, ..., dsn} and calculates the statistical significance of matches." We can see that both D and S can be easily distributed, BLAST called from mappers, results aggregated using reducers, and scaling shouldn't be a problem.

Here are the papers I looked at:

CloudBLAST: Combining MapReduce and Virtualization on Distributed Resources for Bioinformatics Applications If you start reading from the last paragraph on pg. 2 you'll get a pretty good idea of the motivation for this. The speedup and scalability look good, but can be improved if we use the Java API instead of Hadoop Streaming+Python(calling BLAST from python).
Biodoop: Bioinformatics on Hadoop They also use a similar approach. You might want check out A. BLAST from pg. 4. The tail effect on their machine load graphs show that most probably their implementation is spending a lot of time on I/O - look at the bigger tail on 69 nodes vs 13 nodes. I think this is because of their use of Hadoop Streaming, which has additional overheads.
BlastReduce: High  Performance  Short  Read  Mapping  with  MapReduce Their abstract sounds impressive though I think the approach used in the previous two papers would be much better and safer for a number of reasons. These guys build their BLAST-based framework over MapReduce purely with Java.

I really like your idea of the API level integration, abstract away the distributed searching, DB generating etc. in the lower layer and let the user work with a seamless API to let him/her traverse the graph, get results, play with them, query on subsets etc.

The ElasticSearch use-case you just mentioned sounds a bit different than the above, that is to say it's much more simpler? I'm thinking on the lines of combining the two approaches (EC wouldn't take much trouble to setup I'm hoping, it's the distributed BLAST which will need some working and tweaking :)) and use each for a different kind of query respectively. Am I on the right track?

A Skype/IRC chat would be great. Are you available on the weekends? Are you on UTC+1:00? I live in India, on UTC+5:30.

Update: Since the datasets we're dealing with are in the GB/TB scale, and we need real-time queries, I think Spark (more recent MapReduce implementation that's been getting a lot of attention) might be a better fit for this than Hadoop and even result in better performance than the latter. Also, Spark's API pertains to easier data manipulation and analytics, faster code refactoring etc. as opposed to Hadoop's write-multiple-classes-for-a-single-job nuisance.

eparejatobes commented 10 years ago

Hey @nilesh-c

Yep CET = UTC+1 at this time of the year. Sunday is OK, mail me

ajmazurie commented 9 years ago

Hi folks, I am quite interested in this discussion, and was wondering if you came up with an idea about the best architecture for a BLAST-like query on a Bio4j (or Neo4j) hosted set of sequences. I am interested in exploring graphs of reads and contigs collected across samples, and querying those objects based on their sequences is high in our features request list.

Based on your discussion and my own exploration, I can see at least two solutions: (a) deploying a sequence alignment as a map/reduce job connected to a Neo4j database (b) storing sequences outside of the Neo4j store, and running a regular BLAST on the former

Solution (b) would work for sure, but (a) seems more elegant, if the current state of the technologies involved is advanced enough. What are your thought?

Best, Aurelien

eparejatobes commented 9 years ago

hey @ajmazurie sorry missed this. Check https://github.com/bio4j/gsoc15 and the gitter channel!

bio4j / gsoc14

Integrate sequence data into bio4j #8