nsheff opened this issue 5 years ago
I understand that the use of segmentation, the associated annotations, and matching of a query set of regions to a segmentation is efficient. But this complexity opens many questions about how to handle partial overlaps, like in this issue, etc.
Conceptually, each base can be considered as a segment that has (multiple) annotations (for example, a base annotated as a SNP, a variant allele frequency, a chromatin state, etc.). This concept eliminates the need for partial overlap. For a query region, one calculates the proportion of bases matched to a particular annotation. So, a query region may be 100% matched with one annotation, 60% annotated with another, etc. Any partial overlap is simply expressed as a % of a given annotation. The disadvantages of using each base as a segment are obvious - huge space and computational inefficiency. But if we eliminate the ~50% of the human genome that is low complexity, or the whole heterochromatin portion (~80%), this may be feasible and, imho, will simplify many things.
Originally posted by @oddodaoddo in https://github.com/databio/episb-provider/issues/15#issuecomment-450868502 (modified to be relevant to segmentations)
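A minimal sketch of the per-base idea above, assuming each annotation is held as a boolean mask over a chromosome (the track name and coordinates here are made up for illustration):

```python
import numpy as np

# Hypothetical per-base annotation track for one chromosome:
# mask[i] is True if base i carries the annotation.
chrom_len = 1_000_000
h3k27ac = np.zeros(chrom_len, dtype=bool)
h3k27ac[10_000:12_000] = True  # toy annotated interval

def annotation_fraction(mask: np.ndarray, start: int, end: int) -> float:
    """Fraction of bases in [start, end) covered by the annotation."""
    if end <= start:
        return 0.0
    return float(mask[start:end].mean())

# A query region half-overlapping the annotated interval:
print(annotation_fraction(h3k27ac, 11_000, 13_000))  # 0.5 -> "50% annotated"
```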
In a scenario where we are ~~adding an experiment~~ matching a segmentation, we could (and most likely will) have hundreds of thousands of ~~annotation values~~ regions that we want to match against segments within segmentations.
This means searching in a large space. Even if a segmentation is just a list of segments, we may be talking about a segmentation that has millions of segments in it.
We also can have millions of lines in an experiment bed file. Each of these needs to be looked up against a segmentation.
We also have the added requirement of everything being split up and exposed only via REST APIs (which means hitting a server and incurring network traffic and possible latencies etc.).
There are a few ways to approach the above problem:
1) Ask for a segmentation in full and search it locally, in memory, while we are going through the experiment bed file (see the sketch after this list). Advantage: reduced network load, avoiding timeouts incurred by repeatedly hitting the REST server. Disadvantages: the memory requirement may be huge, we may need to convert segmentations internally to hash tables to speed up searches, and what if we need to search multiple segmentations...? Also, the search is internal and limited to the machine running the "add experiment" API, unless we somehow federate/split the processing across a cluster.
2) Read the experiment bed file line by line and ask Elasticsearch to do the matching. Advantage: avoids the disadvantages of approach 1) above. Disadvantages: huge network load, dealing with possible network timeouts and outages.
3) Bulk search in Elastic - send portions of the annotation bed file to Elastic and figure out if there is a way to do a bulk search/retrieve. This may be a "middle of the road" approach between 1) and 2) above.
4) Use some kind of a distributed in-memory grid such as Apache Ignite, sitting on top of Elastic (I have experience in this). Advantages: we can keep a lot of segmentations/annotations in distributed memory across an Ignite cluster and search this distributed memory at local memory/CPU speeds. Touching Elastic would be required only when writing into it or on a cache miss. Disadvantage: it will take a few days to set up Ignite and get it going.
5) Any suggestions? What am I missing?
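A minimal sketch of what option 1 could look like: keep the segmentation in memory as per-chromosome lists sorted by start, and binary-search each experiment region against them. The data layout, names, and the max-segment-length shortcut are all assumptions for illustration, not episb's actual schema:

```python
import bisect
from collections import defaultdict

# Hypothetical layout: per-chromosome lists of (start, end, segment_id),
# kept sorted by start coordinate.
segmentation = defaultdict(list)

def add_segment(chrom, start, end, seg_id):
    bisect.insort(segmentation[chrom], (start, end, seg_id))

def overlapping_segments(chrom, q_start, q_end, max_seg_len=100_000):
    """Return ids of segments overlapping [q_start, q_end).

    max_seg_len bounds how far left of q_start a segment's start can be
    while still reaching into the query; it lets us avoid scanning the
    whole per-chromosome list (a simplifying assumption).
    """
    segs = segmentation[chrom]
    # First segment whose start could still overlap the query.
    i = bisect.bisect_left(segs, (q_start - max_seg_len,))
    hits = []
    for start, end, seg_id in segs[i:]:
        if start >= q_end:   # sorted by start: nothing further can overlap
            break
        if end > q_start:    # actual overlap with the query
            hits.append(seg_id)
    return hits

add_segment("chr1", 100, 500, "seg-1")
add_segment("chr1", 450, 900, "seg-2")
print(overlapping_segments("chr1", 480, 600))  # ['seg-1', 'seg-2']
```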
A typical region list we want to assign to a segmentation will have hundreds of thousands of regions. Uploading this file in zipped form wouldn't be too bad.
Most likely we do this actual match computation outside of Elastic, but it's possible we could explore using Elastic to facilitate. We could institute a file size limit if that becomes an issue (people uploading huge region sets)... but uploading a single, zipped bed file should not be an issue at all. We're talking about a few-megabyte file, typically.
I think clearly we don't do option 2. We do option 1, and possibly option 3. I am not at all familiar with option 4, but my first impression is that it sounds like overkill for this problem.
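If we do explore option 3, Elasticsearch's `_msearch` endpoint can batch many overlap queries into a single request. A rough sketch, assuming a `segments` index with `chrom`/`start`/`end` fields (the index name, field names, and URL are assumptions), using the standard interval-overlap condition segment.start < query.end AND segment.end > query.start:

```python
import json
import requests  # plain HTTP here; the official client's msearch would also work

ES_URL = "http://localhost:9200/_msearch"  # assumed local cluster
INDEX = "segments"                         # hypothetical index name

def overlap_query(chrom, q_start, q_end):
    """Match segments that overlap [q_start, q_end) on chrom."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"chrom": chrom}},
                    {"range": {"start": {"lt": q_end}}},
                    {"range": {"end": {"gt": q_start}}},
                ]
            }
        }
    }

def msearch(regions):
    """Send one batched _msearch request for a chunk of BED regions."""
    lines = []
    for chrom, start, end in regions:
        lines.append(json.dumps({"index": INDEX}))                # header line
        lines.append(json.dumps(overlap_query(chrom, start, end)))
    body = "\n".join(lines) + "\n"                                # NDJSON must end with newline
    resp = requests.post(ES_URL, data=body,
                         headers={"Content-Type": "application/x-ndjson"})
    return resp.json()["responses"]  # one response per query region

# Chunks of a few thousand regions per request, rather than one request per bed line.
```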
Sure, it is your call. The DHS file that we derived a segmentation from had a few million lines. Uploading a file is not really an issue; it is what happens when someone wants to get the experiment added/matched to a segmentation. I am not sure how Elastic behaves in a bulk search, but my thought is that it already has distributed searching capabilities - did not want to re-invent the wheel :)
Can it do something along the lines of my suggestion at the beginning of this thread?
Your approach is useful I think as a separate thing. This would be one way to 'annotate' a particular region (or region set). I think that's a totally legitimate use of episb and should be one of our query types -- in fact, I think your comment really belongs in this issue: https://github.com/databio/episb-hub/issues/6. This would be one approach to that query.
But to me, this doesn't solve every possible use of episb... For example, what about these other query types:
If data is stored only in raw form with annotations as you suggest, these types of queries will become intractable.
Also, in my mind, there's no reason the space of existing segmentations can't become complex enough to almost perfectly approximate the total set of regions, while dramatically improving efficiency. Really, this is what we do anyway when we need to analyze a signal in a shared coordinate system.
As proposed in #9 and raised in #18, we need a 'match' function that will give you Segmentation elements for a query region set.
It's relatively straightforward for an exact match, but if matches aren't exact, we need to develop some heuristics to map across regions. Here are a few thoughts/ideas:
Match score for individual region
We seek to develop a score, call it m, that measures the level of match between two regions, query region A and candidate matching segment B. We could say a match requires a minimum of x% overlap, or minimum absolute base overlap (these could be parameters). This will need to be reciprocal -- the x% of query region A overlapping candidate Segment B, and the y% of candidate Segment B that overlaps query region A.
A reciprocal score m could then be `x * y`, which would measure the total reciprocal overlap. I think multiplying is better than adding because I'd prefer `.6 * .6` (`.36` vs `1.2`) over `1 * .3` (`.3` vs `1.3`) in terms of matching (addition would favor matches that are wholly contained one way or another). But in fact, we can do something like `m = ax + by + cxy` and then tweak the coefficients, which would emphasize different match characteristics. I wonder if we could learn optimal coefficients by some kind of subsampling or permutation test, at least for a particular data type.
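A sketch of this score, with the coefficients a, b, c exposed as parameters (interval arithmetic assumes half-open [start, end) regions; the defaults are just one choice):

```python
def match_score(query, segment, a=0.0, b=0.0, c=1.0):
    """m = a*x + b*y + c*x*y, where x is the fraction of the query covered
    by the segment and y the fraction of the segment covered by the query.
    Defaults (a=b=0, c=1) give the pure reciprocal product m = x*y."""
    q_start, q_end = query
    s_start, s_end = segment
    overlap = max(0, min(q_end, s_end) - max(q_start, s_start))
    if overlap == 0:
        return 0.0
    x = overlap / (q_end - q_start)  # share of the query that matches
    y = overlap / (s_end - s_start)  # share of the segment that matches
    return a * x + b * y + c * x * y

# The example above: a 60/60 reciprocal overlap beats full containment
# of a small segment (0.36 vs 0.30 with the default product score).
print(match_score((0, 100), (40, 140)))  # x = y = 0.6 -> 0.36
print(match_score((0, 100), (35, 65)))   # x = 0.3, y = 1.0 -> 0.30
```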
Return either the best match, or all that meet the requirements for further filtering
Match score extended to region set
We could sum these individual m values for each region in a set, and then compare these sums between competing Segmentations. But the strength of a match is a combination of the number of regions matching plus the score for each individual match. So again, some kind of combined metric might be useful here.
Could we use information from other Regions in a given Segmentation to inform the result for a particular Region? We can use the database of Segmentations to guide matches; for example, if one segmentation has lots of matches, and has a segment that hits the number 2 spot for a particular region, we might assign it to that one even if there is a better hit, to maintain consistency within segmentations.
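A sketch of the set-level comparison, reusing match_score from the sketch above; the thresholding and the (count, sum) combination are just one possible rule, and a real implementation would use an indexed lookup rather than this brute-force inner loop:

```python
def segmentation_score(regions, segments, min_m=0.1, a=0.0, b=0.0, c=1.0):
    """Score a whole query region set against one segmentation.

    For each query region, take the best per-region match score m; a region
    counts as matched if its best m clears min_m. Returns both the number of
    matched regions and the sum of their scores, so callers can combine them
    however they like when ranking segmentations.
    """
    n_matched, total = 0, 0.0
    for region in regions:
        best = max((match_score(region, seg, a, b, c) for seg in segments),
                   default=0.0)
        if best >= min_m:
            n_matched += 1
            total += best
    return n_matched, total

# Ranking candidates by (fraction matched, summed score), for example:
# scores = {name: segmentation_score(query_set, segs)
#           for name, segs in candidate_segmentations.items()}
```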
Implementation details
Should the match function happen inside the database, or outside? If it gets really complicated, it might need to be done by external software.
Flexible segments
What if we introduce the concept of Flexible Segments: instead of having a `start` and `end` coordinate, they have `start` and `end` ranges? This could be analogous to a confidence interval vs. a point estimate. A canonical representation of a flexible segment could simply use the midpoint of the start range and the midpoint of the end range. This would facilitate flexible matching, and let's be honest, for many region types we're not entirely certain of the exact boundary. Computing a match score m on a flexible segment would need to be done a bit differently.
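A sketch of how a Flexible Segment might be represented, with ranges for both boundaries and the midpoint canonicalization described above (field names and the permissive overlap rule are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class FlexibleSegment:
    """A segment whose boundaries are ranges rather than points,
    like a confidence interval around each end."""
    start_lo: int
    start_hi: int
    end_lo: int
    end_hi: int

    def canonical(self):
        """Point representation: midpoint of each boundary range."""
        return ((self.start_lo + self.start_hi) // 2,
                (self.end_lo + self.end_hi) // 2)

    def overlaps_flexibly(self, q_start, q_end):
        """Most permissive reading: does the query overlap the widest
        extent the segment could possibly take?"""
        return q_start < self.end_hi and q_end > self.start_lo

seg = FlexibleSegment(start_lo=90, start_hi=110, end_lo=490, end_hi=510)
print(seg.canonical())                  # (100, 500)
print(seg.overlaps_flexibly(505, 600))  # True: falls inside the end range
```

One could then compute m against the canonical point form for ranking, while using the permissive check to avoid discarding candidates whose boundary uncertainty still admits a match.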