GMOD / Apollo

Genome annotation editor with a Java Server backend and a Javascript client that runs in a web browser as a JBrowse plugin.
http://genomearchitect.readthedocs.io/

Projection v3 #950

Closed nathandunn closed 8 years ago

nathandunn commented 8 years ago

Related to changes from #673.

image

image

nathandunn commented 8 years ago

Branches to merge in to get everything working:

nathandunn/

GMOD/

nathandunn commented 8 years ago

Could be done in HTMLFeatures (or a subclass), but all of the edge-cases are already solved with the current implementation. There is also no reason not to allow a mixture of both.

nathandunn commented 8 years ago

I was looking at padding, but then I realized we had to look at folding as well, and that we could have multiple 'paddings' and foldings. It is better to let the library do these kinds of calculations.

nathandunn commented 8 years ago

So I split these up into the top being "probably" and the bottom being ideal. Since this is done over two tracks, we just need to make sure that the tracks have some sort of awareness of each other.

image

nathandunn commented 8 years ago

Not entirely what we want, but good enough for screen shots with edge detection

screen shot 2016-05-02 at 1 06 21 pm
screen shot 2016-05-02 at 1 06 15 pm

cmdcolin commented 8 years ago

@nathandunn I brought up during the meeting that we didn't recall having a scheme to annotate across scaffold boundaries yet. Is that correct?

nathandunn commented 8 years ago

The scheme we'd agreed to a while ago was based on the following use-cases:

  1. If we have a transcript with exons split across multiple scaffolds, then each exon feature has a single feature location on its respective scaffold.
  2. If we have an exon which is split across multiple scaffolds, the exon then has two feature locations, one for each side of the split.

The only thing I would be unsure about is whether the transcripts / genes then naturally inherit multiple feature locations or simply infer them from their respective sub-features. My intuition is that the former would be easier.
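To make the "infer from sub-features" option concrete, here is a minimal sketch. It assumes a simplified, hypothetical `FeatureLocation` record (scaffold name, `fmin`, `fmax`) rather than Apollo's actual domain classes, and the helper name `infer_parent_locations` is invented for illustration. It derives one parent location per scaffold from the min and max of the child locations on that scaffold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureLocation:
    """Hypothetical stand-in for a feature location: a span on one scaffold."""
    scaffold: str
    fmin: int  # start of the span on the scaffold
    fmax: int  # end of the span on the scaffold


def infer_parent_locations(child_locations):
    """Infer one parent location per scaffold from the children's locations.

    Each inferred location runs from the smallest fmin to the largest fmax
    seen on that scaffold. Scaffolds are returned in first-seen order.
    """
    bounds = {}
    for loc in child_locations:
        if loc.scaffold in bounds:
            lo, hi = bounds[loc.scaffold]
            bounds[loc.scaffold] = (min(lo, loc.fmin), max(hi, loc.fmax))
        else:
            bounds[loc.scaffold] = (loc.fmin, loc.fmax)
    return [FeatureLocation(s, lo, hi) for s, (lo, hi) in bounds.items()]


# Use-case 1: exons that each sit wholly on one scaffold.
exon_locations = [
    FeatureLocation("scaffoldA", 800, 950),
    FeatureLocation("scaffoldA", 400, 600),
    FeatureLocation("scaffoldB", 100, 250),
]
transcript_locations = infer_parent_locations(exon_locations)
```

With this approach the transcript never stores its own locations, so they cannot drift out of sync with the exons, at the cost of recomputing them whenever they are needed.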

cmdcolin commented 8 years ago

In my view, a good reference example is this Chado feature location graph:

http://gmod.org/wiki/File:Featureloc-graph-example.png

I think it is better to get the annotation editor functions to just use the "virtual scaffold" (e.g. the group 1 feature locs from that image) rather than try to reason about all the individual different feature locations that are sprawled across different scaffolds (the group 0 feature locs from that image). It might be an interesting problem to "synchronize" them, but to even get basic functionality, the annotation editing reasoning should be based on the "virtual scaffold"

nathandunn commented 8 years ago

The way I'm implementing projection is a virtualization of 1 or more scaffolds. However, everything is stored on unprojected scaffolds so that they will show up in other views readily. If we created a virtual scaffold (that sat in its own space) I would have to infer it back anyway if you reconstruct it. However, feel free to generate combined scaffolds if that is your use-case.
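As a rough illustration of a projection that virtualizes several scaffolds while features stay stored in scaffold-local coordinates, here is a minimal sketch. The `Projection` class and its method names are hypothetical, not Apollo's API, and the sketch assumes scaffolds are simply laid end to end with no padding or folding.

```python
from bisect import bisect_right
from itertools import accumulate


class Projection:
    """Hypothetical projection of an ordered list of scaffolds onto one virtual axis."""

    def __init__(self, scaffolds):
        # scaffolds: ordered list of (name, length) pairs
        self.names = [name for name, _ in scaffolds]
        lengths = [length for _, length in scaffolds]
        # offsets[i] is the virtual coordinate at which scaffold i starts
        self.offsets = [0] + list(accumulate(lengths))[:-1]
        self.total = sum(lengths)

    def project(self, scaffold, position):
        """Scaffold-local coordinate -> virtual coordinate."""
        return self.offsets[self.names.index(scaffold)] + position

    def unproject(self, virtual_position):
        """Virtual coordinate -> (scaffold, scaffold-local coordinate)."""
        if not 0 <= virtual_position < self.total:
            raise ValueError("position outside the projected region")
        i = bisect_right(self.offsets, virtual_position) - 1
        return self.names[i], virtual_position - self.offsets[i]


projection = Projection([("scaffoldA", 1000), ("scaffoldB", 500)])
virtual = projection.project("scaffoldB", 100)
stored = projection.unproject(virtual)
```

Because `unproject` recovers the scaffold-local coordinate, anything edited in the projected view can still be written back against the unprojected scaffolds.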

WRT graphs, it's something to keep in mind and we might end up doing it down the road, but I want to include things like variants, alternate sequences, etc.

cmdcolin commented 8 years ago

I guess I just wanted to see what the status was, because I didn't see any items and our meeting suggested this was finished, when I think it deserves consideration. I understand that there are different use cases for both the "group 1" approach I mentioned and the "group 0" approach, but I'd also assert that doing things properly in the group 0 scenario is difficult, because it introduces multiple feature locations per object (whether that object is a gene, transcript, or exon). If we only store feature locs on the exons, that is complicated in one way, and if we end up with multiple sequence locations for a gene or transcript, that is complicated in another. In either case, we have some new challenges to solve.

cmdcolin commented 8 years ago

And as I mentioned, there are alternatives: just use the virtual space, or make the virtual space some smaller new thing like the alt loci patches. For example the UCSC browser's usage of alt loci says "Note that annotations from the standard browser display do not extend into or out of the haplotype region, since haplotypes are annotated separately from regular chromosomes." http://genome.ucsc.edu/goldenPath/help/multiRegionHelp.html#Haplotype

That implies that there is not even really a problem with things having multiple feature locations there because they accept that limitation

nathandunn commented 8 years ago

I'm not sure what you're proposing. When we'd talked about this before, what I outlined was what we were going forward with. I'll see how far we get with it and we can consider alternatives if it runs into any significant snags. If you want to do something else for the LCSA(sp?) I don't think that this solution precludes that.

cmdcolin commented 8 years ago

I don't have any proposal currently. I am basically just raising the concern that this still remains unclarified AFAIK. The most thorough time that we reviewed this appears to come from a meeting "Fwd: Apollo conference call notes - 12.03.2015"

This email thread is good but it actually is still really only looking at the object model. Our code still needs review to see if any of those suggestions can be added to the annotation functions because the annotation functions of course contain lots of embedded assumptions

nathandunn commented 8 years ago

I think the easiest way to clarify it would be to implement it and see if we run into any show stoppers, as I'm certain there are multiple ways to do this and get good results. Currently I have not run into show stoppers, so I'll keep going this way. It's going to be a lot of work, regardless.

nathandunn commented 8 years ago

Looked at 4 use-cases:

  1. Multi-scaffold transcript crossing, exons not. Transcript / Genes are defined using two FeatureLocations, where the max and min are the max and min of the scaffolds where they cross. Exons are defined with a single feature location.
  2. Multi-scaffold transcript crossing where exon is crossing. Same as above, but the FeatureLocation is defined in two places.
  3. Multi-scaffold, single gene projected regions merged. The Transcript / Feature-Location is merged to the max / min of the other scaffold after discussion with @selewis. This is good as I think it would have been rather ambiguous otherwise.
  4. Multi-scaffold, single gene projected regions merged across an exon. The transcript max / min region is defined at the point of the exon in the scaffold, and so is well-defined. The exon has two feature locations.
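The difference between an exon with one feature location and an exon with two can be sketched as a split of a projected interval at scaffold boundaries. This is only an illustration under assumed conventions: the function name `split_across_scaffolds` is hypothetical, scaffolds are laid end to end, and intervals are treated as half-open.

```python
def split_across_scaffolds(scaffolds, vmin, vmax):
    """Split the virtual interval [vmin, vmax) into per-scaffold pieces.

    scaffolds: ordered (name, length) pairs laid end to end.
    Returns a list of (name, fmin, fmax) in scaffold-local coordinates.
    """
    pieces = []
    offset = 0  # virtual coordinate at which the current scaffold starts
    for name, length in scaffolds:
        start = max(vmin, offset)
        end = min(vmax, offset + length)
        if start < end:  # the interval overlaps this scaffold
            pieces.append((name, start - offset, end - offset))
        offset += length
    return pieces


layout = [("scaffoldA", 1000), ("scaffoldB", 500)]

# Exon contained in one scaffold: a single feature location.
exon_within = split_across_scaffolds(layout, 400, 600)

# Exon crossing the scaffold boundary: two feature locations.
exon_crossing = split_across_scaffolds(layout, 900, 1100)
```

Here `exon_within` yields one piece on `scaffoldA`, while `exon_crossing` yields one piece ending at the end of `scaffoldA` and one starting at the beginning of `scaffoldB`.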

nathandunn commented 8 years ago

Converted to single issues.