
MoMA - the MotherMachine Analyzer

idea: filter reliable data without manual curation #22

Open julou opened 8 years ago

julou commented 8 years ago

Would it be possible to use a score for each segment/assignment to identify a subset of the lineage that has a very high probability of being correct? If the fraction of the data extracted is large enough (and without too frequent breaks along branches), we could skip the systematic curation.
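For illustration, a minimal sketch of what such a filter could look like, assuming each assignment carries a confidence score in [0, 1]; the names and the flat-list representation are hypothetical, not MoMA's actual API:

```python
# Hypothetical sketch of the proposed filter; `scores` is an assumed list of
# per-assignment confidences, not a MoMA data structure.

def filter_reliable(scores, threshold=0.99):
    """Keep only assignments whose score clears the threshold; report the kept fraction."""
    reliable = [i for i, s in enumerate(scores) if s >= threshold]
    fraction_kept = len(reliable) / len(scores) if scores else 0.0
    return reliable, fraction_kept

# e.g. filter_reliable([0.999, 0.98, 0.9995], threshold=0.99) -> ([0, 2], 0.666...)
```

If `fraction_kept` stays large enough for a given dataset, systematic curation could in principle be skipped for it.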

erikvannimwegen commented 8 years ago

Explanation: This is an issue with potentially very high pay-off, but it is also completely unclear to us how feasible it is. However, if there were some way to estimate, based on the scores that MoMA uses internally, and maybe even by comparing with curated datasets, which parts of a movie are almost guaranteed to be 'error free', then one could run MoMA in a mode where, for each GC, it only outputs the stats for those parts of the movie it is very confident are correct. This would allow users who have lots of data to attempt a run without interactive curation. For such users, just keeping the confident parts, even when this amounts to throwing away most of the data, might be very attractive.

fjug commented 7 years ago

It is a very nice idea, but unfortunately hard to realize. Imagine we could determine which parts of our solution are correct (with high probability). Then I could devise an algorithm that takes a region that is likely wrong and generates a different solution than the current one. This I would repeat until the predictor of correctness is satisfied, and I would end up with a correct overall solution (with high probability). The existence of this simple algorithm essentially proves that the desired 'reliable data filter' is at least as hard to build as a tracker that finds the correct solution in the first place. Does my argument make sense to you?
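To make the reduction argument concrete, here is a sketch of the hypothetical repair loop described above; `is_reliable` and `next_best_solution` are placeholders for components that do not exist in MoMA:

```python
# Sketch of the argued reduction: a perfect reliability filter would let this
# loop converge to a (probably) correct overall solution, i.e. it would be a
# tracker. Both callables are hypothetical.

def repair_until_reliable(solution, regions, is_reliable, next_best_solution):
    """If a perfect reliability filter existed, this loop would turn it into a correct tracker."""
    for region in regions:
        while not is_reliable(solution, region):             # filter flags the region as likely wrong
            solution = next_best_solution(solution, region)  # propose an alternative solution there
    return solution
```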

erikvannimwegen commented 7 years ago

I understand the argument, but it is simply mistaken. For a typical movie, there are going to be parts that are more unambiguous and parts that are more ambiguous. Very roughly speaking, I would expect that in the unambiguous parts the optimal solution has a much better score than any other solution; that is what makes it an unambiguous part of the movie. In contrast, in the parts of the movie that are more ambiguous, there are going to be more solutions with similar scores, and probably none with a very high score. So the general idea is that, if for a particular segment of the movie the score is very high, then you can be very confident there is no problem in that segment. In contrast, if you find a segment with a so-so score, then it is much more likely that there are errors, but that does not help you find a better solution for that segment.

Actually, these ideas are really very simple for things like sequence alignments. If you align two sequences and find a long segment where the sequences are perfectly identical, i.e. no gaps and no mismatches, then you are of course very confident the alignment is correct in that area. Whereas in an area with many gaps and mismatches you will feel much less confident, but that does not help you find a better alignment. In fact it is very common for people to make big alignments and then, for getting some statistics, focus on the parts that are very likely correct. So these kinds of things are not unheard of at all.
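A sketch of this score-margin notion of ambiguity, purely illustrative; the `score(solution, region)` callable and the enumeration of candidate solutions are assumptions, not anything MoMA exposes:

```python
# A region is 'unambiguous' when the best-scoring solution is far above the
# runner-up within that region, and 'ambiguous' when the top scores are close.

def region_margin(candidate_solutions, region, score):
    """Gap between the best and second-best score within one region (large gap = high confidence)."""
    ranked = sorted((score(s, region) for s in candidate_solutions), reverse=True)
    if len(ranked) < 2:
        return float("inf")       # only one candidate: trivially unambiguous
    return ranked[0] - ranked[1]
```

A large margin justifies keeping that region without curation, while a small margin only flags it for manual checking; it does not by itself yield a better solution there.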

erikvannimwegen commented 7 years ago

But don't get me wrong. I can understand this is hard. I just wanted to put it on your radar screen.

julou commented 7 years ago

Let me check I'm still on board: Erik, when you write "segment", you mean subtrees rather than time windows, correct? Because while it would be helpful to automatically discard/prune those segments/assignments that are uncertain (and their progeny), it would not be useful if we end up with several short time series. Btw, this means that any uncertainty on the bottom cell will need to be checked manually…
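For what it's worth, a small sketch of the subtree pruning described here, assuming a hypothetical lineage tree with a per-assignment confidence score; once an uncertain assignment is hit, that cell and its whole progeny are dropped, so only contiguous, reliable prefixes of each branch remain:

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative only: the Node structure and per-node `score` are assumptions,
# not MoMA's internal lineage representation.

@dataclass
class Node:
    cell_id: str
    score: float                          # confidence of the assignment leading to this cell
    children: List["Node"] = field(default_factory=list)

def prune_uncertain(node: Node, threshold: float) -> Optional[Node]:
    """Drop a cell and its entire progeny as soon as its assignment is uncertain."""
    if node.score < threshold:
        return None                       # uncertain assignment: discard this whole subtree
    kept = [p for p in (prune_uncertain(c, threshold) for c in node.children) if p is not None]
    return Node(node.cell_id, node.score, kept)
```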