MartinThoma opened 9 years ago
Segmentation by strokes seems to be possible in a mathematical context.
Be aware that strokes are not necessarily in symbol order (e.g. fraction strokes that are enlarged later).
For http://www.martin-thoma.de/write-math/view/?raw_data_id=151305 the classifier predicted the segmentation [[0, 4], [1, 2, 5, 6], [3]].
The yellow and black segmentation does not make any sense, as it would mean there is a symbol between the strokes of another symbol. This should be part of the score of a segmentation.
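A hypothetical check for this, usable as a penalty term in the score (a soft penalty rather than a hard rule, since stroke order is not guaranteed, see above):

```python
# Hypothetical helper (not from the project): flag segmentation
# hypotheses in which the stroke-index range of one symbol contains a
# stroke of another symbol, i.e. a symbol lies "between" the strokes
# of another symbol in drawing order.
def interleaving_pairs(segmentation):
    """segmentation: list of lists of stroke indices."""
    pairs = []
    for i, a in enumerate(segmentation):
        for b in segmentation[i + 1:]:
            # one symbol starts inside the stroke-index range of the other
            if min(a) < min(b) < max(a) or min(b) < min(a) < max(b):
                pairs.append((a, b))
    return pairs

# For the prediction above, interleaving_pairs([[0, 4], [1, 2, 5, 6], [3]])
# is non-empty, so such a hypothesis could be scored down.
```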
Real segmentation: [[0, 2, 16], [1], [3, 4], [5, 6], [7, 8], [9, 10, 11], [12, 13], [14, 15]] (got at place -1)
Predict segmentation: [[0, 4, 6], [1, 2, 3, 5], [7], [8, 10, 11], [9, 15], [12, 13, 14], [16]] (0.00000000)
[157, 2, 498, 59, 14, 33, 4, 4, 3, 1, 4, 14, 13, 0, 0, 2, 11, 277, 1, 1, 1, 1, 5, 4, 74, 1, 1, 1, 4, 2, 14, 18, 97, 1, 12, 2, 0, 1, 11, 2, 49, 1, 197, 12, 49, 2, 13, 360, 6, 5, 14, 21, 5, 15, 3, 7, 0, 9, 326, 123, 39, 7, 1, 3, 23, 4, 1, 12, 264, 63, 14, 4, 1, 2, 3, 6, 146, 8, 0, 13, 406, 3, 0, 1, 1, 3, 157, 0, 0, 158, 46, 6, 1, 33, 2, 1, 1, 0, 0, 4, 1, 4, 1, 0, 1, 5, 3, 4, 3, 0, 10, 1, 73, 11, 0, 1, 26, 9, 3, 0, 240, 1, 2, 27, 7, 1, 4, 8, 105, 0, 306, 98, 0, 1, 10, 57, 122, 0, 12, 2, 0, 485, 0, 29, 2, 17, 1, 2, 1, 312, 1, 44, 10, 13, 0, 23, 1, 37, 406, 4, 0, 2, 2, 4, 3, 0, 4, 3, 9, 0, 0, 2, 10, 20, 4, 21, 105, 0, 3, 0, 19, 0, 88, 3, 1, 3, 0, 1, 1, 13, 1, 20, 0, 1, 4, 5, 22, 1, 1, 36, 7, 0, 1, 2, 85, 38, 1, 9, 62, 2, 3, 1, 25, 6, 0, 1, 1, 0, 1, 0, 26, 9, 0, 127, 0, 0, 0, 2, 1, 10, 0, 9, 0, 7, 1, 0, 12, 2, 0, 60, 15, 0, 155, 0, 0, 14, 7, 2, 1, 0, 1, 0, 1, 15, 0, 0, 3, 12, 11, 23, 34, 0, 1, 1, 0, 2, 1, 189, 84, 0, 10, 0, 4, 3, 0, 0, 450, 0, 12, 163, 10, 3, 7, 1, 0, 4, 0, 113, 3, 49, 0, 1, 26, 1, 0, 4, 3, 49, 11, 1, 0, 0, 1, 95, 1, 76, 76, 1, 1, 1, 4, 12, 0, 0, 0, 0, 30, 17, 2, 0, 16, 15, 497, 13, 360, 181, 0, 1, 49, 54, 13, 1, 4, 2, 45, 1, 7, 1, 4, 0, 0, 2, 2, 4, 114, 8, 1, 3, 199, 1, 1, 0, 7, 0, 2, 212, 9, 10, 178, 1, 31, 1, 1, 41, 0, 0, 207, 1, 1, 173, 4, 6, 4, 1, 4, 1, 65, 0, 0, 3, 1, 184, 1, 8, 1, 5, 4, 0, 13, 10, 1, 1, 148, 2, 42, 2, 7, 1, 16, 0, 49, 1, 38, 1, 289, 3, 14, 37, 0, 0, 24, 133, 14, 4, 0, 8, 0, 0, 47, 1, 4, 1, 1, 3, 43, 104, 2, 0, 1, 28, 2, 379, 12, 57, 1, 79, 3, 7, 14, 499, 4, 2, 1, 241, 0, 0, 2, 0, 1, 0, 0, 3, 178, 52, 2, 0, 156, 269, 0, 0, 2, 3, 0, 4, 0, 51, 60, 166, 0, 0, 1, 11, 6, 3, 5, 77, 0, 0, 7, 0, 10, 4, 25, 0, 5, 96, 0, 317, 85, 7, 101, 0, 19, 332, 23, 13, 116, 37, 13, 1, 13, 0, 1, 26, 192, 7, 4, 50, 1, 2, 0, 0, 1, 86, 2, 127, 347, 0, 16, 0, 3, 7, 223, 309, 13, 36, 0, 1, 1, 1, 1, 1, 3, 4, 258, 0, 384, 0, 11, 1, 495, 317, 4, 2, 9, 202, 9, 87, 3, 265, 6, 146, 4, 173, 12, 1, 4, 103, 21, 33, 3, 13, 0, 352, 330, 51, 334, 3, 6, 0, 34, 1, 0, 0, 6, 1, 8, 8, 9, 9, 9, 0, 1, 1, 191, 1, 1, 0, 10, 22, 8, 175, 0, 1, 0, 14, 376, 1, 24, 17, 1, 26, 164, 1, 0, 5, 27, 1, 196, 0, 0, 31, 11, 494, 0, 0, 4, 3, 0, 13, 10, 47, 3, 0, 386, 11, 3, 0, 0, 0, 3, 3, 8, 0, 1, 2, 0, 30, 4, 10, 3, 10, 304, 33, 0, 1, 1, 1, 12, 0, 39, 1, 0, 0, 0, 0, 0, 1, 0, 3, 36, 0, 3, 20, 11, 4, 0, 0, 194, 347, 98, 0, 140, 445, 0, 3, 0, 3, 2, 182, 0, 1, 2, 4, 20, 1, 167, 13, 1, 0, 1, 0, 200, 0, 4, 12, 0, 68, 14, 3, 0, 0, 0, 50, 0, 4, 10, 35, 49]
2015-04-12 19:49:54,171 INFO mean: 37.82
2015-04-12 19:49:54,172 INFO median: 3.00
2015-04-12 19:49:54,172 INFO TOP-1: 0.16
2015-04-12 19:49:54,172 INFO TOP-3: 0.33
2015-04-12 19:49:54,173 INFO TOP-10: 0.48
2015-04-12 19:49:54,173 INFO TOP-20: 0.56
2015-04-12 19:49:54,173 INFO TOP-50: 0.63
2015-04-12 19:49:54,173 INFO Out of order: 72
2015-04-12 19:49:54,173 INFO Total: 943
Idea:
Search for single-stroke symbols which are not "prefixes" of other symbols. For example, `o` is a prefix of `\oplus`, `l` is a prefix of `t`, and `1` and `3` are prefixes of `B`. Those symbols should also be "stable" in the sense that their variance in stroke count is low. Let B be the group of those basic symbols. Examples of this group might be `\sum`, `\int`, `(`, `)`, `\infty`, `2`, `6`, `8`, `9`, `0`, `a`, `c`, `e`, ...
91 of 374 symbols have a mean stroke count in [1,1.1] with a variance < 0.3. The variance is very likely caused by wild points.
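A minimal sketch of how this group B could be selected from the data (`stroke_counts` is a hypothetical mapping from symbol name to the stroke counts observed in its recordings):

```python
# Sketch (not the project's actual code): pick "stable" basic symbols,
# i.e. symbols with mean stroke count in [1, 1.1] and low variance.
import statistics

def stable_single_stroke_symbols(stroke_counts):
    stable = []
    for symbol, counts in stroke_counts.items():
        if (1.0 <= statistics.mean(counts) <= 1.1
                and statistics.pvariance(counts) < 0.3):
            stable.append(symbol)
    return stable
```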
Let L be the list of strokes which should be partitioned into a list of lists of strokes, where each sublist is one symbol.
Idea: Use a minimum spanning tree (MST) as described in http://www.ai.mit.edu/projects/natural-log/papers/matsakis-MEng-99.pdf#page=37
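A minimal sketch of such an MST approach, assuming strokes are given as normalized (x, y) point arrays (the cut threshold is a made-up parameter, not from the thesis):

```python
# Sketch: build an MST over stroke centers and cut long edges;
# strokes that stay connected form one symbol hypothesis.
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_segmentation(strokes, cut_threshold=0.2):
    """strokes: list of point arrays of shape (k, 2), coordinates
    normalized so that the recording fits into the unit square."""
    centers = np.array([np.mean(s, axis=0) for s in strokes])
    dist = squareform(pdist(centers))            # pairwise center distances
    mst = minimum_spanning_tree(dist).toarray()  # MST as dense matrix
    mst[mst > cut_threshold] = 0                 # cut long edges between symbols
    n_components, labels = connected_components(mst + mst.T, directed=False)
    return [[i for i, c in enumerate(labels) if c == comp]
            for comp in range(n_components)]
```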
Ideas:
`ħ` is essentially an `h`, but the bar is added later. (`j` and `i` are not so much of a problem, because there is no symbol that a `j` without its dot could be confused with.)
Other problems:
`\doteq` vs. `\stackrel{!}{=}`
Some first results with the filter `prob_sum > 0.95 and not any([el for el in bbintersections[i]]) and len(stroke) > 2 and predictions[0]['semantics'].split(';')[1] != '-'`:
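A guessed reading of that filter (the surrounding code is not shown here; names follow the quoted condition, and the interpretation in the comments is an assumption):

```python
# Guessed reading of the filter above ("semantics" is assumed to have
# the form "id;latex;...").
def accept_single_stroke_symbol(prob_sum, bbintersections_i, stroke, predictions):
    return (prob_sum > 0.95                # classifier is confident
            and not any(bbintersections_i) # bounding box intersects no other stroke
            and len(stroke) > 2            # more than 2 points, i.e. not a wild point
            and predictions[0]['semantics'].split(';')[1] != '-')  # top hit is not a bar/minus
```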
2015-05-10 20:00:41,780 INFO TOP-1: 0.27
2015-05-10 20:00:41,782 INFO Over-segmented: 0.21
2015-05-10 20:00:41,782 INFO Under-segmented: 0.66
2015-05-10 20:00:41,782 INFO overUnder-segmented: 0.14
2015-05-10 20:00:41,782 INFO Out of order: 113
2015-05-10 20:00:41,782 INFO Total: 1856
2015-05-10 20:20:14,914 INFO TOP-1: 0.25
2015-05-10 20:20:14,915 INFO Over-segmented: 0.20
2015-05-10 20:20:14,915 INFO Under-segmented: 0.69
2015-05-10 20:20:14,916 INFO overUnder-segmented: 0.13
2015-05-10 20:20:14,916 INFO Out of order: 113
2015-05-10 20:20:14,916 INFO Total: 1856
2015-05-10 20:26:53,781 INFO TOP-1: 0.20
2015-05-10 20:26:53,782 INFO Over-segmented: 0.16
2015-05-10 20:26:53,782 INFO Under-segmented: 0.76
2015-05-10 20:26:53,782 INFO overUnder-segmented: 0.12
2015-05-10 20:26:53,782 INFO Out of order: 113
2015-05-10 20:26:53,783 INFO Total: 1856
Classify symbols stroke-by-stroke. Let's say there are n symbols in total. Given are the complete recording `complete_data`, the newly entered stroke `new_data`, and the old list of strokes `old_data` (`old_data + new_data = complete_data`), as well as a list of classification hypotheses `old_segmentation_hyp`. This is a list of dictionaries with the keys `probability` and `segmentation`; `segmentation` is itself a list of dictionaries, each containing the keys `symbol`, `strokes` and `probability`. Now one of two things could have happened:
For a new symbol, it is more likely that it is at the very right side of the bounding box of `complete_data`. It is also very likely that the new stroke did not cross any old stroke.
If the new stroke belongs to an existing symbol, then it has to be "next" to that symbol. "Next" in this context means there is no other symbol in between (though it might be around it, e.g. `\sqrt`).
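A minimal sketch of this update step, using the hypothesis format described above; `p_new_symbol`, `p_attach` and the beam width are hypothetical placeholders (e.g. scores based on the bounding-box observations above), not from the project:

```python
def p_new_symbol(hyp, stroke):
    return 0.5  # placeholder: high if stroke is right of the bounding box

def p_attach(symbol, stroke):
    return 0.5  # placeholder: high if stroke is "next" to the symbol

def update_hypotheses(old_segmentation_hyp, new_stroke, stroke_index, beam_width=10):
    new_hyps = []
    for hyp in old_segmentation_hyp:
        # Case 1: the new stroke starts a new symbol.
        seg = [dict(s, strokes=list(s['strokes'])) for s in hyp['segmentation']]
        seg.append({'symbol': None, 'strokes': [stroke_index], 'probability': 1.0})
        new_hyps.append({'probability': hyp['probability'] * p_new_symbol(hyp, new_stroke),
                         'segmentation': seg})
        # Case 2: the new stroke extends one of the existing symbols.
        for i, symbol in enumerate(hyp['segmentation']):
            seg = [dict(s, strokes=list(s['strokes'])) for s in hyp['segmentation']]
            seg[i]['strokes'].append(stroke_index)
            new_hyps.append({'probability': hyp['probability'] * p_attach(symbol, new_stroke),
                             'segmentation': seg})
    new_hyps.sort(key=lambda h: h['probability'], reverse=True)
    return new_hyps[:beam_width]  # keep only the most probable hypotheses
```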
## 325119
Real segmentation: [[0, 1], [2, 3]] (got at place -1)
Predict segmentation: [[0], [1], [2, 3]] (1.00000000)
Segmentation took 0.0561 seconds.
over-segmented
## 325266
Real segmentation: [[0], [1, 2], [3], [4, 5]] (got at place -1)
Predict segmentation: [[0], [1, 2], [3], [4], [5]] (1.00000000)
Segmentation took 0.0890 seconds.
over-segmented
## 325283
Real segmentation: [[0, 1], [2]] (got at place -1)
Predict segmentation: [[0, 2], [1]] (1.00000000)
Segmentation took 0.0451 seconds.
over-segmented
under-segmented
## 325540
Real segmentation: [[0], [1, 2], [3, 4], [5], [6]] (got at place -1)
Predict segmentation: [[0, 6], [1, 2, 3], [4], [5]] (1.00000000)
Segmentation took 0.0978 seconds.
over-segmented
under-segmented
## 325545
Real segmentation: [[0], [1], [2, 3], [4], [5], [6, 7], [8]] (got at place -1)
Predict segmentation: [[0], [1, 2], [3], [4], [5], [6], [7, 8]] (1.00000000)
Segmentation took 0.1346 seconds.
over-segmented
under-segmented
## 2800 ################################################################################
## 325682
Real segmentation: [[0, 1], [2]] (got at place -1)
Predict segmentation: [[0], [1], [2]] (1.00000000)
Segmentation took 0.0481 seconds.
over-segmented
## 325689
Real segmentation: [[0, 1], [2, 3], [4], [5, 6]] (got at place -1)
Predict segmentation: [[0, 1, 2, 3], [4], [5], [6]] (1.00000000)
Segmentation took 0.1113 seconds.
over-segmented
under-segmented
## 325792
Real segmentation: [[0], [1, 2], [3], [4, 5], [6], [7], [8, 9], [10]] (got at place -1)
Predict segmentation: [[0], [1], [2], [3], [4, 5], [6], [7, 10], [8, 9]] (1.00000000)
Segmentation took 0.1735 seconds.
over-segmented
under-segmented
## 325915
Real segmentation: [[0, 1], [2]] (got at place -1)
Predict segmentation: [[0], [1], [2]] (1.00000000)
Segmentation took 0.0672 seconds.
over-segmented
2015-06-07 19:39:39,710 INFO mean: 0.00
2015-06-07 19:39:39,711 INFO median: 0.00
2015-06-07 19:39:39,711 INFO TOP-1: 0.20
2015-06-07 19:39:39,711 INFO TOP-3: 0.20
2015-06-07 19:39:39,712 INFO TOP-10: 0.20
2015-06-07 19:39:39,712 INFO TOP-20: 0.20
2015-06-07 19:39:39,712 INFO TOP-50: 0.20
2015-06-07 19:39:39,712 INFO Over-segmented: 0.12
2015-06-07 19:39:39,712 INFO Under-segmented: 0.76
2015-06-07 19:39:39,712 INFO overUnder-segmented: 0.08
2015-06-07 19:39:39,712 INFO Out of order: 172
2015-06-07 19:39:39,713 INFO Total: 2838
Top 2 > 0.95 instead of top 3:
2015-06-08 20:55:56,451 INFO mean: 0.00
2015-06-08 20:55:56,465 INFO median: 0.00
2015-06-08 20:55:56,466 INFO TOP-1: 0.18
2015-06-08 20:55:56,466 INFO TOP-3: 0.18
2015-06-08 20:55:56,466 INFO TOP-10: 0.18
2015-06-08 20:55:56,466 INFO TOP-20: 0.18
2015-06-08 20:55:56,467 INFO TOP-50: 0.18
2015-06-08 20:55:56,467 INFO Over-segmented: 0.09
2015-06-08 20:55:56,467 INFO Under-segmented: 0.79
2015-06-08 20:55:56,467 INFO overUnder-segmented: 0.07
2015-06-08 20:55:56,467 INFO Out of order: 172
2015-06-08 20:55:56,468 INFO Total: 2838
Top 4 > 0.95 instead of top 3:
2015-06-08 21:01:28,281 INFO mean: 0.00
2015-06-08 21:01:28,282 INFO median: 0.00
2015-06-08 21:01:28,282 INFO TOP-1: 0.22
2015-06-08 21:01:28,282 INFO TOP-3: 0.22
2015-06-08 21:01:28,282 INFO TOP-10: 0.22
2015-06-08 21:01:28,282 INFO TOP-20: 0.22
2015-06-08 21:01:28,283 INFO TOP-50: 0.22
2015-06-08 21:01:28,283 INFO Over-segmented: 0.14
2015-06-08 21:01:28,283 INFO Under-segmented: 0.73
2015-06-08 21:01:28,283 INFO overUnder-segmented: 0.09
2015-06-08 21:01:28,283 INFO Out of order: 172
2015-06-08 21:01:28,283 INFO Total: 2838
Top 5 > 0.95 instead of top 3:
2015-06-08 21:07:36,789 INFO mean: 0.00
2015-06-08 21:07:36,790 INFO median: 0.00
2015-06-08 21:07:36,790 INFO TOP-1: 0.23
2015-06-08 21:07:36,791 INFO TOP-3: 0.23
2015-06-08 21:07:36,791 INFO TOP-10: 0.23
2015-06-08 21:07:36,791 INFO TOP-20: 0.23
2015-06-08 21:07:36,791 INFO TOP-50: 0.23
2015-06-08 21:07:36,791 INFO Over-segmented: 0.16
2015-06-08 21:07:36,791 INFO Under-segmented: 0.71
2015-06-08 21:07:36,791 INFO overUnder-segmented: 0.09
2015-06-08 21:07:36,792 INFO Out of order: 172
2015-06-08 21:07:36,792 INFO Total: 2838
Top 6 > 0.95 instead of top 3:
2015-06-08 21:14:14,411 INFO mean: 0.00
2015-06-08 21:14:14,412 INFO median: 0.00
2015-06-08 21:14:14,412 INFO TOP-1: 0.24
2015-06-08 21:14:14,412 INFO TOP-3: 0.24
2015-06-08 21:14:14,412 INFO TOP-10: 0.24
2015-06-08 21:14:14,413 INFO TOP-20: 0.24
2015-06-08 21:14:14,413 INFO TOP-50: 0.24
2015-06-08 21:14:14,413 INFO Over-segmented: 0.17
2015-06-08 21:14:14,413 INFO Under-segmented: 0.69
2015-06-08 21:14:14,413 INFO overUnder-segmented: 0.10
2015-06-08 21:14:14,413 INFO Out of order: 172
2015-06-08 21:14:14,413 INFO Total: 2838
Top 10 > 0.95 instead of top 3:
2015-06-08 21:19:49,930 INFO mean: 0.00
2015-06-08 21:19:49,931 INFO median: 0.00
2015-06-08 21:19:49,931 INFO TOP-1: 0.25
2015-06-08 21:19:49,931 INFO TOP-3: 0.25
2015-06-08 21:19:49,932 INFO TOP-10: 0.25
2015-06-08 21:19:49,932 INFO TOP-20: 0.25
2015-06-08 21:19:49,932 INFO TOP-50: 0.25
2015-06-08 21:19:49,932 INFO Over-segmented: 0.19
2015-06-08 21:19:49,932 INFO Under-segmented: 0.67
2015-06-08 21:19:49,932 INFO overUnder-segmented: 0.10
2015-06-08 21:19:49,932 INFO Out of order: 172
2015-06-08 21:19:49,933 INFO Total: 2838
Top 20 > 0.95 instead of top 3:
2015-06-08 22:33:33,301 INFO mean: 0.00
2015-06-08 22:33:33,302 INFO median: 0.00
2015-06-08 22:33:33,302 INFO TOP-1: 0.26
2015-06-08 22:33:33,302 INFO TOP-3: 0.26
2015-06-08 22:33:33,303 INFO TOP-10: 0.26
2015-06-08 22:33:33,303 INFO TOP-20: 0.26
2015-06-08 22:33:33,303 INFO TOP-50: 0.26
2015-06-08 22:33:33,303 INFO Over-segmented: 0.20
2015-06-08 22:33:33,303 INFO Under-segmented: 0.66
2015-06-08 22:33:33,303 INFO overUnder-segmented: 0.11
2015-06-08 22:33:33,303 INFO Out of order: 172
2015-06-08 22:33:33,304 INFO Total: 2838
./segmentation.py 284,03s user 3,10s system 96% cpu 4:56,26 total
All instead of cropping at top 3:
2015-06-08 22:46:18,547 INFO mean: 0.00
2015-06-08 22:46:18,548 INFO median: 0.00
2015-06-08 22:46:18,549 INFO TOP-1: 0.27
2015-06-08 22:46:18,549 INFO TOP-3: 0.27
2015-06-08 22:46:18,549 INFO TOP-10: 0.27
2015-06-08 22:46:18,549 INFO TOP-20: 0.27
2015-06-08 22:46:18,550 INFO TOP-50: 0.27
2015-06-08 22:46:18,550 INFO Over-segmented: 0.20
2015-06-08 22:46:18,550 INFO Under-segmented: 0.65
2015-06-08 22:46:18,550 INFO overUnder-segmented: 0.11
2015-06-08 22:46:18,550 INFO Out of order: 172
2015-06-08 22:46:18,552 INFO Total: 2838
./segmentation.py 299,81s user 3,90s system 95% cpu 5:18,39 total
Top 1 > 0.95 instead of top 3:
2015-06-08 23:04:08,941 INFO mean: 879844.96
2015-06-08 23:04:08,942 INFO median: 1000000.00
2015-06-08 23:04:08,942 INFO TOP-1: 0.12
2015-06-08 23:04:08,943 INFO TOP-3: 0.12
2015-06-08 23:04:08,943 INFO TOP-10: 0.12
2015-06-08 23:04:08,943 INFO TOP-20: 0.12
2015-06-08 23:04:08,943 INFO TOP-50: 0.12
2015-06-08 23:04:08,944 INFO Over-segmented: 0.01
2015-06-08 23:04:08,944 INFO Under-segmented: 0.88
2015-06-08 23:04:08,944 INFO overUnder-segmented: 0.01
2015-06-08 23:04:08,944 INFO Out of order: 172
2015-06-08 23:04:08,944 INFO Total: 2838
All
2015-06-08 23:12:56,249 INFO mean: 734672.30
2015-06-08 23:12:56,250 INFO median: 1000000.00
2015-06-08 23:12:56,250 INFO TOP-1: 0.27
2015-06-08 23:12:56,250 INFO TOP-3: 0.27
2015-06-08 23:12:56,251 INFO TOP-10: 0.27
2015-06-08 23:12:56,251 INFO TOP-20: 0.27
2015-06-08 23:12:56,251 INFO TOP-50: 0.27
2015-06-08 23:12:56,251 INFO Over-segmented: 0.20
2015-06-08 23:12:56,251 INFO Under-segmented: 0.65
2015-06-08 23:12:56,251 INFO overUnder-segmented: 0.11
2015-06-08 23:12:56,252 INFO Out of order: 172
2015-06-08 23:12:56,252 INFO Total: 2838
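Summary of the sweep above (all runs on the same 2838 recordings, 172 of which are out of order):

| Hypotheses kept (prob. > 0.95) | TOP-1 | Over-segmented | Under-segmented |
|---|---|---|---|
| top 1 | 0.12 | 0.01 | 0.88 |
| top 2 | 0.18 | 0.09 | 0.79 |
| top 3 (baseline) | 0.20 | 0.12 | 0.76 |
| top 4 | 0.22 | 0.14 | 0.73 |
| top 5 | 0.23 | 0.16 | 0.71 |
| top 6 | 0.24 | 0.17 | 0.69 |
| top 10 | 0.25 | 0.19 | 0.67 |
| top 20 | 0.26 | 0.20 | 0.66 |
| all | 0.27 | 0.20 | 0.65 |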
What about using a varying-size bounding box? Physically and historically, symbols are separated because otherwise the result would become unreadable. I think it's a good idea to compute a bounding box while a symbol is being drawn, whose size is adjusted by the current stroke length. When the pen is lifted, one has to decide to either "start a new symbol" or "extend the existing symbol". This decision can be based on a bounding box whose shape can be trained, but whose size is dynamic. For example, when drawing the integral symbol, the bounding box is a horizontally thin and vertically tall rectangle that would be recognized by the network, but the size itself is extracted from the current symbol's size. When a new stroke is produced and it intersects a previous stroke's bounding box, you can likely concatenate it with the previous symbol.
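A minimal sketch of that test (the expansion factor is a made-up parameter):

```python
# Sketch of the dynamic bounding-box idea from the comment above.
def expanded_bbox(stroke, factor=0.5):
    """Axis-aligned bounding box of a stroke, grown in proportion to the
    stroke's own width and height ("dynamic size")."""
    xs = [p[0] for p in stroke]
    ys = [p[1] for p in stroke]
    pad_x = factor * (max(xs) - min(xs))
    pad_y = factor * (max(ys) - min(ys))
    return (min(xs) - pad_x, min(ys) - pad_y, max(xs) + pad_x, max(ys) + pad_y)

def bboxes_intersect(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def belongs_to_previous_symbol(new_stroke, previous_stroke, factor=0.5):
    # concatenate when the new stroke touches the previous stroke's
    # dynamically sized bounding box
    return bboxes_intersect(expanded_bbox(new_stroke, 0.0),
                            expanded_bbox(previous_stroke, factor))
```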
This is an example of segmentation (image missing): each symbol is drawn in one color; different colors mean different symbols.
Features
A classifier which has to decide, for two given strokes, whether they belong to the same symbol could use the following features:
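The feature list itself is not part of this excerpt; purely as an illustration (all names hypothetical), such a pairwise classifier might use geometric and temporal features like these:

```python
import math

def pair_features(s1, s2, t_end_s1, t_start_s2):
    """Illustrative features for two strokes (lists of (x, y) points);
    t_end_s1 / t_start_s2 are the pen-up / pen-down times of the strokes."""
    c1 = (sum(p[0] for p in s1) / len(s1), sum(p[1] for p in s1) / len(s1))
    c2 = (sum(p[0] for p in s2) / len(s2), sum(p[1] for p in s2) / len(s2))
    return {
        # small center distance hints at "same symbol"
        'center_distance': math.hypot(c1[0] - c2[0], c1[1] - c2[1]),
        # overlapping x-ranges hint at stacked parts (e.g. the bar of ħ)
        'horizontal_overlap': min(max(p[0] for p in s1), max(p[0] for p in s2))
                              - max(min(p[0] for p in s1), min(p[0] for p in s2)),
        # a long pause between strokes hints at a new symbol
        'time_gap': t_start_s2 - t_end_s1,
    }
```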
Problems