Hi deepseek--this is an interesting question!
If I were to approach this problem using our model out-of-the-box, I would do something similar to our rendering application. That is, I would use the saliency maps output by our model to calculate the cumulative heatmap density that falls within each dialog bubble at each duration, and order the bubbles based on when they accumulate a certain amount of heatmap density. If you wanted more granularity than three time steps, you could either a) increase the number of LSTM cells in the model to produce more outputs, or b) interpolate between the three maps given (that way, you would not need 40 output durations to order 40 bubbles).
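To make that concrete, here is a minimal sketch of the ordering heuristic just described. The function name, inputs (one map per duration plus binary bubble masks), and the density threshold are all illustrative, not part of our codebase:

```python
import numpy as np

def order_bubbles(saliency_maps, durations, bubble_masks, threshold=0.1):
    """Order bubbles by the first duration at which at least
    `threshold` of the heatmap's total density falls inside them."""
    first_time = []
    for mask in bubble_masks:
        t_hit = float("inf")
        for t, smap in zip(durations, saliency_maps):
            # fraction of this duration's total density inside the bubble
            density = (smap * mask).sum() / max(smap.sum(), 1e-8)
            if density >= threshold:
                t_hit = t
                break
        first_time.append(t_hit)
    # bubbles that accumulate density earlier come first in reading order
    return sorted(range(len(bubble_masks)), key=lambda i: first_time[i])
```

With only three predicted durations, ties between bubbles are likely, which is where options a) and b) above come in.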
However, as you intuit, there are some issues with this approach. For one, we collected our data and trained our models using a "free-viewing" task, where we asked people to explore a natural image without giving them a specific task. Someone reading a comic book has the "task" of reading the text bubbles, which will probably change their exploration patterns. For example, there is a convention that text is read left-to-right, so high-level processes might direct attention in a left-to-right, top-to-bottom direction. Also, our data was collected on natural images, so the saliency of text may not be accurately measured. There has been work looking at the saliency of graphs and visualizations (see here and here), and in another paper we used CodeCharts to collect data on graphic designs. It would be interesting to train a multi-duration saliency model on graphic designs!
One application that would be enabled by our model is suggesting where to place text bubbles or elements of the artwork so that a reader's attention is naturally drawn to the bubbles in the correct order. For example, putting an object that is salient at 0.5 seconds next to the third text bubble might inadvertently cause readers to read the text in the wrong order. Our model could be used to try to align the multi-duration saliency maps of the artwork itself with the ordering of the bubbles.
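As a toy illustration of that alignment check, one could compare the intended bubble order against the order recovered from the saliency maps (e.g. by the `order_bubbles` sketch above) with a rank correlation; again, all names here are hypothetical:

```python
import numpy as np

def rank_agreement(intended, predicted):
    # intended, predicted: bubble indices in reading order,
    # e.g. intended=[0, 1, 2] vs. predicted=order_bubbles(...)
    n = len(intended)
    pos_a = np.empty(n, dtype=int)
    pos_b = np.empty(n, dtype=int)
    pos_a[list(intended)] = np.arange(n)   # rank of each bubble (intended)
    pos_b[list(predicted)] = np.arange(n)  # rank of each bubble (predicted)
    # Spearman correlation of the two rank vectors:
    # 1.0 = perfect agreement, -1.0 = reversed reading order
    return float(np.corrcoef(pos_a, pos_b)[0, 1])
```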
Using the rendering approach, wouldn't it be simpler to just segment a single dialog bubble at each duration, thus producing a dialog sequence in reading order?
Does your model take into consideration the previously predicted semantic segments? That is, after predicting the first dialog bubble, could the second be predicted without the first (a single class with multiple durations)?
What is the inference performance, i.e. how long it takes to predict the renderings per duration and in total, for both GPU and CPU inference?
How would one increase the number of LSTM cells?
How would one interpolate between the three maps given?
You could segment a single element at each duration, but you would have to support as many durations as there are elements you want to segment. Since our model was trained on human data, it does not guarantee that objects/regions that were viewed at one duration will not be viewed again at a later duration.
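Regarding the interpolation question: here is a minimal sketch, assuming simple linear blending between the two neighbouring duration maps (the real time course of attention need not be linear, so treat this as a rough approximation):

```python
import numpy as np

def interpolate_map(saliency_maps, durations, t):
    """Approximate a saliency map at an arbitrary duration t by
    linearly blending the two nearest predicted maps."""
    maps = np.stack(saliency_maps)         # e.g. (3, H, W) for 0.5s/3s/5s
    ts = np.asarray(durations, dtype=float)
    if t <= ts[0]:
        return maps[0]
    if t >= ts[-1]:
        return maps[-1]
    i = np.searchsorted(ts, t) - 1         # left neighbour of t
    w = (t - ts[i]) / (ts[i + 1] - ts[i])  # blend weight in [0, 1]
    return (1.0 - w) * maps[i] + w * maps[i + 1]
```

Feeding such interpolated maps into the ordering step sketched earlier would give a finer-grained ordering without adding output durations to the model.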
Does this mean that the trained model is some form of "imitation learning"? Also, answering the other questions above would be great.
Please see our paper for details on the model and training: http://multiduration-saliency.csail.mit.edu/documents/multiduration_saliency.pdf
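To give a rough idea of what option a) could look like, here is a generic Keras sketch of a decoder head that emits one saliency map per time step, so that increasing `num_durations` yields more output maps. This is not our actual architecture (see the paper for that), the layer sizes are arbitrary, and extra time steps only help if you have ground-truth fixation data at those durations to train against:

```python
import tensorflow as tf

def multi_duration_head(features, num_durations=3):
    # Repeat the encoder features once per duration so a ConvLSTM
    # can emit one map per time step.
    x = tf.keras.layers.Lambda(
        lambda f: tf.repeat(f[:, None, ...], num_durations, axis=1)
    )(features)
    x = tf.keras.layers.ConvLSTM2D(
        filters=32, kernel_size=3, padding="same", return_sequences=True
    )(x)
    # One single-channel saliency map per duration.
    return tf.keras.layers.TimeDistributed(
        tf.keras.layers.Conv2D(1, 1, activation="sigmoid")
    )(x)

inp = tf.keras.Input(shape=(64, 64, 256))  # placeholder feature shape
model = tf.keras.Model(inp, multi_duration_head(inp, num_durations=5))
```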
Hi there,
Take this example: I want to predict the order of the dialog in a comic book, basically telling which dialog box is first, which is second, third, and so on.