Closed by ghost 4 years ago
Hi @deepseek, thanks for your interest in our work :)
Correct me if I'm wrong, but the task you're proposing seems conceptually different from the aim of our architecture. You want to assign a numeric label to each box given the boxes' coordinates, their content and maybe some visual features, while our architecture learns to generate plausible, brand new examples (in the very same space as the input) by learning the distribution that underlies the data.
Your task can surely be solved with RNNs and (possibly) attentive or Graph Neural Networks: it looks closer to the Natural Language Processing side, where these approaches have been used extensively. DAG-Net, however, especially in its generative part (the Recurrent VAE), doesn't suit your purpose.
Alex
The task might seem unrelated at first, but if you think about it, the task is basically to tell which dialog box comes next in the order. Framing it as trajectory prediction means drawing a line from the first dialog box to the last one, so the model can be trained on it like any other trajectory.
Do you recommend any specific implementation of the networks that you suggested? Note that this is NOT Natural Language Processing; it's predicting the structure or order of things.
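To make the trajectory framing concrete, here is a minimal sketch of how an already ordered set of dialog boxes could be turned into a trajectory of box centers plus per-step displacements. The box format `(x_min, y_min, x_max, y_max)` and the function names are assumptions made for illustration; none of this comes from the DAG-Net code.

```python
from typing import List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in page coordinates.
Box = Tuple[float, float, float, float]
Point = Tuple[float, float]


def box_center(box: Box) -> Point:
    """Return the center point of a box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def boxes_to_trajectory(ordered_boxes: List[Box]) -> List[Point]:
    """Map boxes, already sorted in reading order, to a sequence of centers."""
    return [box_center(b) for b in ordered_boxes]


def trajectory_to_steps(trajectory: List[Point]) -> List[Point]:
    """Displacement between consecutive centers (one entry fewer than the input)."""
    return [
        (x2 - x1, y2 - y1)
        for (x1, y1), (x2, y2) in zip(trajectory, trajectory[1:])
    ]


if __name__ == "__main__":
    page = [(0, 0, 2, 2), (4, 0, 6, 2), (0, 4, 2, 6)]
    traj = boxes_to_trajectory(page)
    print(traj)
    print(trajectory_to_steps(traj))
```

Each annotated page then gives one training trajectory. Whether a trajectory model can learn anything useful from such sequences is exactly the open question in this thread.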
> Note that this is NOT Natural Language Processing; it's predicting the structure or order of things.
Uh, sorry, I took for granted that you also wanted to give the network the boxes' contents and extract their order from that information. That's why I brought up NLP :)
> The task might seem unrelated at first, but if you think about it, the task is basically to tell which dialog box comes next in the order. Framing it as trajectory prediction means drawing a line from the first dialog box to the last one, so the model can be trained on it like any other trajectory.
If you limit the input to the locations (the xy coordinates) of the boxes inside the page, the network could maybe come up with some results. It all depends on whether the boxes' locations are well characterized by a given distribution, which is hard to tell a priori. Without any particular experience on this, I would say that if the network did succeed in extracting some meaningful information, it would cope only with naive distributions / orders (left to right, top to bottom, as we would normally read the boxes across subsequent panels) and struggle with more complex layouts.
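For reference, the naive order mentioned above (left to right, top to bottom) needs no learning at all and can serve as a baseline. The sketch below is only an illustration under stated assumptions: boxes are `(x_min, y_min, x_max, y_max)` tuples, y grows downward as in image coordinates, and `row_tolerance` is a made-up parameter that decides when two boxes count as being on the same row.

```python
from typing import List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max), y increasing downward.
Box = Tuple[float, float, float, float]


def naive_reading_order(boxes: List[Box], row_tolerance: float = 0.5) -> List[int]:
    """Return box indices sorted top to bottom, then left to right.

    Boxes whose vertical centers lie within row_tolerance * box height of
    the first box in the current row are treated as the same row.
    """

    def center_y(i: int) -> float:
        return (boxes[i][1] + boxes[i][3]) / 2.0

    # Visit boxes from top to bottom and group them greedily into rows.
    rows: List[Tuple[float, List[int]]] = []
    for i in sorted(range(len(boxes)), key=center_y):
        height = boxes[i][3] - boxes[i][1]
        if rows and abs(center_y(i) - rows[-1][0]) <= row_tolerance * height:
            rows[-1][1].append(i)
        else:
            rows.append((center_y(i), [i]))

    # Inside each row, read from left to right.
    order: List[int] = []
    for _, members in rows:
        order.extend(sorted(members, key=lambda i: boxes[i][0]))
    return order


if __name__ == "__main__":
    page = [(60, 10, 100, 30), (10, 12, 50, 32), (10, 60, 50, 80)]
    print(naive_reading_order(page))  # → [1, 0, 2]
```

A learned model would have to beat this heuristic on the complex layouts where the row grouping breaks down, for example overlapping or diagonally arranged boxes.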
@alexmonti19 Thanks for your hard work!
Take this example: I want to predict the order of the dialog in a comic book, basically telling which dialog box is 1st, which is 2nd, 3rd, etc.
Note that I already have the dialog boxes' locations and coordinates detected; now I only want to predict their reading order.