What you get is what you see: A visual markup decompiler
#19 (Open) · flrngel opened 6 years ago

flrngel commented 6 years ago
https://arxiv.org/abs/1609.04938
1. Abstract
The model is end-to-end
It combines a convolutional network with recurrent networks
Existing systems reach about 25% accuracy on this task, while the paper's model achieves about 75%
2. Introduction
OCR requires joint processing of image and text data
WYGIWYS is a simple extension of the attention-based encoder-decoder model
The paper introduces the IM2LATEX-100K dataset
3. Problem: image-to-markup generation
The authors define the image-to-markup problem as converting a rendered source image into target presentational markup
4. Model
Convolutional Network
The convolutional network does not use a fully connected layer
This preserves the locality of the CNN features so that visual attention can be applied over the feature grid
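A minimal sketch of this idea, assuming a PyTorch-style conv stack (layer sizes are illustrative, not the paper's exact configuration): the network ends with a spatial feature grid rather than a flattened vector, so each grid position can later be attended to.

```python
import torch
import torch.nn as nn

class ConvFeatureGrid(nn.Module):
    """Conv stack with no fully connected layer; output keeps its H' x W' grid."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
        )

    def forward(self, img):           # img: (B, 1, H, W) rendered formula image
        grid = self.features(img)     # (B, C, H', W') -- no flatten, no nn.Linear
        return grid                   # locality preserved for visual attention

# grid = ConvFeatureGrid()(torch.zeros(2, 1, 64, 256))  # -> torch.Size([2, 256, 16, 64])
```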
Row Encoder
Show, Attend and Tell shows that an image feature grid can be fed directly into the decoder
However, the target markup contains significant relative sequential-order information, so running an RNN over each row of the feature grid helps:
the left-to-right order can be learned by the encoder
the RNN can use the surrounding horizontal context to refine each hidden representation
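A sketch of such a row encoder, again assuming PyTorch (hidden size and bidirectionality are illustrative): every row of the CNN grid is treated as a left-to-right sequence, so each position absorbs horizontal context before the decoder attends to it.

```python
import torch
import torch.nn as nn

class RowEncoder(nn.Module):
    """Run an RNN over each row of the conv feature grid."""
    def __init__(self, feat_dim=256, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, grid):                  # grid: (B, C, H', W') from the conv net
        B, C, H, W = grid.shape
        rows = grid.permute(0, 2, 3, 1)       # (B, H', W', C): one sequence per row
        rows = rows.reshape(B * H, W, C)      # treat each row as an independent sequence
        encoded, _ = self.rnn(rows)           # (B*H', W', 2*hidden)
        return encoded.reshape(B, H, W, -1)   # refined feature grid for attention
```

In the paper, the initial hidden state of each row's RNN is also learned per row, which is how vertical position information enters the representation.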
Decoder
The decoder uses an attention model (Bahdanau attention)
Beam search is used at test time
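A sketch of one Bahdanau-style (additive) attention step for the decoder, assuming PyTorch with illustrative dimensions (not the paper's exact sizes):

```python
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    """Additive attention over the flattened, row-encoded feature grid."""
    def __init__(self, dec_dim=512, enc_dim=512, attn_dim=256):
        super().__init__()
        self.w_dec = nn.Linear(dec_dim, attn_dim)
        self.w_enc = nn.Linear(enc_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, dec_state, enc_grid):
        # dec_state: (B, dec_dim) current decoder hidden state
        # enc_grid:  (B, H'*W', enc_dim) flattened encoder feature grid
        scores = self.v(torch.tanh(
            self.w_dec(dec_state).unsqueeze(1) + self.w_enc(enc_grid)))  # (B, H'*W', 1)
        alpha = torch.softmax(scores, dim=1)                             # attention weights
        context = (alpha * enc_grid).sum(dim=1)                          # (B, enc_dim)
        return context, alpha.squeeze(-1)
```

At test time, instead of greedy decoding, beam search keeps the k highest-scoring partial markup sequences at each step and expands them in parallel.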
5. Dataset
Tokenization
Character-based models did not perform as well as tokenized models
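To illustrate the difference, here is a toy comparison of character-level versus token-level splitting of LaTeX (a crude regex tokenizer for illustration, not the paper's actual preprocessing pipeline):

```python
import re

formula = r"\frac{x^2}{\sqrt{y}}"

char_tokens = list(formula)                         # ['\\', 'f', 'r', 'a', 'c', '{', ...]
latex_tokens = re.findall(r"\\[A-Za-z]+|\S", formula)
# ['\\frac', '{', 'x', '^', '2', '}', '{', '\\sqrt', '{', 'y', '}', '}']
print(char_tokens)
print(latex_tokens)
```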
Optional: Normalization
A modified KaTeX parser is used to produce normalized input data
My Notes
Each GitHub implementation of this paper uses a different loss function