dginev / ar5iv

A web service offering HTML5 articles from arXiv.org as converted with latexml
https://ar5iv.org
MIT License

Improve article 2207.06991 #315

Open dginev opened 1 year ago

dginev commented 1 year ago

Exact location of issue

As reported on social media: undefined macros in the main body of the document.

Separately, there is a fidelity issue around Algorithm 1 that is worth considering in general for minipages + floats. Namely, latexml adds a width attribute to the minipage's ltx_para, constraining it on a wide screen, but our current CSS makes no attempt to place the algorithm listing to the right of the paragraph. Ideally, this would be another use for flexbox: the paragraph and algorithm minipages would sit side-by-side on wide displays and reflow into a vertical stack on mobile.
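
For reference, a minimal sketch of what that flexbox treatment could look like. The wrapper class is hypothetical and the ltx_* selectors are assumptions about the markup emitted for this particular wrapfigure/minipage, so this is a starting point rather than a drop-in rule set:

/* Sketch only: .ltx_flex_figure_wrap is a hypothetical wrapper; the ltx_*
   selectors would need to be checked against the actual HTML for 2207.06991. */
.ltx_flex_figure_wrap {
  display: flex;
  flex-wrap: wrap;          /* lets the minipage drop below the text when narrow */
  gap: 1em;
  align-items: flex-start;
}

/* Running text takes the remaining width on wide screens. */
.ltx_flex_figure_wrap > .ltx_para {
  flex: 1 1 24em;
}

/* The algorithm minipage keeps roughly its 0.35\textwidth share. */
.ltx_flex_figure_wrap > .ltx_minipage {
  flex: 0 1 35%;
}

/* Narrow viewports: reflow to a vertical stack. */
@media (max-width: 40em) {
  .ltx_flex_figure_wrap > .ltx_para,
  .ltx_flex_figure_wrap > .ltx_minipage {
    flex-basis: 100%;
  }
}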

Problem details

Conversion complete: 4 warnings; 2 errors;
2 undefined macros[\textcommabelow, \euflag];
2 missing files[inconsolata.sty, euflag.sty]

dginev commented 1 year ago

Minipage with algorithm is at: https://ar5iv.labs.arxiv.org/html/2207.06991#S2.SS2.SSS0.Px2.p2

The source TeX in question is:

\vspace{-2mm}
\paragraph{Patch Embeddings} The images produced by the text renderer (\S\ref{sec:renderer}) are patch-wise linearly projected to obtain a sequence of patch embeddings with a 16 $\times$ 16 pixel resolution, to which fixed sinusoidal position embeddings are added.\footnote{This is a fast operation that does not require the large text embedding layer found in subword-based models, saving parameters which could in theory be re-allocated to the self-attention stack. We refer to \citet{xue-etal-2022-byt5} for a discussion regarding benefits and drawbacks of re-allocation of embedding layer weights.}

\begin{wrapfigure}{r}{0.35\textwidth}
    \vspace{-2.25em}
    \centering
    \begin{minipage}{\linewidth}
    \begin{algorithm}[H]
    \caption{\model Span Masking}\label{algo:span_masking}
    \begin{algorithmic}
        \scriptsize
        \Input{\#Image patches $N$, masking ratio $R$, maximum masked span length $S$, span length cumulative weights ${W=\{w_1,\ldots,w_S\}}$}
        \Output{Masked patches $\mathcal{M}$}
        \State $\mathcal{M} \leftarrow \emptyset $
        \Repeat{
            \State $s \leftarrow \text{randchoice}({\{1,\ldots, S\},W}$) %\Comment{$S=6$, $\mathbb{E}(s)=3.1$}
            \State $l \leftarrow \text{randint}(0, \text{max}(0, N - s))$
            \State $r \leftarrow l + s$
            \If{$\mathcal{M} \cap \{l-s, \ldots, l-1\} = \emptyset $ \textbf{and} \\\hspace{8mm}$\mathcal{M} \cap \{r+1, \ldots, r+s\} = \emptyset $}
            \State $\mathcal{M} \leftarrow \mathcal{M} \cup \{l,\ldots,r\}$
            \EndIf
        }
        \Until{$\lvert \mathcal{M} \rvert > R \cdot N$} %\Comment{$R=0.25$}\\
        \Return{$\mathcal{M}$}
    \end{algorithmic}
    \end{algorithm}
    \end{minipage}
    \vspace{-1em}
\end{wrapfigure}

\vspace{-2mm}
\paragraph{Patch Span Masking} Instead of the random masking procedure used in ViT-MAE or block-wise masking in BEiT \citep{bao2022beit}, \model uses span masking with a 25\% masking ratio as outlined in Algorithm~\ref{algo:span_masking}, which masks spans of up to $S=6$ consecutive image patches with a dynamic number of unmasked patches left between them. The idea behind the span masking approach, inspired by T5 \citep{raffel-etal-2020-t5} and SpanBERT \citep{joshi-etal-2020-spanbert}, is that it masks more meaningful units of text (full words or phrases) than random masking where the model more often has to fill in (parts of) individual characters, thereby encouraging \model to model a higher level of abstraction.
In practice, span masking was slightly more effective than random masking in early prototypes of \model. 
This effect may be less noticeable at higher masking ratios (such as the 75\% used in ViT-MAE), when random masking would more often mask consecutive patches.
We found 25\% masking ratio to work well for \model-base, which is in line with recent findings for \textsc{bert}-type models of similar size \citep{wettig-etal-2022-mask}. 
We mask spans of $s \in \{1,2,3,4\}$ patches in length, each with 20\% probability, and spans of $s \in \{5,6\}$ patches with 10\% probability each, so $\mathbb{E}(s)=3.1$.
[Screenshots from 2023-08-30 comparing the current ar5iv HTML rendering of 2207.06991 ("Language Modelling with Pixels") with the arXiv PDF.]