dginev / ar5iv

A web service offering HTML5 articles from arXiv.org as converted with latexml
https://ar5iv.org
MIT License

Improve article 2207.06991 #315

Open dginev opened 1 year ago

dginev commented 1 year ago

Exact location of issue

As reported on social media: undefined macros in the main body of the document.

Separately, there is a fidelity issue around Algorithm 1 that is worth considering in general for minipages + floats. Namely, latexml adds a width attribute to the minipage's ltx_para, constraining it on a wide screen, but our current CSS makes no attempt to place the algorithm listing to the right of the paragraph. Ideally, this would be another use for flexbox: the paragraph and algorithm minipages would sit side-by-side on wide displays and reflow into a vertical stack on mobile.
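
For reference, a minimal sketch of what that flexbox treatment could look like. The wrapper class is hypothetical and the ltx_* selectors are assumptions about the markup emitted for this particular wrapfigure/minipage, so this is a starting point rather than a drop-in rule set:

/* Sketch only: .ltx_flex_figure_wrap is a hypothetical wrapper; the ltx_*
   selectors would need to be checked against the actual HTML for 2207.06991. */
.ltx_flex_figure_wrap {
  display: flex;
  flex-wrap: wrap;          /* lets the minipage drop below the text when narrow */
  gap: 1em;
  align-items: flex-start;
}

/* Running text takes the remaining width on wide screens. */
.ltx_flex_figure_wrap > .ltx_para {
  flex: 1 1 24em;
}

/* The algorithm minipage keeps roughly its 0.35\textwidth share. */
.ltx_flex_figure_wrap > .ltx_minipage {
  flex: 0 1 35%;
}

/* Narrow viewports: reflow to a vertical stack. */
@media (max-width: 40em) {
  .ltx_flex_figure_wrap > .ltx_para,
  .ltx_flex_figure_wrap > .ltx_minipage {
    flex-basis: 100%;
  }
}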

Problem details

Conversion complete: 4 warnings; 2 errors;
2 undefined macros[\textcommabelow, \euflag];
2 missing files[inconsolata.sty, euflag.sty]

dginev commented 1 year ago

Minipage with algorithm is at: https://ar5iv.labs.arxiv.org/html/2207.06991#S2.SS2.SSS0.Px2.p2

The source TeX in question is:

\vspace{-2mm}
\paragraph{Patch Embeddings} The images produced by the text renderer (\S\ref{sec:renderer}) are patch-wise linearly projected to obtain a sequence of patch embeddings with a 16 $\times$ 16 pixel resolution, to which fixed sinusoidal position embeddings are added.\footnote{This is a fast operation that does not require the large text embedding layer found in subword-based models, saving parameters which could in theory be re-allocated to the self-attention stack. We refer to \citet{xue-etal-2022-byt5} for a discussion regarding benefits and drawbacks of re-allocation of embedding layer weights.}

\begin{wrapfigure}{r}{0.35\textwidth}
    \vspace{-2.25em}
    \centering
    \begin{minipage}{\linewidth}
    \begin{algorithm}[H]
    \caption{\model Span Masking}\label{algo:span_masking}
    \begin{algorithmic}
        \scriptsize
        \Input{\#Image patches $N$, masking ratio $R$, maximum masked span length $S$, span length cumulative weights ${W=\{w_1,\ldots,w_S\}}$}
        \Output{Masked patches $\mathcal{M}$}
        \State $\mathcal{M} \leftarrow \emptyset $
        \Repeat{
            \State $s \leftarrow \text{randchoice}({\{1,\ldots, S\},W}$) %\Comment{$S=6$, $\mathbb{E}(s)=3.1$}
            \State $l \leftarrow \text{randint}(0, \text{max}(0, N - s))$
            \State $r \leftarrow l + s$
            \If{$\mathcal{M} \cap \{l-s, \ldots, l-1\} = \emptyset $ \textbf{and} \\\hspace{8mm}$\mathcal{M} \cap \{r+1, \ldots, r+s\} = \emptyset $}
            \State $\mathcal{M} \leftarrow \mathcal{M} \cup \{l,\ldots,r\}$
            \EndIf
        }
        \Until{$\lvert \mathcal{M} \rvert > R \cdot N$} %\Comment{$R=0.25$}\\
        \Return{$\mathcal{M}$}
    \end{algorithmic}
    \end{algorithm}
    \end{minipage}
    \vspace{-1em}
\end{wrapfigure}

\vspace{-2mm}
\paragraph{Patch Span Masking} Instead of the random masking procedure used in ViT-MAE or block-wise masking in BEiT \citep{bao2022beit}, \model uses span masking with a 25\% masking ratio as outlined in Algorithm~\ref{algo:span_masking}, which masks spans of up to $S=6$ consecutive image patches with a dynamic number of unmasked patches left between them. The idea behind the span masking approach, inspired by T5 \citep{raffel-etal-2020-t5} and SpanBERT \citep{joshi-etal-2020-spanbert}, is that it masks more meaningful units of text (full words or phrases) than random masking where the model more often has to fill in (parts of) individual characters, thereby encouraging \model to model a higher level of abstraction.
In practice, span masking was slightly more effective than random masking in early prototypes of \model. 
This effect may be less noticeable at higher masking ratios (such as the 75\% used in ViT-MAE), when random masking would more often mask consecutive patches.
We found 25\% masking ratio to work well for \model-base, which is in line with recent findings for \textsc{bert}-type models of similar size \citep{wettig-etal-2022-mask}. 
We mask spans of $s \in \{1,2,3,4\}$ patches in length, each with 20\% probability, and spans of $s \in \{5,6\}$ patches with 10\% probability each, so $\mathbb{E}(s)=3.1$.
[Screenshots from 2023-08-30 comparing the current ar5iv HTML rendering of 2207.06991 ("Language Modelling with Pixels") with the arXiv PDF.]