An observation by @brucemiller, and I would like to ask him to share additional ones if/when they come to mind:
The addition of latexml's tokenization "role" attributes to the lexemes may actually be "diluting" the context, rather than reinforcing it, when used with ELECTRA/BERT-style WordPiece tokenization. This was something we explored together while examining the ground embedding layer of the arxiv-electra-small model I pretrained back in March.
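To make the "dilution" concrete, here is a quick way to inspect how a WordPiece vocabulary splits a role-prefixed lexeme compared to the bare token. This is only an illustrative sketch using the stock `bert-base-uncased` vocabulary; the arxiv-electra-small vocabulary will split differently, so the exact pieces are not authoritative:

```python
# Sketch: compare how WordPiece splits a role-prefixed lexeme vs. the bare token.
# Uses the stock bert-base-uncased vocabulary for illustration only; the
# arxiv-electra-small vocabulary produces a different (but analogous) split.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

for lexeme in ["ADDOP:plus", "plus", "italic-x"]:
    print(lexeme, "->", tokenizer.tokenize(lexeme))

# The role prefix forces a single mathematical token into several subword
# pieces, which is the suspected "dilution" of the surrounding context.
```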
So, for the 2020 release of arXiv and the respective regeneration of the NLP datasets via llamapun, we may want to simplify the role-prefixed lexemes (e.g. `ADDOP:plus` rather than plain `plus`) down to

```
1 plus ( italic-x if italic-x less-than 0 else italic-y )
```
However, all grouping lexemes (such as `ARRAY:start`, `ARRAY:end`) should be preserved, or we lose important information.
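For concreteness, a minimal sketch of what the stripping could look like, in Python just for illustration (the real change would live on the llamapun side, and `strip_roles` / `PRESERVED_LEXEMES` are hypothetical names, not existing llamapun API):

```python
import re

# Hypothetical whitelist: grouping/boundary lexemes that must survive intact.
# ARRAY:start / ARRAY:end are the ones mentioned above; MATH:start / MATH:end
# would come from the wrapping idea described below.
PRESERVED_LEXEMES = {"ARRAY:start", "ARRAY:end", "MATH:start", "MATH:end"}

# Assumes roles are upper-case prefixes ending in a colon.
ROLE_PREFIX = re.compile(r"^[A-Z]+:")

def strip_roles(lexemes):
    """Drop 'ROLE:' prefixes, except on the preserved grouping lexemes."""
    return [
        lex if lex in PRESERVED_LEXEMES else ROLE_PREFIX.sub("", lex)
        for lex in lexemes
    ]

print(strip_roles(["ADDOP:plus", "ARRAY:start", "ARRAY:end"]))
# -> ['plus', 'ARRAY:start', 'ARRAY:end']
```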
An additional idea we may try for this iteration is to add two wrapping lexemes for each formula (`MATH:start` and `MATH:end`), so that the model becomes more aware of the math boundary. That will be particularly important after this issue is implemented, since we would no longer be able to distinguish between "1 plus" written in plain text vs. in math mode once the roles are stripped out.
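Continuing the sketch above, the wrapping step could be as simple as the following (again, `wrap_formula` is a hypothetical name, only meant to illustrate the intended output):

```python
def wrap_formula(lexemes):
    """Add explicit formula-boundary lexemes around a formula's lexemes."""
    return ["MATH:start"] + list(lexemes) + ["MATH:end"]

# "1 plus" inside a formula stays distinguishable from "1 plus" in running
# text, even after the roles are stripped:
print(" ".join(wrap_formula(["1", "plus"])))
# -> MATH:start 1 plus MATH:end
```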
That said, we don't actually have proof that the roles are a problem for LMs, as my benchmarks have all performed as expected, slightly outperforming the paper's results. Note that the GloVe+biLSTM approach treated e.g. "ADDOP:plus" as a single word, whereas WordPiece chunks it into three separate pieces. The original motivation was a discussion about "FUNCTION", "OPERATOR" and "OPFUNCTION" being potential confounders for the models, as latexml may be mildly guesstimating when assigning them.
Up to Bruce if he wants me to remove them from the latexml output, but I am happy to remove them on the llamapun side, as it is easier to remove them once they're there than to try to add them when they're missing.