el33th4x0r / crosstex

CrossTeX is a BibTeX replacement, with better citation and bibliographic database support.
GNU General Public License v2.0

CrossTex fails on bib files with comments #6

Open callegar opened 11 years ago

callegar commented 11 years ago

For instance, consider IEEEabrv.bib, distributed in the IEEE transactions template.

See http://tug.ctan.org/tex-archive/macros/latex/contrib/IEEEtran/bibtex/IEEEabrv.bib

el33th4x0r commented 11 years ago

Hi,

That file doesn't look like a legitimate BIB file to me. In particular, none of the English comments are preceded by any indication that they are a comment; they are just interspersed through the document.

callegar commented 11 years ago

It is legitimate.

See the BibTeX documentation and sample files, btxdoc.pdf and btxdoc.bib. They ship with most TeX/LaTeX distributions, and certainly with TeX Live.

For BibTeX, everything that is not inside an entry is a comment.

callegar commented 11 years ago

Indeed, the bibtex documentation explicitly says: If you want to comment out an entry, simply remove the ‘@’ character preceding the entry type.

el33th4x0r commented 11 years ago

Thanks, I didn't realize that free-form text was considered a legitimate BibTeX comment. One more reason to avoid BibTeX!

I'll see if we can modify our parser to accommodate this, but it may not be so easy to deal with things like unmatched braces in the free-form text.

callegar commented 11 years ago

Indeed, some of the original choices in BibTeX turned out to be really problematic (7-bit encoding, a poor scripting language for bst files... and free-form text). Unfortunately, a very large number of pre-built, downloadable bib files in the scientific area take advantage of the free-form comment style, so I think that a BibTeX replacement should really deal with them properly when working in BibTeX mode.

Rather than modifying the parser, another idea could be to make it a two-step process: first strip away everything that is not inside an entry, then process the entries with your current parser.

Stripping away everything that is not inside an entry is easy with a state machine. Suppose you have two states, ON and OFF, and a counter. Start in OFF. In the OFF state, look at each incoming character: if it is an @, move to the ON state (keeping the @ so the entry stays intact); otherwise drop it. In the ON state, copy each incoming character to the output. If it is a {, increment the counter. If it is a }, decrement the counter, and if the counter reaches 0, move back to the OFF state.
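A minimal Python sketch of that state machine, to make the idea concrete. The function name `strip_free_text` is made up for illustration and is not part of CrossTeX. Two details are assumptions on my side: the @ that triggers the ON state is copied to the output, and a newline is emitted after each entry so that entries do not run together.

```python
def strip_free_text(text):
    """Keep only the text of entries; drop everything outside them.

    Hypothetical helper, not part of CrossTeX. Implements the rough
    two-state machine: any '@' switches copying on, and the brace
    that closes the entry switches it off again.
    """
    out = []
    inside = False  # False is the OFF state, True is the ON state
    depth = 0       # number of currently open braces
    for ch in text:
        if not inside:
            if ch == "@":
                inside = True
                depth = 0
                out.append(ch)  # assumption: the '@' itself is kept
            # any other character is free-form comment text: dropped
            continue
        out.append(ch)
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth <= 0:
                # entry closed (or a stray '}' was seen): back to OFF
                inside = False
                out.append("\n")  # assumption: separate entries by a newline
    return "".join(out)


sample = (
    "Free text here.\n"
    "@article{key, title={A {B} c}, note={x@y.org}}\n"
    "more text } {\n"
    '@string{j = "J"}\n'
)
print(strip_free_text(sample))
```

Unbalanced braces in the free-form text are harmless here because nothing is counted in the OFF state. An @ in the free-form text, however, does switch copying on.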

This is a bit rough, since there may be @ characters in comments (in an email address, for example), but I think it could be good enough. The proper way would be to enter the ON state only on a sequence like "@word{", which requires storing strings.

Be careful to handle @ characters that are inside entries correctly. These must be preserved, since they may belong to a \@ command or be part of an email address in a note. Pybibtex (a competing attempt at building a BibTeX replacement in Python) currently goes crazy on them.
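
A rough Python sketch of the stricter variant, where an entry starts only at something that looks like "@word{". The names `ENTRY_START` and `extract_entries` are hypothetical, and the regular expression is my guess at what "@word{" should mean (letters only, optional whitespace before the brace). It handles brace-delimited entries only, and it stops at an unterminated entry instead of guessing where it ends.

```python
import re

# Assumption: an entry starts with '@', a run of letters, optional
# whitespace, and an opening brace. Other '@' characters are ignored.
ENTRY_START = re.compile(r"@[A-Za-z]+\s*\{")


def extract_entries(text):
    """Return the list of brace-balanced entries found in text.

    Hypothetical helper, not part of CrossTeX.
    """
    entries = []
    pos = 0
    while True:
        match = ENTRY_START.search(text, pos)
        if match is None:
            break
        depth = 1        # the brace matched by ENTRY_START is open
        i = match.end()
        while i < len(text) and depth > 0:
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
            i += 1
        if depth != 0:
            break        # unterminated entry: stop rather than guess
        # Everything between the braces is copied verbatim, so an '@'
        # inside the entry (email address, \@ command) is preserved.
        entries.append(text[match.start():i])
        pos = i
    return entries


text = (
    "Contact me@example.org for fixes.\n"
    "@misc{k, note={mail a@b.org, see \\@foo}}\n"
    "trailing words"
)
print(extract_entries(text))
```

The email address in the free-form text is skipped because "@example" is not followed by a brace, while the @ characters inside the note field come through untouched.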