charlesvdv / nom-bibtex

A feature complete bibtex parser using nom
https://docs.rs/nom-bibtex
MIT License
Support for special characters #13

Open netzdoktor opened 3 years ago

netzdoktor commented 3 years ago

Thanks for this nice little crate. I'm using it indirectly, as I am creating a research website using zola.

I have now run into an issue: I have published with people who have special characters in their names. We created a BibTeX file like this to make sure it works with most LaTeX systems.

I don't want to maintain multiple files (one for Zola with ü, one for LaTeX with \"u), so it would be great to improve nom-bibtex.

How could we add support for this in nom-bibtex? I think I would be willing to make a PR with a bit of help.

charlesvdv commented 3 years ago

Hello @Darneas!

Thanks for the kind words and for letting me know that zola is actually using my crate :+1:

If you are willing to do a PR, I would be happy to review and accept it. What kind of information or help would you need in order to contribute?

netzdoktor commented 3 years ago

You're welcome @charlesvdv!

That's FOSS... you never know where your code is put to good use ;-)

A starting question would be where to put the unit tests and the implementation: parser.rs?

I guess a way forward would be for me to add the above-mentioned characters to the unit tests, and then we make progress from the place where things fail (in zola, I got a parser error in IsNot or something like that).

charlesvdv commented 3 years ago

It's definitely in parser.rs that you will find the bug. Regarding unit tests, there are already a bunch of them in there. If you want to ease your testing, you can also add an integration test in the tests folder.

maurofaccin commented 3 years ago

Hi, I'm also a zola user. Expanding on @Darneas's issue, it would be cool if all special symbols described on the BibTeX webpage were recognized and parsed.

In particular, I'm talking about the Special Symbols page.

I'm not sure how LaTeX commands and math mode should be treated, but there is a list of special characters as well as the protected formatting rule (words and characters between curly brackets should not be reformatted; as of now curly brackets are passed by).

TBH, I'm not sure whether this processing belongs in the parser or in the client (in this case Zola).
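
To make the protected formatting rule concrete, here is a minimal sketch of what honoring it could look like on the client side. It is an illustration only, not part of nom-bibtex: the function name `lowercase_unprotected` is hypothetical, and the choice to lowercase unprotected text and drop the braces is an assumption made for the example.

```rust
/// Lowercase a title while leaving brace-protected spans untouched.
/// Hypothetical helper for illustration; not a nom-bibtex API.
pub fn lowercase_unprotected(title: &str) -> String {
    let mut out = String::with_capacity(title.len());
    // Nesting depth of curly brackets at the current position.
    let mut depth = 0usize;
    for c in title.chars() {
        match c {
            // An opening bracket starts (or nests) a protected span and is dropped.
            '{' => depth += 1,
            // A matching closing bracket ends the span and is dropped.
            '}' if depth > 0 => depth -= 1,
            // Inside a protected span: copy the character unchanged.
            _ if depth > 0 => out.push(c),
            // Outside any span: lowercase (an unbalanced '}' is kept as-is).
            _ => out.extend(c.to_lowercase()),
        }
    }
    out
}
```

With this sketch, `An Introduction to {B}ib{TeX}` keeps the capitals inside the brackets and lowercases the rest. LaTeX commands inside the brackets are copied through untouched.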

najtin commented 3 years ago

I use pylatexenc to parse parts of a huge BibTeX file in Python. pylatexenc is really powerful and can parse almost anything you throw at it. Unfortunately, my "homemade" BibTeX parser, which uses pylatexenc internally, is really slow. This is why I searched for a project just like nom-bibtex. Though we cannot directly port the file where the magic happens (https://github.com/phfaist/pylatexenc/blob/master/pylatexenc/latex2text/_defaultspecs.py), we may be able to use it to kickstart things. I am willing to put in the necessary effort in the next few weeks because I desperately need a much faster solution.

charlesvdv commented 3 years ago

I think special symbols should probably be handled by the library, since they are described in the specification. Math mode is a bit more tricky. I would maybe support common use cases that can be easily converted into Unicode code points. For the rest, I guess it would make more sense to let the client code handle it.

@najtin if you have some time to contribute, don't hesitate ;)
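
As a rough illustration of the conversion discussed in this thread, the sketch below maps a few common LaTeX accent sequences (`\"u`, `\"{u}`, `{\"u}`) to precomposed Unicode characters using only the standard library. It is not how nom-bibtex is implemented: the names `compose`, `match_accent` and `latex_accents_to_unicode` are hypothetical, and the accent table is deliberately tiny.

```rust
/// Map an accent command character and a base letter to a precomposed
/// character. Only a few common combinations are covered in this sketch.
fn compose(accent: char, base: char) -> Option<char> {
    let c = match (accent, base) {
        ('"', 'a') => 'ä', ('"', 'o') => 'ö', ('"', 'u') => 'ü',
        ('"', 'A') => 'Ä', ('"', 'O') => 'Ö', ('"', 'U') => 'Ü',
        ('\'', 'a') => 'á', ('\'', 'e') => 'é', ('\'', 'o') => 'ó',
        ('`', 'a') => 'à', ('`', 'e') => 'è',
        ('^', 'e') => 'ê', ('^', 'o') => 'ô',
        ('~', 'n') => 'ñ',
        _ => return None,
    };
    Some(c)
}

/// Try to match an accent sequence at the start of `s`.
/// Returns the composed character and the number of chars consumed.
fn match_accent(s: &[char]) -> Option<(char, usize)> {
    match s {
        // {\"u}
        ['{', '\\', a, b, '}', ..] => compose(*a, *b).map(|c| (c, 5)),
        // \"{u}
        ['\\', a, '{', b, '}', ..] => compose(*a, *b).map(|c| (c, 5)),
        // \"u
        ['\\', a, b, ..] => compose(*a, *b).map(|c| (c, 3)),
        _ => None,
    }
}

/// Replace known accent sequences with Unicode characters.
/// Anything unrecognised is copied through unchanged.
pub fn latex_accents_to_unicode(input: &str) -> String {
    let chars: Vec<char> = input.chars().collect();
    let mut out = String::with_capacity(input.len());
    let mut i = 0;
    while i < chars.len() {
        if let Some((c, len)) = match_accent(&chars[i..]) {
            out.push(c);
            i += len;
        } else {
            out.push(chars[i]);
            i += 1;
        }
    }
    out
}
```

Calling `latex_accents_to_unicode` on a field value such as `M{\"u}ller` would yield `Müller`, while commands outside the table (for example `\H{o}`) are left for the client to handle. A fuller version would need a much larger table, along the lines of the pylatexenc file linked above.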