KWARC / llamapun

common language and mathematics processing algorithms, in Rust
https://kwarc.info/systems/llamapun/
GNU General Public License v3.0

404 on SENNA links #72

Closed msgoff closed 2 years ago

msgoff commented 2 years ago

Hello

I have spent a lot of time learning how to parse the LaTeX found in the arXiv corpus. I am interested in contributing if you have some basic tasks where I could be useful.

Best Regards, Mike

dginev commented 2 years ago

Hi @msgoff !

You titled the issue "404 on SENNA links", which sounds like a problem for the rust-senna wrapper. Maybe we can transfer the issue there?

Separately, the way you've described your interest sounds like the intersection of ar5iv and latexml. You can take a look at the ar5iv issues for known problems with our conversion to HTML, and consider contributing upgrades to latexml, if that seems like an activity you would enjoy.

The llamapun repository here is currently in maintenance mode and isn't actively developed. Its tasks start where the conversion to HTML ends -- there are utilities to map down to plain text, and some experiments using basic ~2016 NLP methods.

There is a separate preprocessing library I have been working on, but I have kept its repository private until the bits there stabilize.

msgoff commented 2 years ago

Hello @dginev Sorry, I wasn't aware that this repository is in maintenance mode.
In the README for this project, the following links are no longer valid.
Maybe it would be OK to link to web.archive.org instead:
http://web.archive.org/web/20140208134927/http://ml.nec-labs.com/senna/

Tokenization - rule-based sentence segmentation, and SENNA word tokenization
Part-of-speech tagging (via SENNA),
Named Entity recognition (via SENNA),
Chunking and shallow parsing (via SENNA),

I have seen that you are working on the NLP side of things, and I had not heard of SENNA before, which is why I was interested in learning more about the project.

Thank you for the suggestions. I will look into ar5iv and latexml issues.

dginev commented 2 years ago

Thanks for clarifying. I just updated the README file.

I hope I can make more of the post-2020 NLP work I have been doing public at some point before the end of the year, but it could be next year. You can take a look at the other open issue here (#59), which gives a taste of the data. There is an associated talk I gave a couple of years ago too. That said, there are now mainstream models one can use instead, if math syntax isn't a core interest (and if LaTeX macros are the preferred modality for math).