Closed msgoff closed 2 years ago
Hi @msgoff !
You titled the issue "404 on SENNA links", which sounds like a problem for the rust-senna wrapper. Maybe we can transfer the issue there?
Separately, from the way you've described your interest, it sounds like it sits at the intersection of ar5iv and latexml. You can take a look at the ar5iv issues for known problems with our conversion to HTML, and consider contributing upgrades to latexml, if that seems like an activity you would enjoy.
The llamapun repository here is currently in maintenance mode and isn't actively developed. Its tasks start where the conversion to HTML ends -- there are utilities to map down to plain text, and some experiments using basic ~2016 NLP methods.
There is a separate preprocessing library I have been working on, but I have kept its repository private until the bits there stabilize.
Hello @dginev
Sorry, I wasn't aware that this repository is in maintenance mode.
In the Readme for this project, the following links are no longer valid.
Maybe it would be ok to link to web.archive.org instead.
http://web.archive.org/web/20140208134927/http://ml.nec-labs.com/senna/
Tokenization - rule-based sentence segmentation, and SENNA word tokenization
Part-of-speech tagging (via SENNA),
Named Entity recognition (via SENNA),
Chunking and shallow parsing (via SENNA),
I have seen that you are working on the NLP side of things, and I had not heard of SENNA before, which is why I was interested in learning more about the project.
Thank you for the suggestions. I will look into ar5iv and latexml issues.
Thanks for clarifying, I just updated the readme file.
I hope I can make more of the post-2020 NLP work I have been doing public at some point before the end of the year, but it could be next year. You can take a look at the other open issue here (#59), which gives a taste of the data. There is an associated talk I gave a couple of years ago too. That said, there are now mainstream models one can use instead, if math syntax isn't a core interest (and if latex macros are the preferred modality for math).
Hello
I spend a lot of time learning how to parse LaTeX found in the arXiv corpus. I am interested in contributing if you have some basic tasks where I could be useful.
Best Regards, Mike