Is this repository suitable for keyword extraction and Chinese word segmentation?

dwadden / dygiepp

Span-based system for named entity, relation, and event extraction.

MIT License

575 stars, 120 forks

Is this repository suitable for keyword extraction and Chinese word segmentation? #1

Closed. Jiakui closed this issue 4 years ago

Jiakui commented 5 years ago

Hi,

I think span representation is a great idea. Do you think it is suitable for keyword extraction and Chinese word segmentation?

Thanks!

dwadden commented 5 years ago

Hi Jiakui,

I'm not that familiar with keyword extraction. Is it similar to named entity recognition? If so, you should be able to use our model for that. I'm still working on cleaning it up so that it's easily usable, hopefully in the next month or so.

For a sequence tagging task like segmentation I don't think there's really an advantage to enumerating all possible text spans. I think you're better off using an LSTM-CRF or something like that.
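To make the sequence-tagging framing concrete: one common way to cast segmentation as tagging is to give every character a BMES label (Begin, Middle, End, Single), which is the kind of target an LSTM-CRF would predict. This is only a minimal sketch; the function name `words_to_bmes` and the example sentence are made up for illustration and are not part of this repository.

```python
def words_to_bmes(words):
    """Map a list of already-segmented words to per-character BMES tags.

    Hypothetical helper for illustration only; not part of DyGIE++.
    """
    tags = []
    for word in words:
        if not word:
            continue  # skip empty strings defensively
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first char is B, last char is E, everything between is M
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


if __name__ == "__main__":
    words = ["我", "喜欢", "自然语言"]  # a gold segmentation
    chars = [ch for w in words for ch in w]
    for ch, tag in zip(chars, words_to_bmes(words)):
        print(ch, tag)
```

Each character gets exactly one tag, so the output has the same length as the input. That is why a tagger only scores one label sequence of length n rather than every candidate span.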

Let me know if you've got more questions!

Dave

luanyi commented 5 years ago

In fact, we have already applied SciIE (an earlier version of DyGIE) to a scientific keyword extraction task (https://arxiv.org/abs/1808.09602) and observed an improvement over LSTM+CRF. For Chinese word segmentation, I actually think Chinese might be a better language for DyGIE, since there are no clear word boundaries in Chinese (the input is purely character-level). DyGIE might be able to solve the problem better because it enumerates all possible text spans. At the very least, it will certainly work better than a traditional LSTM+CRF on segments that overlap.
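As a rough illustration of what "enumerating all possible text spans" means for character-level input, here is a minimal sketch. The function names, the `max_width` cap, and the example sentence are assumptions made for illustration; this is not the DyGIE code itself.

```python
def enumerate_spans(tokens, max_width):
    """Return every (start, end) span, end inclusive, up to max_width tokens long."""
    n = len(tokens)
    spans = []
    for start in range(n):
        for end in range(start, min(start + max_width, n)):
            spans.append((start, end))
    return spans


def gold_word_spans(words):
    """Turn a gold segmentation into inclusive character-offset spans."""
    spans = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word) - 1))
        start += len(word)
    return spans


if __name__ == "__main__":
    words = ["我", "喜欢", "自然语言"]  # hypothetical gold segmentation
    chars = [ch for w in words for ch in w]
    gold = gold_word_spans(words)
    # Every candidate span gets a binary label: is it a gold word or not?
    for start, end in enumerate_spans(chars, max_width=4):
        label = "WORD" if (start, end) in gold else "O"
        print("".join(chars[start:end + 1]), label)
```

Because every candidate span is scored independently, overlapping or nested spans can both be labeled positive, which a single BMES tag sequence cannot express. The cost is that the number of candidates grows roughly with sentence length times `max_width`.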

dwadden commented 5 years ago

@Jiakui just letting you know that the code runs now. If you pull it and follow the instructions in the README, you should be able to train a model. For Chinese word segmentation, you'll probably want to modify the NER module: https://github.com/dwadden/dygiepp/blob/master/dygie/models/ner.py.