Hironsan / anago

Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging and so on.
https://anago.herokuapp.com/
MIT License

Best practice for long documents #87

Open psinger opened 5 years ago

psinger commented 5 years ago

I am working with documents that are much longer than the short sequences the current version seems tailored towards. At the moment, the code pads all sequences to the length of the longest one, which in my case can be quite long.

Is there a best practice for dealing with this? I assume the best way is to first split the documents into sequences myself, but how long should they be, and should they overlap or stand on their own?
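
To make the question concrete, here is a minimal sketch of the kind of splitting I have in mind. The function name `split_into_windows` and the `max_len` and `overlap` defaults are placeholders I made up for illustration, not anything from anago and not recommended values.

```python
from typing import List, Tuple


def split_into_windows(
    tokens: List[str],
    labels: List[str],
    max_len: int = 100,   # hypothetical window size, not a tuned value
    overlap: int = 20,    # hypothetical overlap, 0 means disjoint windows
) -> List[Tuple[List[str], List[str]]]:
    """Split one long tagged document into fixed-size, optionally overlapping windows."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    if not tokens:
        return []

    step = max_len - overlap  # how far the window start advances each time
    windows = []
    start = 0
    while True:
        end = start + max_len
        # Slicing past the end is safe: the last window is simply shorter.
        windows.append((tokens[start:end], labels[start:end]))
        if end >= len(tokens):
            break
        start += step
    return windows
```

Each window would then be treated as a separate training sequence, so padding is bounded by `max_len` rather than by the longest document. With `overlap > 0`, tokens near a window boundary appear in two windows, so predictions for them would have to be merged somehow.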

Looking forward to any pointers. Thanks.