I think we are loading all of NomBank for each sentence, which is the slowest part of the preprocessing pipeline. Is there an easy way to avoid this, e.g.
splitting up the NomBank lines by document into .nom files, analogously to the .prop files, or
using NLTK's NomBank interface (I doubt this is much faster), or
adding NomBank annotations for all sentences at once (rather than looping through the sentence files in a shell script)?
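A minimal sketch of option (a), assuming NomBank annotation lines begin with the path of the annotated file as their first whitespace-separated field (as in the .prop files); the sample lines and output paths here are hypothetical:

```python
from collections import defaultdict
from pathlib import Path

def split_nombank_by_document(nombank_lines):
    """Group NomBank annotation lines by the source document named in
    each line's first whitespace-separated field."""
    by_doc = defaultdict(list)
    for line in nombank_lines:
        doc = line.split(None, 1)[0]
        by_doc[doc].append(line)
    return dict(by_doc)

# Hypothetical sample lines in the assumed format: <file> <sentence> <token> ...
sample = [
    "wsj/05/wsj_0544.mrg 0 10 share 01 10:0-rel 0:2-ARG1",
    "wsj/05/wsj_0544.mrg 3 2 loss 01 2:0-rel 3:1-ARG1",
    "wsj/06/wsj_0601.mrg 1 5 plan 01 5:0-rel 6:1-ARG1",
]
by_doc = split_nombank_by_document(sample)

# One .nom file per document, analogously to the .prop files
# (commented out so the sketch has no filesystem side effects):
# for doc, lines in by_doc.items():
#     out = Path("nom") / (Path(doc).stem + ".nom")
#     out.parent.mkdir(parents=True, exist_ok=True)
#     out.write_text("\n".join(lines) + "\n")
```

With per-document files, each sentence's preprocessing only needs to read the small .nom file for its document instead of scanning all of NomBank.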