Problem
Using `corenlp_annotate()` in bignlp version 0.1.3.9002, it is necessary to identify the nodes in which the actual textual content can be found. This is done via the argument `xpath`, which defaults to `"\\p"`. The text of these nodes is retrieved and passed to the annotation pipeline, and finally the name of the initial node is added to the annotated data.
A remaining gap is that, as a consequence, only the name of the node queried by the xpath and its text are kept, while any attributes are silently dropped.
I think that the attributes should be added back to the new nodes.
Possible Solution
One reasonable solution might be to add the attributes back right after the following existing chunk:
https://github.com/PolMine/bignlp/blob/e6a6bda102d338880be787caf53b1f03a728600a/R/corenlp.R#L338-L343
At this point, the nodes are back as XML and adding attributes from the original text nodes should be fast and robust as long as the annotation pipeline indeed returns all text nodes (empty text nodes were removed earlier, so this should not be an issue) and does so in the correct order.
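To make the idea concrete, here is a minimal sketch using xml2. It is not bignlp code: `restore_attributes()` is a hypothetical helper, the two small documents are made-up stand-ins for the input and for the annotated output, and the `//p` query is only there to select the toy nodes. It relies on the assumption stated above, namely that both node sets have the same length and the same order.

```r
library(xml2)

# Hypothetical helper (not part of bignlp): copy every attribute from each
# original node to the node at the same position in the annotated output.
restore_attributes <- function(original_nodes, annotated_nodes) {
  # Positional matching is only safe if no node was lost or reordered.
  stopifnot(length(original_nodes) == length(annotated_nodes))
  for (i in seq_along(original_nodes)) {
    attrs <- xml_attrs(original_nodes[[i]])
    # Nodes without attributes yield no names, so the inner loop is skipped.
    for (attr_name in names(attrs)) {
      xml_set_attr(annotated_nodes[[i]], attr_name, attrs[[attr_name]])
    }
  }
  invisible(annotated_nodes)
}

# Made-up input and a stand-in for the annotated result (attributes lost).
original <- read_xml(
  '<text><p id="1" speaker="A">First.</p><p id="2">Second.</p></text>'
)
annotated <- read_xml(
  '<text><p><w>First</w><w>.</w></p><p><w>Second</w><w>.</w></p></text>'
)

# xml2 nodes are modified in place, so `annotated` is updated directly.
restore_attributes(
  xml_find_all(original, "//p"),
  xml_find_all(annotated, "//p")
)
```

The length check makes the helper fail loudly rather than attach attributes to the wrong node if the pipeline ever drops a text node. It cannot detect reordering, so that part remains an assumption.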