PolMine / bignlp

Tools to process large corpora line-by-line and in parallel mode

corenlp_annotate() | creating new nodes removes potential attributes #36

Closed ChristophLeonhardt closed 1 year ago

ChristophLeonhardt commented 2 years ago

Problem

When using corenlp_annotate() in bignlp version 0.1.3.9002, it is necessary to identify the nodes that hold the actual textual content. This is done via the argument xpath, which defaults to "//p". The text of these nodes is retrieved and passed to the annotation pipeline; finally, the name of the initial node is added to the annotated data.

A remaining gap is that, as a consequence, only the name and the text of the node matched by xpath are kept, while any attributes of that node are silently dropped.
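The effect can be illustrated without bignlp. The following is a minimal sketch that only mimics the described behaviour with xml2: the input document and the string-based rebuilding step are assumptions made for illustration, not the code in corenlp_annotate().

```r
library(xml2)

# Hypothetical input: two <p> nodes that carry attributes.
doc <- read_xml(
  '<doc><p id="1" speaker="A">First text.</p><p id="2" speaker="B">Second text.</p></doc>'
)

# Select the text-bearing nodes, as the xpath argument does.
text_nodes <- xml_find_all(doc, xpath = "//p")
node_names <- xml_name(text_nodes)
node_text <- xml_text(text_nodes)

# Stand-in for the annotated output: nodes rebuilt from name and text only.
rebuilt <- sprintf("<%s>%s</%s>", node_names, node_text, node_names)
new_doc <- read_xml(paste0("<doc>", paste(rebuilt, collapse = ""), "</doc>"))
new_nodes <- xml_find_all(new_doc, xpath = "//p")

xml_attrs(text_nodes) # original nodes: id and speaker are present
xml_attrs(new_nodes)  # rebuilt nodes: no attributes left
```

The text and the node name survive the round trip, but id and speaker do not, and no warning is issued.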

I think that the attributes should be added back to the new nodes.

Possible Solution

One reasonable solution might be to add

new_nodes <- xml2::xml_find_all(xml_doc_tmp, xpath = xpath)
xml2::xml_attrs(new_nodes) <- lapply(text_nodes, xml2::xml_attrs)

after the following existing chunk:

https://github.com/PolMine/bignlp/blob/e6a6bda102d338880be787caf53b1f03a728600a/R/corenlp.R#L338-L343

At this point, the nodes are back as XML, so adding the attributes from the original text nodes should be fast. It is robust only as long as the annotation pipeline returns every text node (empty text nodes were removed earlier, so this should not be an issue) and returns them in the original order.
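Because the approach depends on node count and order, a guard before copying may be worthwhile. The following is a minimal sketch using xml2 only; restore_attrs() is a hypothetical helper, not part of bignlp, and the two small documents stand in for the original input and the annotated output.

```r
library(xml2)

# Hypothetical helper (not part of bignlp): copy attributes by position.
restore_attrs <- function(new_nodes, text_nodes) {
  # The length check catches dropped nodes. The name check is only a cheap,
  # partial safeguard: it cannot detect reordering among same-named nodes.
  stopifnot(
    length(new_nodes) == length(text_nodes),
    identical(xml_name(new_nodes), xml_name(text_nodes))
  )
  for (i in seq_along(text_nodes)) {
    attrs <- xml_attrs(text_nodes[[i]])
    # Nodes without attributes have no names, so the inner loop is skipped.
    for (nm in names(attrs)) {
      xml_set_attr(new_nodes[[i]], nm, attrs[[nm]])
    }
  }
  invisible(new_nodes)
}

# Assumed example data: the second node has fewer attributes than the first.
original <- read_xml(
  '<doc><p id="1" speaker="A">First text.</p><p id="2">Second text.</p></doc>'
)
annotated <- read_xml("<doc><p>First text.</p><p>Second text.</p></doc>")

text_nodes <- xml_find_all(original, "//p")
new_nodes <- xml_find_all(annotated, "//p")

restore_attrs(new_nodes, text_nodes)
xml_attrs(new_nodes)
```

Copying node by node avoids any dependence on how a list of attribute vectors is simplified when nodes carry different numbers of attributes.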

ablaette commented 1 year ago

Good point and good suggestion. I chose a somewhat different implementation, though, which I found more reassuring for avoiding mix-ups between nodes.