I am worried that we currently overmatch some keywords during extraction. For example, we might match a DH keyword as well as
an ECDH keyword for a document that contains the string "This thing implements ECDH as well as some other stuff."
This is due to the way we do regex matching. It is done depth first over all of the rules, where each rule has the regex separator REGEXEC_SEP = r"[ ,;\]”)(]" appended to it (but not prepended). This means that DH correctly does not match "This thing has DHE in it.", but it does match "This thing has ECDH in it.", because nothing constrains the character before the keyword. That is an issue.
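A minimal reproduction of the behaviour described above, assuming a rule is matched by simply concatenating it with the separator and searching the text. The helper name `matches` is made up for illustration and is not taken from the code base:

```python
import re

# Separator as quoted in this issue; it is only appended to each rule.
REGEXEC_SEP = r"[ ,;\]”)(]"


def matches(rule: str, text: str) -> bool:
    """Hypothetical helper: True if rule + separator occurs anywhere in text."""
    return re.search(rule + REGEXEC_SEP, text) is not None


# "DH" is followed by "E", which is not a separator, so there is no match.
print(matches("DH", "This thing has DHE in it."))

# The "DH" at the end of "ECDH" is followed by a space, so it matches.
print(matches("DH", "This thing has ECDH in it."))
```

The first call returns False and the second returns True, which is exactly the asymmetry between the trailing and leading side of the keyword.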
We need to think a bit more about the keyword extraction strategy, what we are trying to achieve with REGEXEC_SEP,
and whether a better solution exists.
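One possible direction, sketched here only as a starting point for discussion: guard both sides of the keyword with lookarounds instead of consuming a trailing separator. The function name `keyword_pattern` and the choice of "not an ASCII letter or digit" as the boundary are assumptions, not existing code. Note that this also changes behaviour in other ways, for example a keyword directly followed by a period or at the end of the text would now match.

```python
import re


def keyword_pattern(rule: str) -> re.Pattern:
    """Hypothetical helper: wrap a rule so it cannot sit inside a longer token.

    The lookbehind rejects a letter or digit right before the keyword, and
    the lookahead rejects one right after it. Neither consumes any text.
    """
    return re.compile(r"(?<![A-Za-z0-9])(?:" + rule + r")(?![A-Za-z0-9])")


text = "Supports DH, ECDH and DHE."

# Only the standalone "DH" is found; the ones inside "ECDH" and "DHE" are skipped.
print(keyword_pattern("DH").findall(text))

# "ECDH" is still found as its own keyword.
print(keyword_pattern("ECDH").findall(text))
```

Whether letters and digits are the right boundary set is an open question; keywords that legitimately appear with suffixes or hyphenated variants would need their own rules.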