It looks like we're writing a mix of English and Spanish terms to Cassandra. For example, suppose `ataque` is a watchlist term for a Fortis site whose primary language is Spanish, with English translation support. If `ataque` is mentioned in a Spanish tweet, we archive that term in Cassandra. We do the same if `attack` is mentioned in an English tweet. A data sample is listed below. This causes a problem in the Fortis interface, because the services expect content to be aggregated by terms in the base language. We need to enhance the keyword extraction analyzer to normalize these terms, so that a detected `attack` is stored as `ataque`.
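The normalization described above might look roughly like the following sketch. It is not the Fortis analyzer itself: `TRANSLATION_TO_BASE` and `normalize_keywords` are hypothetical names, the lookup table is hard-coded for illustration, and lowercasing keywords before the lookup is an assumption.

```python
from typing import Dict, Iterable, List

# Hypothetical lookup table: translated watchlist term -> base-language term.
# In this example the site's base language is Spanish.
TRANSLATION_TO_BASE: Dict[str, str] = {
    "attack": "ataque",
}


def normalize_keywords(keywords: Iterable[str], to_base: Dict[str, str]) -> List[str]:
    """Map each extracted keyword to its base-language term, dropping duplicates."""
    seen = set()
    normalized: List[str] = []
    for keyword in keywords:
        key = keyword.lower()  # assumption: matching is case-insensitive
        base = to_base.get(key, key)  # terms without a translation pass through
        if base not in seen:
            seen.add(base)
            normalized.append(base)
    return normalized
```

With this table, keywords extracted from an English tweet mentioning `attack` and from a Spanish tweet mentioning `ataque` would both be stored under `ataque`, so aggregation by base-language term sees a single keyword.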
@erikschlegel Just to clarify, you'd expect keywords for which we have a translation to be stored as the English keyword? Specifically, in the example above, you'd want all instances of `ataque` to be replaced with `attack`?
Copied from https://github.com/CatalystCode/project-fortis-spark/issues/174