CatalystCode / project-fortis

Repository for all parts of the Fortis architecture
https://aka.ms/fortis-story
MIT License

Non-base language events are being dropped in the pipeline #28

Open c-w opened 6 years ago

c-w commented 6 years ago

It looks like we're writing a mix of both English and Spanish terms to Cassandra. For example, suppose ataque is a watchlist term for a Fortis site whose primary language is Spanish, with English translation support. If ataque is mentioned in a Spanish tweet, we archive that term in Cassandra. We do the same if attack is mentioned in an English tweet. A data sample is listed below. This presents a problem in the Fortis interface, as the services expect content to be aggregated based on terms in the base language. We need to enhance the keyword extraction analyzer to normalize this properly, so that a mention of attack is recorded as ataque.

      month | ataque |                   |                   |    11 |     Twitter |    EmbVZLA_enEsp |   11_917_648 | 2017-10-01 00:00:00.000000+0000 |                     0 |            1
        day | ataque |                   |                   |     6 |     Twitter |  RaicesPeronista |      6_28_20 | 2017-10-03 00:00:00.000000+0000 |                     0 |            1
        day | ataque |                   |                   |    13 |     Twitter |          sutpmcu | 13_3670_2592 | 2017-10-03 00:00:00.000000+0000 |                     0 |            1
      month | attack |                   |                   |    13 |     Twitter |              all | 13_3670_2581 | 2017-10-01 00:00:00.000000+0000 |                     0 |            1
       hour | attack |                   |                   |    13 |     Twitter |              all | 13_3670_2581 | 2017-10-03 01:00:00.000000+0000 |                     0 |            1
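
For illustration, the normalization requested above could look roughly like the following Python sketch. It is not the Fortis analyzer: the translation table and the names `TRANSLATION_TO_BASE`, `normalize_keyword` and `aggregate_mentions` are hypothetical, and it assumes a plain lookup from each translated watchlist term to its base-language term is enough.

```python
from collections import Counter

# Hypothetical lookup: translated watchlist term -> base-language term.
# In this example the site's base language is Spanish.
TRANSLATION_TO_BASE = {"attack": "ataque"}


def normalize_keyword(keyword, translation_to_base):
    """Return the base-language form of a detected keyword.

    Keywords without a known translation are returned unchanged
    (apart from trimming and lowercasing).
    """
    key = keyword.strip().lower()
    return translation_to_base.get(key, key)


def aggregate_mentions(keywords, translation_to_base):
    """Count mentions per base-language term."""
    return Counter(normalize_keyword(k, translation_to_base) for k in keywords)


# Mentions detected in one Spanish tweet and two English tweets.
counts = aggregate_mentions(["ataque", "attack", "Attack"], TRANSLATION_TO_BASE)
```

With this mapping, all three mentions end up under ataque, so an aggregation keyed on the base-language term would see a single row per period instead of separate ataque and attack rows.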

@erikschlegel Just to clarify, you'd expect keywords for which we have a translation to be stored as the English keyword? Specifically, in the example above, you'd want all instances of ataque to be replaced with attack?


Copied from https://github.com/CatalystCode/project-fortis-spark/issues/174