CatalystCode / project-fortis-spark

A repository for all spark jobs running on fortis
MIT License

Non-base language events are being dropped in the pipeline #174

Closed erikschlegel closed 6 years ago

erikschlegel commented 7 years ago

It looks like we're writing a mix of English and Spanish terms to Cassandra. For example, suppose ataque is a watchlist term for a Fortis site whose primary language is Spanish, with English translation support. If ataque is mentioned in a Spanish tweet, we archive that term in Cassandra. If attack is mentioned in an English tweet, we archive attack instead. A data sample is listed below. This presents a problem in the Fortis interface, as the services expect content to be aggregated by terms in the base language. We need to enhance the keyword extraction analyzer to normalize these terms, so that a detected attack is stored as ataque.

    month | ataque |                   |                   |    11 |     Twitter |   EmbVZLA_enEsp |   11_917_648 | 2017-10-01 00:00:00.000000+0000 |                     0 |            1
      day | ataque |                   |                   |     6 |     Twitter | RaicesPeronista |      6_28_20 | 2017-10-03 00:00:00.000000+0000 |                     0 |            1
      day | ataque |                   |                   |    13 |     Twitter |         sutpmcu | 13_3670_2592 | 2017-10-03 00:00:00.000000+0000 |                     0 |            1
    month | attack |                   |                   |    13 |     Twitter |             all | 13_3670_2581 | 2017-10-01 00:00:00.000000+0000 |                     0 |            1
     hour | attack |                   |                   |    13 |     Twitter |             all | 13_3670_2581 | 2017-10-03 01:00:00.000000+0000 |                     0 |            1
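
To make the requested normalization concrete, here is a minimal sketch of the idea in Python. It is not the Fortis analyzer's actual code: the `TRANSLATIONS_TO_BASE` table and the `normalize_keyword` and `aggregate_mentions` helpers are hypothetical names, and the sketch assumes a simple one-to-one mapping from each translated term to its base-language watchlist term.

```python
from collections import Counter

# Hypothetical translation table for a site whose base language is Spanish.
# Assumption: each translated term maps to exactly one base-language term.
TRANSLATIONS_TO_BASE = {
    "attack": "ataque",
}


def normalize_keyword(keyword, translations):
    """Return the base-language form of a keyword.

    Keywords without a known translation are returned lowercased and
    stripped, but otherwise unchanged.
    """
    key = keyword.strip().lower()
    return translations.get(key, key)


def aggregate_mentions(keywords, translations):
    """Count mentions per base-language term."""
    return Counter(normalize_keyword(k, translations) for k in keywords)


# Mentions extracted from a mix of Spanish and English tweets.
mentions = ["ataque", "attack", "Attack", "ataque"]
counts = aggregate_mentions(mentions, TRANSLATIONS_TO_BASE)
```

With this normalization applied before aggregation, every mention above is counted under ataque, so no rows would be written under attack.
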
c-w commented 6 years ago

@erikschlegel Just to clarify, you'd expect keywords for which we have a translation to be stored as the English keyword? Specifically, in the example above, you'd want all instances of ataque to be replaced with attack?

c-w commented 6 years ago

Resolving as we're now tracking this elsewhere.