Requirements:
Should run in the background* as a cron job or a service.
Should periodically query Elasticsearch** for all tweets that have not yet been embedded, processing them in chunks of K tweets at a time.
The text of each tweet should be cleaned and then embedded using the Universal Sentence Encoder.
The resulting vectors should be written back to Elasticsearch using one bulk partial update per chunk.
* We don't want to embed tweets as they are received from the Twitter streaming API, because doing so synchronously would bog down the twitter_monitor process. That would reduce our ingestion throughput and risk buffer-full disconnects from the Twitter side.
** Note: this requires Elasticsearch 7.x for support of the dense_vector field type.
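
The requirements above could be sketched roughly as follows. This is a minimal illustration, not a settled design: the index name `tweets`, the field names `text` and `embedding`, the cleaning rules (dropping URLs and @mentions), and the helper names are all assumptions, since none of them are specified here. The Elasticsearch calls and the encoder are passed in as plain callables so the chunking logic can be exercised without a cluster.

```python
import re
from typing import Callable, Dict, List, Sequence

# Hypothetical names: the requirements do not fix the index or field names.
INDEX = "tweets"
TEXT_FIELD = "text"
VECTOR_FIELD = "embedding"

_URL = re.compile(r"https?://\S+")
_MENTION = re.compile(r"@\w+")
_WHITESPACE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Assumed cleaning: drop URLs and @mentions, then collapse whitespace."""
    text = _URL.sub(" ", text)
    text = _MENTION.sub(" ", text)
    return _WHITESPACE.sub(" ", text).strip()


def unembedded_query(chunk_size: int) -> Dict:
    """Search body matching up to chunk_size tweets that have no vector yet."""
    return {
        "size": chunk_size,
        "_source": [TEXT_FIELD],
        "query": {"bool": {"must_not": {"exists": {"field": VECTOR_FIELD}}}},
    }


def build_update_actions(
    hits: Sequence[Dict],
    embed: Callable[[List[str]], Sequence[Sequence[float]]],
) -> List[Dict]:
    """Turn one chunk of search hits into bulk partial-update actions."""
    if not hits:
        return []
    texts = [clean_text(hit["_source"][TEXT_FIELD]) for hit in hits]
    vectors = embed(texts)  # one vector per cleaned text, in the same order
    return [
        {
            "_op_type": "update",  # partial update: only the vector field is sent
            "_index": INDEX,
            "_id": hit["_id"],
            "doc": {VECTOR_FIELD: [float(x) for x in vector]},
        }
        for hit, vector in zip(hits, vectors)
    ]


def embed_pending(
    search: Callable[[Dict], Sequence[Dict]],
    bulk: Callable[[List[Dict]], object],
    embed: Callable[[List[str]], Sequence[Sequence[float]]],
    chunk_size: int = 100,
) -> int:
    """Embed chunk after chunk until no unembedded tweets remain.

    Returns the number of tweets processed. The bulk callable is assumed to
    make its writes visible to the next search (for example by refreshing the
    index); otherwise the same chunk may be fetched and embedded again.
    """
    total = 0
    while True:
        hits = search(unembedded_query(chunk_size))
        if not hits:
            return total
        bulk(build_update_actions(hits, embed))
        total += len(hits)
```

In a real deployment, `search` would likely wrap `es.search(index=INDEX, body=...)` and return `["hits"]["hits"]`, `bulk` would wrap `elasticsearch.helpers.bulk(es, actions)` followed by an index refresh, and `embed` would wrap a Universal Sentence Encoder model loaded through TensorFlow Hub. The `embedding` field would need a `dense_vector` mapping whose `dims` matches the encoder output (512 for the standard Universal Sentence Encoder). A cron job could call `embed_pending` once per run, while a service could call it in a loop with a sleep between passes.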