f4bD3v / humanitas

A price prediction toolset for developing countries
BSD 3-Clause "New" or "Revised" License

Filtering & NLP of tweets #19

Closed f4bD3v closed 10 years ago

f4bD3v commented 10 years ago

To make sense of the tweets we're collecting, we have to cluster them according to the indicators we want to feed into our neural networks.

The first step is to filter the tweets hierarchically according to certain categories:

general:

specific: Price --> Food --> Commodity --> Indicator

Indicator words are "increase", "decrease", "high", "low" and their synonyms. What are good indicators for predicting a price?
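A minimal sketch of the hierarchical filter, assuming plain keyword matching on lowercased tokens. Only the indicator words come from this issue; the price and commodity keyword sets, and the choice to let commodity words also count as food words, are assumptions made for illustration.

```python
import re

# Hypothetical keyword sets. Only the indicator words are taken from the issue.
COMMODITY = {"rice", "wheat", "onion"}
LEVELS = [
    ("price", {"price", "prices", "cost", "costs"}),
    ("food", {"food", "groceries", "snack", "cook"} | COMMODITY),
    ("commodity", COMMODITY),
    ("indicator", {"increase", "decrease", "high", "low"}),
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def classify(text):
    """Return the levels a tweet passes, in order, stopping at the first miss."""
    tokens = set(tokenize(text))
    passed = []
    for name, keywords in LEVELS:
        if not tokens & keywords:
            break
        passed.append(name)
    return passed

print(classify("The price of rice is high"))
print(classify("Food prices are crazy"))
```

A tweet only reaches the indicator level if it matched every level before it, so the length of the returned list says how far down the hierarchy it got.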

The tweets we group into these categories are then ordered by their timestamps, counted and fed into the network as a sequence for each category. The scaling coefficient will have to be found empirically.
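One way the counting step could look, assuming daily buckets and one sequence per category. The bucket size, the function names and the coefficient value are placeholders; the real coefficient still has to be found empirically.

```python
from collections import Counter
from datetime import date, datetime

def daily_counts(timestamps, start, days):
    """Count tweets per day from `start`, with zeros for days without tweets."""
    counts = Counter((ts.date() - start).days for ts in timestamps)
    return [counts.get(offset, 0) for offset in range(days)]

def scale(sequence, coefficient):
    """Multiply every count by a scaling coefficient (placeholder value)."""
    return [coefficient * count for count in sequence]

# Made-up timestamps for one category.
timestamps = [
    datetime(2014, 5, 1, 9, 0),
    datetime(2014, 5, 1, 17, 30),
    datetime(2014, 5, 3, 12, 0),
]
sequence = daily_counts(timestamps, start=date(2014, 5, 1), days=3)
print(sequence)
print(scale(sequence, 0.5))
```

Filling empty days with zeros keeps every category's sequence the same length, which matters if the sequences are fed into the network side by side.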

The question is, given time constraints, do we want to implement simple filtering or a feature-based clustering algorithm?

If we implement the latter, do we use k-means clustering or spectral clustering?
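To get a feel for the feature-based option, here is a rough sketch of plain k-means (Lloyd's algorithm) on bag-of-words vectors, written without external libraries. The vocabulary, the sample tweets and the fixed initial centroids are all made up for illustration; spectral clustering is not shown.

```python
import re
from collections import Counter

def bag_of_words(text, vocabulary):
    """Turn a tweet into a vector of word counts over a fixed vocabulary."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [float(counts[word]) for word in vocabulary]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm with caller-supplied initial centroids."""
    centroids = [list(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

vocabulary = ["price", "rice", "high", "weather", "rain"]
tweets = [
    "rice price high",
    "price of rice is high today",
    "rain and bad weather",
    "weather: more rain",
]
points = [bag_of_words(t, vocabulary) for t in tweets]
print(kmeans(points, centroids=[points[0], points[2]]))
```

Even this toy version needs a vocabulary and a choice of the number of clusters, which is part of the time cost to weigh against simple filtering.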

f4bD3v commented 10 years ago

Additional words for general food category: 'snack', 'rice'?, 'groceries', 'cook'

Work in progress, guys. We have to make absolutely sure we get all the tweets through filtering, and I think we could still refine our approach.

This pattern library is really powerful, and we could run it on the tweets we write to the database.

For filtering, we should use suggestions as well as edit distance and PoS tagging:

- Keep the associated PoS tags for all keywords.
- If a word doesn't fit any keyword, compute a suggestion, then check that the PoS tags match and that the edit distance threshold (to the keyword) is satisfied.
- If no suggestion is available, just check the edit distance and compare the PoS tags.
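The matching rules above could be sketched roughly like this. It assumes the PoS tag of each word is already known from some tagger, and takes the suggestion step as an optional callable rather than calling any particular library. The keyword table, its tags and the threshold of 1 are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]

# Hypothetical keyword -> PoS tag table (Penn Treebank style labels).
KEYWORD_TAGS = {"price": "NN", "rice": "NN", "increase": "VB", "high": "JJ"}

def match_keyword(word, tag, suggest=None, max_distance=1):
    """Return the keyword that `word` maps to, or None if nothing fits."""
    word = word.lower()
    if word in KEYWORD_TAGS:
        return word
    # Use the spelling suggestion when one is available, else the raw word.
    candidate = (suggest(word) if suggest else None) or word
    for keyword, keyword_tag in KEYWORD_TAGS.items():
        if tag == keyword_tag and edit_distance(candidate, keyword) <= max_distance:
            return keyword
    return None

print(match_keyword("prise", "NN"))
print(match_keyword("hihg", "JJ", suggest=lambda w: "high"))
```

Requiring the PoS tags to agree keeps a near-miss spelling from matching a keyword that is used as a different part of speech.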