ISG-ICS / cloudberry

Big Data Visualization
http://cloudberry.ics.uci.edu

TwitterMap : caching the result of a query #373

Closed waans11 closed 6 years ago

waans11 commented 7 years ago

In addition to caching the city polygons, it would be nice to cache the result of a query so that the front-end doesn't need to communicate with Cloudberry every time the current user changes the region. The following things need to be done:

1) Let the middleware send the city-level result at any time.
2) Let the front-end cache that result and use it whenever a user changes the region.
3) Adjust the prefetching and replacement policies.

waans11 commented 7 years ago

The implementation can be divided into several steps.

a. Cache the query result at the city level only, just as we cache the city polygons. A simple, naive pre-fetch policy can be used at this step to ease testing of the feature (maybe 150% of the size of the original request region). The replacement policy can also be simple: when the query keyword or the time range changes, treat it as a new query and empty the cache store.
b. Extend the cache feature to support caching at all levels (state, county, and city).
d. Modify the current middleware to report the city-level result at any time so that the caching effect can be maximized.
e. The results of multiple queries can be cached at this step. Also, a more sophisticated pre-fetch and replacement policy can be built and applied.
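
The naive pre-fetch in step a could look roughly like the sketch below. It is only an illustration: `expandBounds` and the `{north, south, east, west}` region shape are hypothetical names, and "150%" is interpreted here as scaling each side of the region by 1.5 about its center, which is one possible reading. It does not handle regions that cross the antimeridian.

```javascript
// Hypothetical helper (not from the TwitterMap code base): grow a bounding
// box about its center so that the pre-fetched region is larger than the
// region the user actually requested.
// `bounds` is assumed to be {north, south, east, west} in degrees.
function expandBounds(bounds, factor = 1.5) {
  const latCenter = (bounds.north + bounds.south) / 2;
  const lngCenter = (bounds.east + bounds.west) / 2;
  const halfHeight = ((bounds.north - bounds.south) / 2) * factor;
  const halfWidth = ((bounds.east - bounds.west) / 2) * factor;
  return {
    // Clamp to valid latitude/longitude ranges.
    north: Math.min(90, latCenter + halfHeight),
    south: Math.max(-90, latCenter - halfHeight),
    east: Math.min(180, lngCenter + halfWidth),
    west: Math.max(-180, lngCenter - halfWidth),
  };
}
```

The front-end would then request city results for `expandBounds(viewport)` instead of the viewport itself, so that small pans stay inside the cached area.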

chenlica commented 7 years ago

Good summary. Also after each step, let's merge the code to the master and push it to the live system to see the immediate improvement.

vacha19 commented 7 years ago

Design for caching the result of a query:

  1. The main idea is to cache, for a particular query keyword and time range, the city ids and the number of tweets associated with each city id.
  2. The cache structure will have two variables: one storing the keyword and the other storing the time range.
  3. The city ids and number of tweets will be stored in a HashMap as a key-value pair.
  4. At the beginning, when the cache is empty, the requested keyword and time range will be stored in these variables, and the associated city ids and results will be put into the cache structure.
  5. Now when there is another request, the cache will be checked first to see if there is any data related to the requested keyword and time range.
  6. If there is any data found, the cache will be checked to see if city ids in the region the user wants to view are present in the key-value store by doing a map lookup.
  7. If they are, the results will be returned from the cache; otherwise, the query will be sent to the middleware to fetch the results.
  8. If the requested keyword or time range differs from the one in the cache, the current cache is cleared and the request will be sent to the middleware to fetch the results.
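
The steps above can be sketched as follows. This is a minimal illustration under assumptions, not the actual implementation: the class name `QueryResultCache`, its method names, and the use of a single string for the time range are all hypothetical, and the built-in `Map` stands in for whatever HashMap the front-end ends up using.

```javascript
// Hypothetical sketch of the single-query cache described in the list above.
class QueryResultCache {
  constructor() {
    this.keyword = null;      // keyword of the cached query
    this.timeRange = null;    // time range of the cached query (format assumed)
    this.counts = new Map();  // cityId -> number of tweets
  }

  // Clear the store and remember the new query (step 8).
  reset(keyword, timeRange) {
    this.keyword = keyword;
    this.timeRange = timeRange;
    this.counts.clear();
  }

  matches(keyword, timeRange) {
    return this.keyword === keyword && this.timeRange === timeRange;
  }

  // Returns a Map of cityId -> count when every requested city is cached
  // for the same keyword and time range (steps 5 to 7); otherwise returns
  // null, meaning the query has to go to the middleware.
  lookup(keyword, timeRange, cityIds) {
    if (!this.matches(keyword, timeRange)) {
      this.reset(keyword, timeRange);
      return null;
    }
    const hit = new Map();
    for (const id of cityIds) {
      if (!this.counts.has(id)) return null;
      hit.set(id, this.counts.get(id));
    }
    return hit;
  }

  // Store results fetched from the middleware (step 4).
  // `results` is an iterable of [cityId, count] pairs.
  store(keyword, timeRange, results) {
    if (!this.matches(keyword, timeRange)) this.reset(keyword, timeRange);
    for (const [id, count] of results) this.counts.set(id, count);
  }
}
```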
waans11 commented 7 years ago

It looks good. Two minor comments:

For 2: shouldn't there be three variables: keyword, timestamp_start, and timestamp_end?
For 7: it would be faster to send the middleware only the city IDs that are not in the cache, instead of sending the entire query request.
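
A rough sketch of the second suggestion, assuming the cached counts live in a `Map` keyed by city ID. The function name `splitByCache` and the returned object shape are hypothetical.

```javascript
// Hypothetical helper: separate the requested city IDs into those that can
// be answered from the cache and those that still need a middleware query.
function splitByCache(cachedCounts, requestedCityIds) {
  const cached = new Map(); // cityId -> count, served locally
  const missing = [];       // city IDs to send to the middleware
  for (const id of requestedCityIds) {
    if (cachedCounts.has(id)) {
      cached.set(id, cachedCounts.get(id));
    } else {
      missing.push(id);
    }
  }
  return { cached, missing };
}
```

Only `missing` would be placed in the request sent to the middleware; once those results arrive, they are merged with `cached` before the map is redrawn.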

dharini-s commented 7 years ago

Updated cache design:

  1. The common module handles sending and receiving of middleware requests.
  2. The middleware typically returns 3 results for a batchJson request. Q1 is timeResult for the timeseries data, Q2 is mapResult data for storing tweet counts, Q3 is hashtagResult.
  3. Upon receiving results from the middleware, the common module updates the resultcache module with the result of Q2 only.
  4. There are 3 cases:
    • In case 1, there is a complete cache miss and the front-end sends Q1, Q2, and Q3 to the middleware.
    • In case 2, there is a partial cache hit and the front-end sends Q1, the new requests in Q2, and Q3 to the middleware.
    • In case 3, there is a complete cache hit and the front-end sends Q1 and Q3 to the middleware. The results of Q2 are obtained from the resultcache module.

Implementation details:
  5. The key-value store used is hashmap.js.
  6. If there is a change in keywords or time range, the cache is cleared and the request is treated as a complete cache miss.
  7. The store is updated only when there is a "done" message from the middleware indicating that query slicing is complete. Intermediate results are not stored.
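
The three cases and the "done"-only update could be sketched as below. This is an illustration under assumptions: `planRequests`, `onMapResult`, and the message shape `{done, rows: [{cityId, count}]}` are hypothetical and do not reflect the actual middleware protocol, and the built-in `Map` is used in place of hashmap.js.

```javascript
// Hypothetical sketch: decide which of Q1 (timeResult), Q2 (mapResult) and
// Q3 (hashtagResult) to send, given what the result cache already holds.
function planRequests(cachedCounts, requestedCityIds) {
  const missing = requestedCityIds.filter((id) => !cachedCounts.has(id));
  let kind;
  if (missing.length === 0) {
    kind = "hit";      // case 3: Q2 is answered entirely from the cache
  } else if (missing.length === requestedCityIds.length) {
    kind = "miss";     // case 1: nothing usable in the cache
  } else {
    kind = "partial";  // case 2: only the uncached cities go into Q2
  }
  const queries = kind === "hit" ? ["Q1", "Q3"] : ["Q1", "Q2", "Q3"];
  return { kind, queries, q2CityIds: missing };
}

// Hypothetical handler for a Q2 message: intermediate slices are ignored and
// the store is updated only once the middleware reports it is done.
// Returns true when the store was updated.
function onMapResult(cachedCounts, message) {
  if (!message.done) return false;
  for (const row of message.rows) cachedCounts.set(row.cityId, row.count);
  return true;
}
```

A keyword or time-range change would clear `cachedCounts` before `planRequests` is called, which makes it fall into the complete-miss case.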
chenlica commented 7 years ago

Looks very nice!

waans11 commented 7 years ago

Well designed!

waans11 commented 6 years ago

PR #406