datamade / bga-payroll

💰 How much do your public officials make?

improve the building of the search index #210

Open hancush opened 6 years ago

hancush commented 6 years ago

right now, we rebuild the whole index every time new data is uploaded. this isn't really necessary: we should only reindex an employer and its employees when new data is uploaded for that employer.
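
for reference, a rough sketch of what "reindex only the affected employers" could look like, assuming we talk to solr via pysolr; the `employer_id_s` field and the document shape are illustrative, not the app's actual schema:

```python
# minimal sketch: replace the index entries for a single employer instead of
# rebuilding the whole core. pysolr is assumed; the field name employer_id_s
# and the document shape are illustrative.
import pysolr

solr = pysolr.Solr('http://localhost:8983/solr/bga', always_commit=True)

def reindex_employer(employer_id, documents):
    '''Replace the indexed documents for one employer.

    `documents` is an iterable of dicts already shaped for the solr schema,
    e.g. the output of a (hypothetical) build_documents(employer_id) helper.
    '''
    # drop only this employer's stale documents
    solr.delete(q='employer_id_s:"{}"'.format(employer_id))

    # re-add fresh documents for the same employer
    solr.add(list(documents))
```

the upload task would call this once per employer present in the new file, leaving the rest of the index untouched.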

hancush commented 5 years ago

some additional ideas:

  1. solr add will update an item if it already exists, i.e., it may not be necessary to delete index items at all, provided we can ensure consistent primary keys between the db and the solr index. that second bit would require a refactor, and evz mentions that we employed this technique in the past with councilmatic, but that the database and index sometimes got out of sync anyway, which suggests this works differently than we understand it and some additional homework is needed. (see the first sketch after this list.)
  2. haystack's update_index can do some of this ^ heavy lifting for us. we didn't use haystack for this search index for a few reasons: there's inefficiency baked into its index-building command (https://github.com/datamade/devops/issues/42); it uses the orm, which would be quite slow for data of this app's magnitude (especially when uploading large chunks of data, though less of a problem when amending a smaller subset of employers); and we didn't anticipate needing the majority of its functionality, making it a pretty heavy dependency. we've since debugged the inefficiency in its indexing operation; however, points two and three are still an issue.
  3. we use a dockerized version of solr. rather than dropping and rebuilding the index in the existing container, build the index in a new container (or in a new volume?) and swap it in once the new index is complete. this is conceptually like the zero-downtime deployment approach. it would require some more research on docker and interacting with it from python, but of these options, it seems to come with the fewest mysteries / code changes. (see the docker sketch after this list.)
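
re: idea 1, here's what relying on solr's add-as-upsert behavior could look like, assuming pysolr and an illustrative id scheme: adding a document whose uniqueKey (`id`) already exists replaces the old document, so no explicit delete is needed as long as ids stay stable across uploads.

```python
# sketch of idea 1: adds with a stable uniqueKey act as upserts, so consistent
# ids mean we never have to delete before reindexing. pysolr is assumed; the
# 'salary.<pk>' id scheme and field names are illustrative.
import pysolr

solr = pysolr.Solr('http://localhost:8983/solr/bga', always_commit=True)

# first upload
solr.add([{'id': 'salary.1234', 'name_s': 'Jane Doe', 'salary_d': 52000.0}])

# later upload with the same id: solr overwrites the existing document
# instead of creating a duplicate, because id is the schema's uniqueKey
solr.add([{'id': 'salary.1234', 'name_s': 'Jane Doe', 'salary_d': 53500.0}])
```

the catch, per evz, is keeping those ids consistent between the db and the index, which is where the refactor comes in.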
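
re: idea 3, a very rough sketch of the container swap using the docker sdk for python; the container names, image tag, and port mapping are all assumptions, and the actual promotion step needs the research mentioned above:

```python
# sketch of idea 3: build the new index in a staging solr container, then
# swap it in for the live one. names, image tag, and ports are assumptions.
import docker

client = docker.from_env()

# stand up a fresh solr container alongside the live one
staging = client.containers.run(
    'solr:8',
    name='solr-staging',
    ports={'8983/tcp': 8984},  # temporary host port for index building
    detach=True,
)

# ... build the new index against http://localhost:8984/solr/bga ...

# once the new index is complete, retire the old container and promote
# the staging one in its place
old = client.containers.get('solr-live')
old.stop()
old.remove()
staging.rename('solr-live')
```

this is conceptually the same swap we do for zero-downtime deploys, just at the container level.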