dreamproit / bill-similarity

Calculate similarity of bill documents using a variety of NLP approaches

Estimate effort to run automated similarity calculations and save to API database #8

Open aih opened 2 years ago

aih commented 2 years ago

This issue is to estimate the effort required to:

  1. Process a bill with a similarity algorithm to return a list of similar bills in the form of a BillToBill model (e.g. on the investigate_simhashes branch: https://github.com/dreamproit/bill-similarity/pull/4/files)
  2. Save the similar bills to the BillToBill table of billtitles-py (https://github.com/dreamproit/billtitles-py/blob/main/billtitles/models.py#L72). This uses the helper function create_billtobill to save the BillToBill data: https://github.com/dreamproit/billtitles-py/blob/main/billtitles/crud.py#L123
  3. Create a pipeline to do this:
     a. Once for all bills
     b. Each time bills are updated by the uscongress bill scraper
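
To make step 1 concrete, here is a minimal sketch of a SimHash-style comparison that returns similar bills as records. It is not the code on the investigate_simhashes branch: the `BillToBillRecord` dataclass and its field names are hypothetical stand-ins for the real BillToBill model, and the tokenization, 64-bit fingerprint, and threshold are assumptions chosen for illustration.

```python
import hashlib
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class BillToBillRecord:
    """Hypothetical stand-in for the BillToBill model in billtitles-py."""
    bill: str
    similar_bill: str
    score: float


def simhash(text: str, bits: int = 64) -> int:
    """Word-level SimHash fingerprint (bits must be <= 64 in this sketch)."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (value >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def similarity(a: int, b: int, bits: int = 64) -> float:
    """Fraction of matching bits between two fingerprints (1.0 = identical)."""
    return 1.0 - bin(a ^ b).count("1") / bits


def similar_bills(bill: str, texts: Dict[str, str],
                  threshold: float = 0.8) -> List[BillToBillRecord]:
    """Return records for every other bill scoring at or above the threshold."""
    target = simhash(texts[bill])
    records = []
    for other, text in texts.items():
        if other == bill:
            continue
        # A real pipeline would cache fingerprints instead of recomputing them.
        score = similarity(target, simhash(text))
        if score >= threshold:
            records.append(BillToBillRecord(bill, other, score))
    return sorted(records, key=lambda r: r.score, reverse=True)
```

Each returned record would then be mapped onto whatever arguments create_billtobill actually expects; that mapping is left out here because it depends on the billtitles-py schema.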

Related to dreamproit/BillMap#13

This is the equivalent of the pipeline that is already working to populate the database for BillMap, described here: https://github.com/dreamproit/bill-similarity/blob/investigate_simhashes/docs/SQL_APPROACH.adoc#current-data-pipeline-and-storage

aih commented 2 years ago

For this approach, we can create a docker-compose configuration that includes:

  1. The uscongress scraper (https://github.com/unitedstates/congress/blob/main/Dockerfile)
  2. Bill similarity algorithm to calculate similar bills
  3. The billtitles-py API to store BillToBill models and to get responses

We will also use Celery processes to run these tasks.
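
As a rough sketch of how the two pipeline modes (once for all bills, then again whenever the scraper updates bills) could share one entry point, the plain function below could be wrapped in a Celery task. Detecting updates by comparing content hashes is an assumption, not something the scraper is known to provide, and `find_similar` and `save` are hypothetical callables standing in for the similarity algorithm and the billtitles-py helper.

```python
import hashlib
from typing import Callable, Dict, Iterable, List, Tuple


def content_hash(text: str) -> str:
    """Stable hash of a bill's text, used to detect changes between runs."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def changed_bills(texts: Dict[str, str], seen: Dict[str, str]) -> List[str]:
    """Bills that are new or whose text differs from the last processed version."""
    return [bill for bill, text in texts.items()
            if seen.get(bill) != content_hash(text)]


def run_pipeline(
    texts: Dict[str, str],
    seen: Dict[str, str],
    find_similar: Callable[[str], Iterable[Tuple[str, float]]],
    save: Callable[[str, str, float], None],
) -> List[str]:
    """Process new or changed bills; with an empty `seen` this is the full run."""
    todo = changed_bills(texts, seen)
    for bill in todo:
        for similar_bill, score in find_similar(bill):
            save(bill, similar_bill, score)
        # Record the processed version so later runs skip unchanged bills.
        seen[bill] = content_hash(texts[bill])
    return todo
```

Note that this only recomputes results for the changed bills themselves. Unchanged bills that were previously similar to a changed bill may hold stale scores, so a real pipeline would also need to decide whether to refresh those.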

aih commented 2 years ago

We would implement this in stages:

  1. Run the similarity algorithm for all bills and load the results into the billtitles-py API
  2. Compare the results in the billtitles-py API against the results for BillMap to test a) the performance of the bill-similarity algorithm and b) its accuracy in finding similar bills
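
One possible way to quantify stage 2 is sketched below: set-overlap metrics for accuracy and a small timing helper for performance. Treating the BillMap results as the reference set is an assumption, and the function names are hypothetical.

```python
import time
from typing import Any, Callable, Dict, Set, Tuple


def overlap_metrics(candidate: Set[str], reference: Set[str]) -> Dict[str, float]:
    """Compare similar-bill sets for one bill, with `reference` as the baseline."""
    shared = candidate & reference
    union = candidate | reference
    return {
        # Share of candidate results that the reference also lists.
        "precision": len(shared) / len(candidate) if candidate else 0.0,
        # Share of reference results that the candidate found.
        "recall": len(shared) / len(reference) if reference else 0.0,
        # Symmetric overlap; two empty sets count as full agreement.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


def timed(func: Callable[..., Any], *args: Any) -> Tuple[Any, float]:
    """Run func(*args) and return its result with the elapsed seconds."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start
```

For example, `overlap_metrics(new_results, billmap_results)` could be averaged over a sample of bills, while `timed` wraps the similarity call to compare run times per bill.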