Estimate effort to run automated similarity calculations and save to API database

aih commented 2 years ago

This issue is to estimate the effort required to:

1. Process a bill with a similarity algorithm to return a list of similar bills as BillToBill models (e.g. on the investigate_simhashes branch: https://github.com/dreamproit/bill-similarity/pull/4/files)
2. Save the similar bills to the BillToBill table of billtitles-py (https://github.com/dreamproit/billtitles-py/blob/main/billtitles/models.py#L72), using the helper function create_billtobill to save the BillToBill data: https://github.com/dreamproit/billtitles-py/blob/main/billtitles/crud.py#L123
3. Create a pipeline to do this:
   a. Once for all bills
   b. Each time bills are updated by the uscongress bill scraper
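To help size the first step, here is a minimal sketch of a simhash-style comparison that produces BillToBill-like records. It is a generic simhash over word 3-grams, not the code on the investigate_simhashes branch. The `BillToBillRecord` fields, the `similar_bills` helper, and the 0.8 threshold are all assumptions made for illustration; the real model and the signature of create_billtobill may differ.

```python
import hashlib
import re
from dataclasses import dataclass


@dataclass
class BillToBillRecord:
    # Hypothetical stand-in for the BillToBill model; field names are assumed.
    bill_id: str
    bill_to_id: str
    score: float


def simhash(text: str, bits: int = 64) -> int:
    """Compute a simhash fingerprint from word 3-grams (shingles)."""
    words = re.findall(r"\w+", text.lower())
    shingles = [" ".join(words[i:i + 3]) for i in range(max(1, len(words) - 2))]
    weights = [0] * bits
    for shingle in shingles:
        digest = hashlib.md5(shingle.encode("utf-8")).digest()
        value = int.from_bytes(digest[: bits // 8], "big")
        # Each shingle votes +1 or -1 on every bit position.
        for bit in range(bits):
            weights[bit] += 1 if (value >> bit) & 1 else -1
    return sum(1 << bit for bit in range(bits) if weights[bit] > 0)


def similarity(hash_a: int, hash_b: int, bits: int = 64) -> float:
    """Share of fingerprint bits that agree (1.0 means identical hashes)."""
    return 1.0 - bin(hash_a ^ hash_b).count("1") / bits


def similar_bills(bill_id: str, texts: dict, threshold: float = 0.8) -> list:
    """Return records for every other bill whose score meets the threshold."""
    hashes = {key: simhash(text) for key, text in texts.items()}
    records = [
        BillToBillRecord(bill_id, other, similarity(hashes[bill_id], value))
        for other, value in hashes.items()
        if other != bill_id
    ]
    kept = [record for record in records if record.score >= threshold]
    return sorted(kept, key=lambda record: -record.score)
```

Each returned record would then be passed to the save helper in billtitles-py. Comparing every bill against every other bill is quadratic, so the estimate for the all-bills run should account for that or for an indexing scheme that avoids it.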

Related to dreamproit/BillMap#13

aih commented 2 years ago

For this approach, we can create a docker-compose setup that includes:

- The uscongress scraper (https://github.com/unitedstates/congress/blob/main/Dockerfile)
- The bill-similarity algorithm, to calculate similar bills
- The billtitles-py API, to store BillToBill models and to serve responses

We will also use Celery workers to run these tasks.
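The two pipeline modes (all bills once, then only updated bills) could share one code path. The sketch below is dependency-free Python under stated assumptions: `run_pipeline` and `changed_bills` are hypothetical names, `find_similar` and `save` stand in for the similarity algorithm and the billtitles-py save helper, and detecting updates by comparing content digests is an assumption about how scraper output might be tracked. In the Celery setup, each call would presumably be wrapped as a task; that wiring is not shown here.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Pair = Tuple[str, str, float]  # (bill_id, similar_bill_id, score)


def run_pipeline(
    bill_ids: Iterable[str],
    find_similar: Callable[[str], List[Tuple[str, float]]],
    save: Callable[[Pair], None],
) -> int:
    """Compute and store similar bills for each bill ID; return rows saved."""
    saved = 0
    for bill_id in bill_ids:
        for other_id, score in find_similar(bill_id):
            save((bill_id, other_id, score))
            saved += 1
    return saved


def changed_bills(previous: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Bills that are new or whose content digest changed since the last scrape."""
    return sorted(
        bill_id
        for bill_id, digest in current.items()
        if previous.get(bill_id) != digest
    )
```

A full run would pass every known bill ID to `run_pipeline`, while a run after each scrape would pass only `changed_bills(previous, current)`.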

aih commented 2 years ago

We would implement this in stages:

1. Run the similarity algorithm for all bills and load the results into the billtitles-py API
2. Compare the results in the billtitles-py API against the results from BillMap to test (a) the performance of the bill-similarity algorithm and (b) its accuracy in finding similar bills
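One possible way to run the comparison in the second stage is sketched below. It treats the BillMap results as the reference set, which is an assumption, and reports average precision and recall together with elapsed time. The names `precision_recall` and `evaluate`, and the shape of the `baseline` mapping (bill ID to the set of similar bill IDs from BillMap), are hypothetical.

```python
import time
from typing import Callable, Dict, Set, Tuple


def precision_recall(found: Set[str], expected: Set[str]) -> Tuple[float, float]:
    """Precision and recall of one bill's similar-bill set against a reference."""
    overlap = len(found & expected)
    precision = overlap / len(found) if found else 0.0
    recall = overlap / len(expected) if expected else 0.0
    return precision, recall


def evaluate(
    find_similar: Callable[[str], Set[str]],
    baseline: Dict[str, Set[str]],
) -> Dict[str, float]:
    """Average precision and recall over the baseline, plus elapsed seconds."""
    start = time.perf_counter()
    scores = [
        precision_recall(find_similar(bill_id), expected)
        for bill_id, expected in baseline.items()
    ]
    elapsed = time.perf_counter() - start
    count = len(scores) or 1  # avoid dividing by zero on an empty baseline
    return {
        "precision": sum(p for p, _ in scores) / count,
        "recall": sum(r for _, r in scores) / count,
        "seconds": elapsed,
    }
```

Here `find_similar` would query the billtitles-py API for a bill's stored BillToBill rows, so the timing covers retrieval as well as any computation behind it.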
