Aron already stubbed this out, so we should just need to fill in the code in tasks.py.
List of tasks I can think of:
Adding new nonprofit
Given new EIN, add Guidestar info from search API and exchange API to nonprofits table. Includes financial info and board member info.
Update nonprofits table with missing nonprofit Twitter names
Look up missing Twitter IDs corresponding to known nonprofit Twitter names and store in nonprofits table.
Look up followers for nonprofit and store in DB. Should looking up followers and tweets happen synchronously? Is the API rate limit distinct for each, or is it combined?
Look up tweets for nonprofit and store in DB.
Calculate similarity scores using tweets.
Calculate community membership using tweets.
Look up homepage for nonprofit and store in DB.
Calculate similarity scores using nonprofit homepage.
Calculate community membership using nonprofit homepage similarities.
Look up news articles for nonprofit, search for company names, store in DB.
Calculate similarity scores using descriptions.
Calculate community membership using description similarity.
???
Updating existing nonprofit
Given existing EIN, update with Guidestar info from search API and exchange API. Includes financial info and board member info.
Update nonprofits table with missing nonprofit Twitter names
Look up missing Twitter IDs corresponding to known nonprofit Twitter names and store in nonprofits table.
Look up followers for nonprofit and store in DB. Should looking up followers and tweets happen synchronously? Is the API rate limit distinct for each, or is it combined?
Look up tweets for nonprofit and store in DB.
Calculate similarity scores using tweets.
Calculate community membership using tweets.
Look up homepage for nonprofit and store in DB.
Calculate similarity scores using nonprofit homepage.
Calculate community membership using nonprofit homepage similarities.
Look up news articles for nonprofit, search for company names, store in DB.
Calculate similarity scores using descriptions.
Calculate community membership using description similarity.
Some state is being transferred from one task to another via the DB. We could optimize by still storing the state in the DB, but also passing it directly in memory to the next task so that task doesn't have to read it back.
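A minimal sketch of that idea, assuming a `tweets` table keyed by EIN (the table layout and both function names are made up for illustration, and the term counting is only a stand-in for the real similarity input). The first task writes to the DB and also returns what it wrote; the second task uses the in-memory copy when it gets one and falls back to a SELECT otherwise.

```python
import sqlite3
from collections import Counter


def store_tweets(conn, ein, tweets):
    """Persist tweets, then return them so the caller can hand them on."""
    conn.executemany(
        "INSERT INTO tweets (ein, text) VALUES (?, ?)",
        [(ein, text) for text in tweets],
    )
    conn.commit()
    return tweets


def tweet_term_counts(conn, ein, tweets=None):
    """Count terms in a nonprofit's tweets.

    Uses the in-memory tweets when given; otherwise reads them from the DB.
    """
    if tweets is None:
        rows = conn.execute(
            "SELECT text FROM tweets WHERE ein = ? ORDER BY rowid", (ein,)
        )
        tweets = [text for (text,) in rows]
    return Counter(word.lower() for text in tweets for word in text.split())


# Hypothetical schema, in-memory DB for the sake of the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (ein TEXT, text TEXT)")
```

The DB stays the source of truth, so a task that is retried or run on its own later still works without the in-memory argument.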
Adding new nonprofits and updating existing nonprofits involve overlapping tasks but with slightly different dependencies.
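One way to write the two dependency sets down is as a graph, sketched below with the standard library's `graphlib` (Python 3.9+). The task names are hypothetical, and the edges are my reading of the lists above rather than anything confirmed in tasks.py. The only difference modeled between the two flows is that the Guidestar lookup has to run first for a new nonprofit.

```python
from graphlib import TopologicalSorter

# Hypothetical task names. Each task maps to the set of tasks it depends on.
COMMON_DEPS = {
    "update_twitter_names": set(),
    "lookup_twitter_ids": {"update_twitter_names"},
    "lookup_followers": {"lookup_twitter_ids"},
    "lookup_tweets": {"lookup_twitter_ids"},
    "tweet_similarity": {"lookup_tweets"},
    "tweet_communities": {"tweet_similarity"},
    "lookup_homepage": set(),
    "homepage_similarity": {"lookup_homepage"},
    "homepage_communities": {"homepage_similarity"},
    "lookup_news": set(),
    "description_similarity": set(),
    "description_communities": {"description_similarity"},
}


def build_graph(new_nonprofit):
    """Return the dependency graph for adding (True) or updating (False)."""
    graph = {task: set(deps) for task, deps in COMMON_DEPS.items()}
    graph["guidestar_lookup"] = set()
    if new_nonprofit:
        # Assumption: nothing else can start until the Guidestar row exists.
        for task in COMMON_DEPS:
            graph[task].add("guidestar_lookup")
    return graph


def run_order(new_nonprofit):
    """Return one valid execution order for the chosen flow."""
    return list(TopologicalSorter(build_graph(new_nonprofit)).static_order())
```

Calling `run_order(True)` always puts `guidestar_lookup` first; `run_order(False)` leaves it independent of the other tasks.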
One approach that might make sense is to have adding a new nonprofit be event-driven, but updating existing nonprofits happen at regular intervals. Also need to make sure tasks running for a new nonprofit don't conflict with regularly-running tasks for updating existing info.
Another option is to have adding a new nonprofit just update the nonprofits table. Then, the regularly-running tasks for updating existing nonprofits just include that new row in their SELECTs, and its info gets updated along with every other nonprofit's info. This is a simpler approach, but means waiting longer for new nonprofits to get their info retrieved. EDIT: This is actually OK, the only dependency difference is that when adding a new nonprofit, the Guidestar lookup has to happen first, which is pretty fast.
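A rough sketch of this second option, assuming a `nonprofits` table with `ein`, `name`, and `homepage` columns. The schema, the function names, and the `guidestar_lookup` / `find_homepage` callables are placeholders for illustration, not the real API wrappers. Adding a nonprofit only does the fast Guidestar step and inserts the row; the periodic task then picks the row up like any other.

```python
import sqlite3


def make_db():
    """Create an in-memory DB with a hypothetical nonprofits table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE nonprofits (ein TEXT PRIMARY KEY, name TEXT, homepage TEXT)"
    )
    return conn


def add_nonprofit(conn, ein, guidestar_lookup):
    """Event-driven part: do the Guidestar lookup and insert the row."""
    info = guidestar_lookup(ein)
    conn.execute(
        "INSERT INTO nonprofits (ein, name) VALUES (?, ?)", (ein, info["name"])
    )
    conn.commit()


def refresh_homepages(conn, find_homepage):
    """Periodic part: update every row, including newly added ones."""
    eins = [ein for (ein,) in conn.execute("SELECT ein FROM nonprofits")]
    for ein in eins:
        conn.execute(
            "UPDATE nonprofits SET homepage = ? WHERE ein = ?",
            (find_homepage(ein), ein),
        )
    conn.commit()
    return eins
```

Because the periodic task never special-cases new rows, there is only one code path to keep free of conflicts; the cost is the wait until the next scheduled run.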
Some of the Twitter stuff is complicated by the fact that Twitter names can change, but Twitter user IDs are constant. So really we should always be looking stuff up in the Twitter API by user ID. However, maybe we don't care about this edge case, which would simplify things a little bit.
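If we do decide to handle renames, something like the following might be enough. It is a sketch that assumes each nonprofits row is available as a dict with `twitter_id` and `twitter_name` keys (hypothetical column names), and that the API call accepts either a `user_id` or a `screen_name` parameter and returns `id_str` and `screen_name` fields; those names should be checked against whichever Twitter client we end up using.

```python
def twitter_lookup_params(row):
    """Build lookup parameters, preferring the stable user ID over the name."""
    twitter_id = row.get("twitter_id")
    if twitter_id:
        return {"user_id": str(twitter_id)}
    name = row.get("twitter_name")
    if name:
        # Stored names may or may not carry a leading "@".
        return {"screen_name": name.lstrip("@")}
    return None  # Nothing to look up yet.


def reconcile_twitter_name(row, api_user):
    """Return a copy of the row with ID and name refreshed from the API."""
    updated = dict(row)
    updated["twitter_id"] = api_user["id_str"]
    updated["twitter_name"] = api_user["screen_name"]
    return updated
```

With this, the name is only used once, to discover the ID; every later lookup goes by ID and quietly corrects the stored name if it has changed.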