cr8ivecodesmith / marketsnitch

#KabantayNgBayan Hackathon Entry

Script for Periodically Updating Data #28

Open codemickeycode opened 9 years ago

codemickeycode commented 9 years ago

Create a facility for periodically fetching and updating Organization, Awards, Bidders List, Bid Line Item, and Bid Information data from the PhilGEPS Public Data.

codemickeycode commented 9 years ago

Celery task for processing the data: it will take care of downloading the CSVs from the PhilGEPS endpoint on a periodic basis.
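
As a rough sketch of what the periodic part might look like once Celery is set up, here is a beat schedule entry for the Django settings. The task path `marketsnitch.tasks.fetch_philgeps_data` and the daily interval are placeholders, not something decided in this thread, and the `CELERY_BEAT_SCHEDULE` name assumes Celery is configured to read Django settings with the `CELERY` namespace.

```python
from datetime import timedelta

# Hypothetical beat schedule entry: the task path and the interval are
# placeholders chosen for illustration only.
CELERY_BEAT_SCHEDULE = {
    "fetch-philgeps-data": {
        "task": "marketsnitch.tasks.fetch_philgeps_data",
        "schedule": timedelta(days=1),  # run once a day
    },
}
```

The actual interval should probably follow how often the PhilGEPS data is refreshed, which still needs to be checked.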

codemickeycode commented 9 years ago

TODO: research Celery and django-celery

cr8ivecodesmith commented 9 years ago

The API from data.gov.ph seems to be working. We can access the API with a URL like:

http://api.data.gov.ph/catalogue/api/action/datastore_search?resource_id=314aa773-e6e4-4554-80ce-4f588212e0d1&limit=1
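
For reference, the URL above can be assembled from its parts with the standard library. This is only a sketch: `build_search_url` is a hypothetical helper name, and the base URL is copied from the example above.

```python
from urllib.parse import urlencode

# Base URL copied from the example URL above.
BASE_URL = "http://api.data.gov.ph/catalogue/api/action/datastore_search"


def build_search_url(resource_id, limit=1):
    """Return a datastore_search URL for one resource (hypothetical helper)."""
    query = urlencode({"resource_id": resource_id, "limit": limit})
    return f"{BASE_URL}?{query}"


print(build_search_url("314aa773-e6e4-4554-80ce-4f588212e0d1"))
```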

Each table corresponds to a particular resource. Click the "more information" link on each table to find its resource ID here: http://data.gov.ph/catalogue/dataset/philgeps-public-data

The solution is to access 2 endpoints:

  1. datastore_search - The response from this endpoint contains a total key that gives the number of records for the resource.
  2. datastore_search_sql - This lets us query the resource with an SQL range, e.g.:
SELECT * FROM "_<resource_id>_" WHERE _id BETWEEN 1 AND 1000

This will allow us to loop through the resource up to the total number of records.
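
The two steps above could be sketched roughly as follows. `read_total`, `id_ranges`, and `build_sql` are hypothetical helpers, and the sample payload is made up for illustration. The sketch assumes that the `total` key sits under `result` in the JSON response and that `_id` values run from 1 to `total` without gaps.

```python
def read_total(payload):
    """Pull the record count out of a datastore_search response.

    Assumes the count is at payload["result"]["total"].
    """
    return payload["result"]["total"]


def id_ranges(total, page_size=1000):
    """Yield inclusive (start, end) _id ranges that cover 1..total."""
    for start in range(1, total + 1, page_size):
        yield start, min(start + page_size - 1, total)


def build_sql(resource_id, start, end):
    """Build the range query for datastore_search_sql."""
    return f'SELECT * FROM "{resource_id}" WHERE _id BETWEEN {start} AND {end}'


# Made-up payload, shaped like the assumed response.
sample = {"success": True, "result": {"total": 2500, "records": []}}

for start, end in id_ranges(read_total(sample)):
    print(build_sql("<resource_id>", start, end))
```

With a total of 2500 and a page size of 1000, this produces three queries; the last one stops at 2500 rather than 3000.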

I suggest using Django's built-in Paginator class to loop through the resource, as it is quite memory efficient. A sample usage would look like this:

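    import requests
    from django.core.paginator import Paginator

    # Assumed endpoint, following the datastore_search URL above.
    url = 'http://api.data.gov.ph/catalogue/api/action/datastore_search_sql'


    def process(ids):
        # Fetch the rows whose _id falls within this page's range.
        sql = 'SELECT * FROM "adfasfasdf" WHERE _id BETWEEN {} AND {}'
        params = {
            'sql': sql.format(ids[0], ids[-1])
        }
        res = requests.get(url, params=params)
        # ....


    # max_id is the "total" value returned by datastore_search.
    # Passing the range directly avoids building the full list of ids in memory.
    paginator = Paginator(range(1, max_id + 1), 1000)

    for page_num in paginator.page_range:
        page = paginator.page(page_num)
        process(page.object_list)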