CivicDataLab / IDS-DRR-Assam

Intelligent Data Solution - Disaster Risk Reduction is a system to assist flood management in the state of Assam through data-driven ways. The repository contains codes to extract relevant datasets and the modelling approach used to calculate Risk Scores for each revenue circle in Assam.
GNU Affero General Public License v3.0

IDS-DRR Tenders Source Pipeline #10

Closed d-saikrishna closed 8 months ago

d-saikrishna commented 1 year ago

Sub-thread of #8 for Assam Tenders

@shreyaagrawal0809 will provide the scraper scripts - updated with a column for time of scraping

I will provide scripts for data transformation.

d-saikrishna commented 1 year ago

@shreyaagrawal0809 could you add the scraper code here: Link

With the output of the scraper code here: Link

I've added two scripts:

  1. flood_tenders.py: Identifies flood-related tenders among all tenders and extracts the relevant metadata.
  2. geocode_district.py: Geocodes the district of each tender.

I will add one more script for geocoding the revenue circle of each tender.

We are not able to geocode the district for all tenders, so we have to decide on a methodology for dealing with the remaining ones. Previously, we geocoded those tenders manually.

apoorv74 commented 1 year ago

@d-saikrishna - let's have a call to discuss the geocoding part.

What percentage of tenders were you able to geocode?

d-saikrishna commented 1 year ago

For ~3000 of ~4000 tenders, a single district is identified. For another ~350 tenders, multiple districts are identified. These are 'CONFLICT' tenders.
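To make the three outcomes concrete (one district, no district, CONFLICT), here is a minimal sketch of name-based district identification. It is not the code in geocode_district.py: the function name `identify_district`, the short `DISTRICTS` list, and the whole-word matching rule are assumptions made for illustration.

```python
import re

# Hypothetical subset of Assam district names, for illustration only.
DISTRICTS = ["Barpeta", "Cachar", "Dhemaji", "Nagaon"]

def identify_district(text, districts=DISTRICTS):
    """Return the one district named in `text`, 'CONFLICT' if several are named, else None."""
    found = [
        d for d in districts
        # Whole-word, case-insensitive match of the district name.
        if re.search(rf"\b{re.escape(d)}\b", text, flags=re.IGNORECASE)
    ]
    if not found:
        return None          # district could not be geocoded
    if len(found) == 1:
        return found[0]      # unambiguous match
    return "CONFLICT"        # more than one district mentioned

print(identify_district("Repair of embankment at Barpeta"))
print(identify_district("Anti-erosion work in Dhemaji and Nagaon"))
print(identify_district("Supply of sand bags"))
```

Under this sketch, a tender whose text names two districts is flagged rather than assigned arbitrarily, which matches the CONFLICT label used above.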

For revenue circle identification, the number will be lower. I will update.

d-saikrishna commented 1 year ago

We need to decide on the following with respect to geotagging tenders:

  1. How to resolve CONFLICT Districts after geotagging districts?
  2. Can we use location column to geotag revenue circle?
  3. Can we use fuzzy matching to geotag? Even if we use any AI for it, the output will be probabilistic.
  4. Will manual steps be part of the pipeline?

Meanwhile, I'm trying to expand the villages dataset by combining other sources so that there can be more absolute matches.
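As a reference point for the fuzzy matching question in point 3, here is a small sketch using `difflib` from the Python standard library. The names in `KNOWN_NAMES`, the helper `fuzzy_match`, and the 0.8 cutoff are assumptions chosen for illustration, not values used in this pipeline. The result depends on the cutoff, which is why a review step may still be needed.

```python
from difflib import get_close_matches

# Hypothetical reference names; the real list would come from the villages / revenue circle dataset.
KNOWN_NAMES = ["Barpeta", "Dhemaji", "Nagaon"]

def fuzzy_match(name, candidates=KNOWN_NAMES, cutoff=0.8):
    """Return the closest known name above `cutoff` similarity, or None if nothing is close enough."""
    lowered = {c.lower(): c for c in candidates}  # compare case-insensitively
    hits = get_close_matches(name.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None

print(fuzzy_match("Barpetta"))   # a misspelling close to a known name
print(fuzzy_match("Guwahati"))   # not in the reference list
```

Raising the cutoff trades fewer false matches for more unmatched tenders, so the threshold itself is a methodology decision.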

d-saikrishna commented 1 year ago

Number of tenders whose revenue circle could not be geo-tagged: 1192

This number can be reduced by:

  1. Using location column
  2. Resolving CONFLICT districts
  3. Using fuzzy matching
  4. Manual geotagging

d-saikrishna commented 1 year ago

New logic for geo-tagging revenue circles:

  1. tender_revenueci column is based on title, work description and extReference ID columns
  2. tender_revenueci_location column is based on location column
  3. If the RC identified in the tender_revenueci_location column is a headquarters (HQ), we flag it accordingly in the HQ_flag column.

The FINAL RC is then decided as follows:

  1. IF tender_revenueci_location remains null, THEN the RC in tender_revenueci is decided as FINAL.
  2. IF HQ_flag == False, THEN the RC in tender_revenueci_location is decided as FINAL.
  3. IF (HQ_flag == True) AND (tender_revenueci_location == tender_revenueci), THEN the RC in tender_revenueci_location is decided as FINAL.

Yet to decide: IF (HQ_flag == True) AND (tender_revenueci_location != tender_revenueci).
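The rules above can be sketched as a single function. This is an illustrative sketch only: the function name `final_revenue_circle`, the use of `None` for a null value, and the `"UNDECIDED"` placeholder for the open case are assumptions, not part of the existing scripts.

```python
def final_revenue_circle(rc_text, rc_location, hq_flag):
    """Pick the FINAL revenue circle (RC) for one tender.

    rc_text     -- value of tender_revenueci (from title, work description, extReference ID)
    rc_location -- value of tender_revenueci_location (from the location column), None if null
    hq_flag     -- value of HQ_flag: True if rc_location is a headquarters
    """
    if rc_location is None:
        return rc_text                 # no location-based RC: fall back to the text-based RC
    if not hq_flag:
        return rc_location             # location-based RC is not an HQ: trust it
    if rc_location == rc_text:
        return rc_location             # HQ, but both sources agree
    return "UNDECIDED"                 # HQ and sources disagree: rule not yet decided

# The RC names below are placeholders, not real revenue circles.
print(final_revenue_circle("RC_A", None, False))
print(final_revenue_circle("RC_A", "RC_B", False))
print(final_revenue_circle("RC_A", "RC_A", True))
print(final_revenue_circle("RC_A", "RC_B", True))
```

Keeping the undecided case as an explicit return value makes those tenders easy to count and revisit once the rule is settled.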

d-saikrishna commented 1 year ago

A few decisions taken on the TENDERS data source:

  1. In the model, we are considering only tenders that were awarded (AOC tenders). Previously we took all tenders, even cancelled ones. Accordingly, I'm scraping only AOC tenders from the website.
  2. Tenders are stored month-wise.

New tender stats, accordingly. For the AOC tenders scraped between April 2016 and September 2023:

  1. 17965 tenders were awarded.
  2. Total number of flood related tenders: 2368
  3. Number of tenders whose district could not be geo-tagged: 385
  4. Number of tenders whose district identification is a CONFLICT: 119
  5. Number of tenders whose revenue circle could not be geo-tagged: 684 (includes the 385 + 119 = 504 tenders for which the district could not be identified)

d-saikrishna commented 1 year ago

Biswajeet: The geo-tagging exercise for 2023 tenders is complete. Total missing RCs - 84. RCs geotagged manually - 68. Tenders for which RCs cannot be determined - 16.

We should now create variables from the tenders data source.

d-saikrishna commented 1 year ago

All variables for TENDERS processed until September 2023

d-saikrishna commented 10 months ago

Need to reclassify tenders:

https://docs.google.com/document/d/1hO53-Fw-oXV1knHsirKu-r_kbcQMcz38O2bipbDXmGM/edit#heading=h.2xi90qidppty