Open kaerumy opened 7 years ago
As of today, I have an idea for the data modeling of this project.
Each object from the jkr_jsonl awards (keputusan tender, i.e. tender results) will be split into 3 collections (awards, buyers, sellers) in MongoDB. The objects will be linked to each other by copying the id of each inserted object into the related objects in the other collections. I assume "offering_office" is the "buyer" and "contractor" is the "seller".
So far I have only tested code that gets the object id as a string (instead of a pymongo object). I have yet to test the idea above any further, since the implementation is not straightforward. I will probably work on pseudo code first.
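A rough sketch of the linking idea, using plain dictionaries as stand-ins for the three MongoDB collections. The `new_id` helper, the `award_ids` / `buyer_id` / `seller_id` field names and the `title` field are assumptions for illustration only; real code would insert through pymongo and use the ids it returns.

```python
from itertools import count

# In-memory stand-ins for the three collections (assumption: the real
# project would use pymongo collections instead of dicts).
awards, buyers, sellers = {}, {}, {}
_counter = count(1)


def new_id():
    """Return a hypothetical string id, standing in for a MongoDB object id."""
    return f"id{next(_counter)}"


def split_award(record):
    """Split one source record into linked award, buyer and seller objects."""
    award_id, buyer_id, seller_id = new_id(), new_id(), new_id()
    # Buyer and seller each keep a copy of the award id.
    buyers[buyer_id] = {
        "_id": buyer_id,
        "name": record["offering_office"],
        "award_ids": [award_id],
    }
    sellers[seller_id] = {
        "_id": seller_id,
        "name": record["contractor"],
        "award_ids": [award_id],
    }
    # The award keeps every other field plus the ids of its buyer and seller.
    rest = {k: v for k, v in record.items()
            if k not in ("offering_office", "contractor")}
    awards[award_id] = {"_id": award_id, "buyer_id": buyer_id,
                        "seller_id": seller_id, **rest}
    return award_id


record = {"offering_office": "JKR Perak",
          "contractor": "Arida Usaha Niaga",
          "title": "Example tender"}  # "title" is a made-up field
award_id = split_award(record)
```

Note that this sketch creates a new buyer and seller for every record; reusing an existing buyer or seller would first require a reliable way to tell that two names refer to the same organization.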
The source data contains many bad entries: not just inconsistent letter case, but also assorted typos. The following are the problems I have observed and briefly identified.
Inconsistent lettercase (Fix level: easy)
offering_office: Pejabat Jurutera Daerah Seberang Perai Utara.
contractor:
offering_office: PEJABAT JURUTERA DAERAH SEBERANG PERAI UTARA.
contractor:
Inconsistent lettercase with extra character (Fix level: relatively easy)
offering_office: Ibu Pejabat JKR Pahang (Bhgn Kontrak & Ukur Bahan) <--
contractor: Olak Timur Resources Sdn Bhd
offering_office: IBU PEJABAT JKR PAHANG (BHGN.KONTRAK & UKUR BAHAN) <--
contractor: Sinaran Kembar Enterprise
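For the two "easy" cases above, one possible approach is to compare names through a normalized key rather than the raw string. This is only a sketch: `normalize_name` is a hypothetical helper, and the choice to keep "&" while dropping other punctuation is an assumption.

```python
import re


def normalize_name(raw):
    """Return a comparison key: case-folded, punctuation and extra spaces removed."""
    text = raw.casefold()
    # Replace every run of characters that is neither a word character nor "&"
    # with a space, so "BHGN.KONTRAK" and "Bhgn Kontrak" end up identical.
    text = re.sub(r"[^\w&]+", " ", text)
    # Collapse repeated whitespace and trim both ends.
    return " ".join(text.split())


a = normalize_name("Ibu Pejabat JKR Pahang (Bhgn Kontrak & Ukur Bahan)")
b = normalize_name("IBU PEJABAT JKR PAHANG (BHGN.KONTRAK & UKUR BAHAN)")
same_office = a == b
```

The key is meant for comparison only; the original spelling should still be stored so a cleaner display name can be chosen later.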
Typo and additional newline (Fix level: ?)
offering_office: Pejabat Jurutera Daerah Seberang Perai Utara.
contractor:
offering_office: Pejabat Jurutera Daerah <-- apparently contains a newline?
Seberang Perai Utrara. <-- correct name is "Utara"
contractor:
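Typos such as "Utrara" cannot be fixed by normalization alone. One option, sketched here with the standard library's `difflib`, is to flag pairs of names that are nearly identical so a person can review them. The `similarity` helper and the 0.95 threshold are assumptions, not tested values.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0..1 similarity ratio, ignoring case and whitespace differences."""
    # str.split() with no arguments also splits on newlines, so the stray
    # line break in the second entry disappears here.
    key_a = " ".join(a.casefold().split())
    key_b = " ".join(b.casefold().split())
    return SequenceMatcher(None, key_a, key_b).ratio()


clean = "Pejabat Jurutera Daerah Seberang Perai Utara."
dirty = "Pejabat Jurutera Daerah\nSeberang Perai Utrara."
score = similarity(clean, dirty)

# Hypothetical threshold: pairs above it are candidates for manual review.
needs_review = score > 0.95
```

A high score only suggests that two entries might be the same office; it should not be used to merge them automatically.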
Inconsistent abbreviations (Fix level: ?)
offering_office: J.K.R PERAK <-- JKR or J.K.R?
contractor: SHAHI SDN BHD
offering_office: JKR Perak <-- JKR right?
contractor: Arida Usaha Niaga
offering_office: JKR NEGERI PERAK D.RIDZUAN <-- not Perak but Perak Darul Ridzuan?
contractor: ZAINAL AB CONSTRUCTION
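Dotted initials like "J.K.R" can be collapsed mechanically, but deciding whether "JKR Perak" and "JKR NEGERI PERAK D.RIDZUAN" are the same office needs a human decision. The sketch below keeps the two steps apart. The `ALIASES` table is purely illustrative: treating these variants as one office is an unconfirmed assumption.

```python
import re


def collapse_dotted_abbreviation(text):
    """Turn dotted initials such as "J.K.R" into "JKR"; leave other text alone."""
    return re.sub(r"\b(?:[A-Za-z]\.)+[A-Za-z]\b",
                  lambda m: m.group(0).replace(".", ""),
                  text)


# ASSUMPTION: hand-maintained table of variants that someone has confirmed
# refer to the same office. These two entries are examples, not confirmed facts.
ALIASES = {
    "jkr perak": "JKR Perak",
    "jkr negeri perak d.ridzuan": "JKR Perak",
}


def canonical_office(raw):
    """Return the canonical office name, or None when no decision exists yet."""
    key = " ".join(collapse_dotted_abbreviation(raw).casefold().split())
    return ALIASES.get(key)


results = [canonical_office(name) for name in
           ("J.K.R PERAK", "JKR Perak", "JKR NEGERI PERAK D.RIDZUAN", "JKR Selangor.")]
```

Returning `None` for unknown names makes the undecided cases easy to list and review instead of silently guessing.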
Inconsistent addresses (Fix level: ?)
offering_office: Cawangan Kejuruteraan Elektrik JKR Kedah Darul Aman,Alor Star,Kedah DA. <--
contractor: Iktismel Asasi Engineering Sdn Bhd
offering_office: Cawangan Kejuruteraan Elektrik JKR Kedah,05582 Alor Star,Kedah Darul Aman. <--
contractor: Iktisas Asasi Engineering Sdn Bhd
Inconsistent names of branch (Fix level: ?)
offering_office: JKR Selangor. <-- Selangor right?
contractor: SISJ Usaha Niaga Sdn Bhd
offering_office: Bahagian Kontrak & Ukur Bahan,JKR Negeri Selangor. <-- dept. name first?
contractor: CGE Construction Sdn. Bhd.
offering_office: JKR Selangor Darul Ehsan. <-- wait, Selangor or Selangor Darul Ehsan?
contractor: Cendekia Teknik Sdn Bhd
offering_office: JKR Negeri Selangor. <-- Negeri Selangor or Selangor?
contractor: Eureka Cekap Sdn Bhd
offering_office: JKR SELANGOR (JKR SABAK BERNAM) <-- is this still Selangor or different one?
contractor: HNS Teguh Resources
Invalid contractor name (Fix level: ?)
offering_office: Ibu Pejabat JKR Pahang, (Bhgn. Kontrak & Ukur Bahan).
contractor: Tender Semula <-- is this a remark?
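One possible way to handle such entries, sketched under the assumption that "Tender Semula" (Malay for "re-tender") and "Tender Dibatalkan" ("tender cancelled") are status remarks rather than company names. The `PLACEHOLDERS` set and the `is_placeholder` helper are hypothetical and would need checking against the source data.

```python
import re

# ASSUMPTION: contractor values that are remarks, not real company names.
PLACEHOLDERS = {"tender semula", "tender dibatalkan"}


def is_placeholder(contractor):
    """Return True when the contractor field looks like a status remark."""
    # Drop a leading list number such as "2. " before comparing.
    text = re.sub(r"^\s*\d+\.\s*", "", contractor).casefold()
    return " ".join(text.split()) in PLACEHOLDERS


contractors = ["Tender Semula", "2. TENDER DIBATALKAN",
               "Olak Timur Resources Sdn Bhd"]
real_sellers = [name for name in contractors if not is_placeholder(name)]
```

Awards with a placeholder contractor could then be stored without a seller link, instead of creating a seller named "Tender Semula".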
And these are just a few of the errors I noticed while browsing the print output of 3755 entries from the JKR awards... before I could even start implementing the pseudo code!
The thing is, this mess should be cleaned up before the data is stored in MongoDB; otherwise it will be difficult to identify, let alone remove, duplicate "sellers" and "buyers" in the database.
So far there is no plan for how to fix the mess.
Overall mess that I am seeing in the print output while working on the pseudo code:
Total buyers processed: 3755
Total sellers processed: 3438
of which seller is "Tender Semula": 34
of which seller is "2. TENDER DIBATALKAN": 24
Note that the figures above only count how many entries are non-empty strings.
I still need to filter out the invalid and duplicate entries before they can be stored in the "buyers" and "sellers" collections. Currently, I can't even compare for duplicates because of all the inconsistent strings (as noted in the earlier comment).
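A minimal sketch of how duplicates could be grouped once names are reduced to a comparison key. The `comparison_key` rule here (case-fold, collapse whitespace, drop trailing full stops) is an assumption and only catches the simplest variants; the sample names are illustrative.

```python
from collections import defaultdict


def comparison_key(name):
    """Reduce a name to a key: case-folded, whitespace collapsed, no trailing dot."""
    return " ".join(name.casefold().split()).rstrip(".")


def group_duplicates(names):
    """Map each comparison key to the list of raw spellings that share it."""
    groups = defaultdict(list)
    for name in names:
        if name.strip():  # skip empty strings, as in the counts above
            groups[comparison_key(name)].append(name)
    return dict(groups)


names = ["JKR Perak", "JKR PERAK", "jkr  perak.", "JKR Selangor.", ""]
groups = group_duplicates(names)
unique_count = len(groups)
```

Keeping the raw spellings inside each group makes it possible to review what would be merged before anything is written to the "buyers" or "sellers" collections.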
The mess is expected; our OCDS output is meant to be the clean, standardized version (as best as we can make it).
As of commit b0f5825 today:
1. The source data file is hard-coded via the fpath variable in the main script at ./../blob/master/telus.py
2. The values for awards, buyers and sellers are returned as JSON at the respective routes /awards, /buyers and /sellers in the web framework script at ./../blob/master/telusweb.py
Regarding point 2: for now, the result only returns the first object from the respective collection in MongoDB. It could be made to return other objects by using another prepared function that takes an argument.
Then, in case multiple objects are found, the resulting object to be parsed would contain a list of the ids of the objects found... Probably?
In fact, only one object can be parsed as the response and rendered as JSON in the web browser.
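The "single object or list of ids" idea could look roughly like the sketch below. It is independent of any web framework; `build_response` and the `count` / `ids` keys are hypothetical names, and the list of matches stands in for whatever the MongoDB query returns.

```python
import json


def build_response(matches):
    """Return one JSON-serializable object for zero, one or many matches."""
    if not matches:
        return {"count": 0, "ids": []}
    if len(matches) == 1:
        # A single match is returned as-is.
        return matches[0]
    # Several matches: return only their ids, converted to strings.
    return {"count": len(matches), "ids": [str(m["_id"]) for m in matches]}


one = build_response([{"_id": "a1", "name": "JKR Perak"}])
many = build_response([{"_id": "a1", "name": "JKR Perak"},
                       {"_id": "a2", "name": "JKR Selangor"}])
body = json.dumps(many)  # what a route would send back as JSON
```

Either way the route still returns exactly one JSON object, which matches the constraint noted above.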
Update 2021.12.25: A long overdue action. Many issues at Sinar/telus were beyond my understanding and capability as an intern, and they still are. Some four years later, I have unassigned myself to give way to other assignees, should this project continue.
Separate tenders, awards and buyer/seller (organization schema) as separate nested objects in MongoDB.