Open kaerumy opened 7 years ago
CIDB data were deceptively clean... The "directors" field itself is missing for some entries.
Then, even when the field exists, it has 5 subfields ("idenfity_card_no", "name", "nationality", "shares", "year_of_experience"), any of which may also be empty!
Bad data overview (see WARNING):
Get list of JSONL files
Found files: 5
Read CIDB data from JSONL files
Read "directors" from contractors201509220.jsonl
Missing "directors": 8 (WARNING)
Found "directors": 4993
of which empty: 2475 (WARNING)
of which non-empty: 2518
but empty subfield: 2890 (WARNING)
Total lines checked: 5001
Read "directors" from contractors201509221.jsonl
Missing "directors": 18 (WARNING)
Found "directors": 4983
of which empty: 2451 (WARNING)
of which non-empty: 2532
but empty subfield: 2841 (WARNING)
Total lines checked: 5001
Read "directors" from contractors201509222.jsonl
Missing "directors": 9 (WARNING)
Found "directors": 4992
of which empty: 2356 (WARNING)
of which non-empty: 2636
but empty subfield: 2854 (WARNING)
Total lines checked: 5001
Read "directors" from contractors201509224.jsonl
Missing "directors": 7 (WARNING)
Found "directors": 4994
of which empty: 2417 (WARNING)
of which non-empty: 2577
but empty subfield: 2824 (WARNING)
Total lines checked: 5001
Read "directors" from contractors201509226.jsonl
Missing "directors": 14 (WARNING)
Found "directors": 4987
of which empty: 2415 (WARNING)
of which non-empty: 2572
but empty subfield: 2980 (WARNING)
Total lines checked: 5001
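Based on the above, three levels of checking are required:
1. Is the "directors" field missing?
2. If it is present, is it empty?
3. If it is non-empty, is any of its subfields empty?
Hence the conversion script must implement all of the above checks to pass.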
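The three WARNING categories in the log (missing field, empty field, empty subfield) could be checked roughly as follows. This is only a sketch: `check_directors` and `summarise` are hypothetical names, and I am assuming that "empty subfield" counts individual blank subfields rather than directors, which the log does not confirm.

```python
import json
from collections import Counter

# Subfield names exactly as they appear in the CIDB data (source spelling kept).
SUBFIELDS = ("idenfity_card_no", "name", "nationality", "shares", "year_of_experience")


def check_directors(record):
    """Classify one parsed record; returns (status, number_of_empty_subfields)."""
    # Level 1: the "directors" key may be absent altogether.
    if "directors" not in record:
        return "missing", 0
    directors = record["directors"]
    # Level 2: the key exists but holds an empty list.
    if not directors:
        return "empty", 0
    # Level 3: directors exist, but individual subfields may be blank or absent.
    # Assumption: each blank subfield is counted once.
    empty_subfields = sum(
        1
        for director in directors
        for key in SUBFIELDS
        if not str(director.get(key) or "").strip()
    )
    return "non-empty", empty_subfields


def summarise(lines):
    """Tally the three checks over an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines instead of failing in json.loads
        status, empty_subfields = check_directors(json.loads(line))
        counts[status] += 1
        counts["empty subfield"] += empty_subfields
        counts["total"] += 1
    return counts
```

An open file object can be passed straight to `summarise`, since iterating over it yields one line at a time.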
P.S.: Identifying bad data took longer for CIDB than for the JKR or MyProcurement data. I hope I didn't miss any other bad data beyond the checks above... Will start writing the script soon.
P.P.S.: Pasted the corrected output. It now contains the "but empty subfield" information.
Heck, more unexpected bad data. I am not amused by the hours I had to spend discovering why the converted output differed from what I expected.
Take a look at the following example of a "directors" field:
"directors": [
    {
        "idenfity_card_no": "670802025959",
        "name": "ROSLI BIN AHMAD",
        "nationality": "MALAYSIA",
        "shares": "10",
        "year_of_experience": "6"
    },
    {
        "idenfity_card_no": "610907025483",
        "name": "ABDUL GHANI BIN ISA RULHAK KHAN",
        "nationality": "MALAYSIA",
        "shares": "90",
        "year_of_experience": "11"
    }
],
Notice any error? There is one error for each item. I won't spoil the answer.
P.S.: The downside of "copy and paste" is that one wouldn't notice any bad data.
P.P.S.: It is more a typo that was never corrected than a typical error.
Preview output of early implementation:
When "directors" field is found
{
    "statementGroups": [
        {
            "beneficialOwnershipStatements": [
                {
                    "#comment": "Firm details here",
                    "entity": {},
                    "id": "f3a09fe220e0436f88dbd7a315c41331",
                    "interestedParty": {},
                    "interests": [],
                    "statementDate": "2017-09-29"
                },
                {
                    "#comment": "Director details here",
                    "entity": {},
                    "id": "614421c4be53454893b4aa9ace8db2c2",
                    "interestedParty": {},
                    "interests": [],
                    "statementDate": "2017-09-29"
                },
                {
                    "#comment": "Director details here",
                    "entity": {},
                    "id": "b53df34637d247889d96e53895d90cd3",
                    "interestedParty": {},
                    "interests": [],
                    "statementDate": "2017-09-29"
                }
            ],
            "id": "cea43e9ff4e74c70b174d94d50cedcb9-meta-70175"
        }
    ]
}
When "directors" field is missing or empty
{
    "statementGroups": [
        {
            "beneficialOwnershipStatements": [
                {
                    "#comment": "Firm details here",
                    "entity": {},
                    "id": "8c8caec6b1bc41baa1ee9f16edec2754",
                    "interestedParty": {},
                    "interests": [],
                    "statementDate": "2017-09-29"
                },
                {
                    "#comment": "No director details",
                    "entity": {},
                    "id": "1f724069492c4324843542e3c28f09aa",
                    "interestedParty": {},
                    "interests": [],
                    "statementDate": "2017-09-29"
                }
            ],
            "id": "856435d4e7ac4d5f90fc6995a29a6fac-meta-89375"
        }
    ]
}
The above is the output after rewriting the script for the third time (after hours of confusion caused by bad data). The nested objects are not ready yet because the subfield values in the CIDB data are inconsistent.
Note that the "#comment" fields only indicate what data will be inserted and are not part of the BODS schema (although the given examples use "#summary" in a similar way).
Will push more commits in stages.
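For reference, the skeleton output above could be produced along these lines. This is a sketch under assumptions, not the actual script: `new_statement` and `build_statement_group` are hypothetical names, the 32-character ids are assumed to come from `uuid.uuid4().hex`, and the group id is assumed to be a random hex string followed by "-meta-" and the record's meta id (that is only my reading of the preview).

```python
import uuid
from datetime import date


def new_statement(comment, statement_date):
    """Return an empty statement skeleton with a random 32-character hex id."""
    return {
        "#comment": comment,  # placeholder note only, not a BODS field
        "entity": {},
        "id": uuid.uuid4().hex,
        "interestedParty": {},
        "interests": [],
        "statementDate": statement_date,
    }


def build_statement_group(record, statement_date=None):
    """Build one statement group: a firm statement plus one per director."""
    statement_date = statement_date or date.today().isoformat()
    statements = [new_statement("Firm details here", statement_date)]
    # A missing key and an empty list are handled the same way.
    directors = record.get("directors") or []
    if directors:
        statements.extend(
            new_statement("Director details here", statement_date)
            for _ in directors
        )
    else:
        statements.append(new_statement("No director details", statement_date))
    meta_id = record.get("meta", {}).get("id", "")
    return {
        "statementGroups": [
            {
                "beneficialOwnershipStatements": statements,
                # Assumed id scheme: random hex + "-meta-" + the record's meta id.
                "id": f"{uuid.uuid4().hex}-meta-{meta_id}",
            }
        ]
    }
```

Serialising the result with `json.dumps(result, indent=4, sort_keys=True)` gives the same key order as the previews.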
Again and again, I need to remind myself that bad data exists in the source. Yet I overlooked the fact that almost-blank records may sit between valid ones. Something like this:
{
    "Alamat Berdaftar seperti Didalam Sijil SSM": {
        "Alamat": "",
        "Alamat 1": "",
        "Alamat 2": "",
        "Bandar": "",
        "Emel": "",
        "Fax": "",
        "Negeri": "",
        "Poskod": "",
        "Telefon": ""
    },
    "Alamat Surat Menyurat": {
        "Alamat": "",
        "Alamat 1": "",
        "Alamat 2": "",
        "Bandar": "",
        "Emel": "",
        "Fax": "",
        "Negeri": "",
        "Poskod": "",
        "Telefon": ""
    },
    "Profil": {
        "Gred Kontraktor": "",
        "Jenis Syarikat": "",
        "Lain-lain Lesen": "-",
        "Lesen Perdagangan": "-",
        "Lesen Perniagaan": "-",
        "Nombor Pendaftaran": "-",
        "Nombor Pendaftaran Lain": "-",
        "ROB": "-",
        "ROC": "-",
        "Tarikh Luput Sijil Pendaftaran CIDB": "-"
    },
    "directors": [],
    "meta": {
        "id": "180144",
        "status": ""
    },
    "name": "",
    "projects": []
}
The bad data is not going to get any worse than this, I hope... I still need to generate the other half of the BODS-CIDB data, but I am held back by overlooked issues like this one.
P.S.: There are no shortcuts in writing the script. It must consider all kinds of bad data.
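One way to catch such almost-blank records before conversion is sketched below. It assumes that "" and "-" are the only placeholder values (those are the two seen in the example) and that "meta" should be ignored because it still carries an id; `is_blank` and `is_blank_record` are hypothetical names.

```python
# Values that carry no information in the example record.
PLACEHOLDERS = {"", "-"}


def is_blank(value):
    """Return True if a value (possibly nested) contains only placeholders."""
    if isinstance(value, dict):
        return all(is_blank(v) for v in value.values())
    if isinstance(value, list):
        # all() over an empty list is True, so [] counts as blank.
        return all(is_blank(v) for v in value)
    if isinstance(value, str):
        return value.strip() in PLACEHOLDERS
    return value is None


def is_blank_record(record):
    """Check every top-level field except "meta", which still holds an id."""
    return all(is_blank(v) for k, v in record.items() if k != "meta")
```

A record flagged this way can then be routed to the "no firm details" branch instead of being treated as a valid firm.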
Preview output of implementation in progress:
When both firm and director details are missing
{
    "statementGroups": [
        {
            "beneficialOwnershipStatements": [
                {
                    "#comment": "no firm details",
                    "entity": {},
                    "id": "eb472ea501594069ae3bf3b7304b6f87",
                    "interestedParty": {},
                    "interests": [
                        {
                            "annotations": [
                                {
                                    "description": "no firm details"
                                }
                            ],
                            "interestLevel": "unknown",
                            "share": {
                                "exact": 0.0
                            },
                            "type": "shareholding"
                        }
                    ],
                    "statementDate": "2017-09-29"
                },
                {
                    "#comment": "no director details",
                    "entity": {},
                    "id": "05adbfabc0cc4b09aaac609397ebed2c",
                    "interestedParty": {},
                    "interests": [
                        {
                            "annotations": [
                                {
                                    "description": "no shares or empty"
                                }
                            ],
                            "interestLevel": "unknown",
                            "share": {
                                "exact": 0.0
                            },
                            "type": "shareholding"
                        }
                    ],
                    "statementDate": "2017-09-29"
                }
            ],
            "id": "853bb23ed87d4e9bb666ce70494daa4c-meta-176063"
        }
    ]
}
So far, this would be the worst-case scenario. Nested objects will be added in later commits.
P.S.: Updated the preview output. Even a few extra characters increase the script's run time and produce larger files. Use short, concise comments whenever possible.
Okay... I think I found one mistake. When using a null statement, the "interests" field should be an empty list (in the existing implementation it is not empty).
I will fix this first, before posting a new issue for manual validation.
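A minimal sketch of that fix, assuming "shares" arrives as a string as in the CIDB examples: `build_interests` is a hypothetical helper, and the "interestLevel" and "type" values are simply copied from the preview output rather than checked against the BODS schema.

```python
def build_interests(director):
    """Return the "interests" list for one director dict (or None if absent)."""
    raw = str((director or {}).get("shares") or "").strip()
    try:
        share = float(raw)
    except ValueError:
        # Missing, empty, or non-numeric shares: null statement, so no interests.
        return []
    return [
        {
            "interestLevel": "unknown",  # value copied from the preview output
            "share": {"exact": share},
            "type": "shareholding",
        }
    ]
```

With this shape, a record without usable shares yields `"interests": []` instead of a placeholder interest with a 0.0 share.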
See the Joint Ownership example, since companies often have 2-3 directors: http://beneficial-ownership-data-standard.readthedocs.io/en/master/examples.html