Sinar / openownership-scripts

MIT License

Convert CIDB Company Ownership (Directors) data into OpenOwnership JSON Format #1

Open kaerumy opened 7 years ago

kaerumy commented 7 years ago

See the Joint Ownership example, as companies often have 2-3 directors: http://beneficial-ownership-data-standard.readthedocs.io/en/master/examples.html

kmubiin commented 7 years ago

The CIDB data looked deceptively clean... The "directors" field itself is missing for some entries.

Then, even when the field exists, it has 5 subfields: "idenfity_card_no", "name", "nationality", "shares", "year_of_experience", any of which may also be empty!

Bad data overview (see WARNING):

Get list of JSONL files
Found files: 5
Read CIDB data from JSONL files
Read "directors" from contractors201509220.jsonl
Missing "directors": 8 (WARNING)
Found "directors": 4993
 of which empty: 2475 (WARNING)
 of which non-empty: 2518
  but empty subfield: 2890 (WARNING)
Total lines checked: 5001
Read "directors" from contractors201509221.jsonl
Missing "directors": 18 (WARNING)
Found "directors": 4983
 of which empty: 2451 (WARNING)
 of which non-empty: 2532
  but empty subfield: 2841 (WARNING)
Total lines checked: 5001
Read "directors" from contractors201509222.jsonl
Missing "directors": 9 (WARNING)
Found "directors": 4992
 of which empty: 2356 (WARNING)
 of which non-empty: 2636
  but empty subfield: 2854 (WARNING)
Total lines checked: 5001
Read "directors" from contractors201509224.jsonl
Missing "directors": 7 (WARNING)
Found "directors": 4994
 of which empty: 2417 (WARNING)
 of which non-empty: 2577
  but empty subfield: 2824 (WARNING)
Total lines checked: 5001
Read "directors" from contractors201509226.jsonl
Missing "directors": 14 (WARNING)
Found "directors": 4987
 of which empty: 2415 (WARNING)
 of which non-empty: 2572
  but empty subfield: 2980 (WARNING)
Total lines checked: 5001

Based on the above, three levels of checking are required:

  1. Check that the "directors" field exists; if not, handle the missing field
  2. Check that the "directors" field is non-empty; if not, handle the empty list
  3. Check that none of the known subfields is empty; if any is, handle the blank value

Hence the conversion script must implement the checks above in order to pass.
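The three levels of checking could be sketched roughly as below. This is a minimal illustration, not the script from this repository: the function names `check_directors` and `summarise` are hypothetical, and the sketch counts records rather than trying to reproduce the exact numbers in the log above. The subfield names are spelled exactly as they appear in the CIDB source.

```python
import json

# Subfield names exactly as spelled in the CIDB source data.
SUBFIELDS = ("idenfity_card_no", "name", "nationality", "shares", "year_of_experience")


def check_directors(record):
    """Return "missing", "empty", "empty_subfield" or "ok" for one record (hypothetical helper)."""
    if "directors" not in record:        # level 1: field is absent
        return "missing"
    directors = record["directors"]
    if not directors:                    # level 2: field is present but empty
        return "empty"
    for director in directors:           # level 3: any known subfield is blank
        for key in SUBFIELDS:
            if not str(director.get(key) or "").strip():
                return "empty_subfield"
    return "ok"


def summarise(lines):
    """Count check results over an iterable of JSONL lines, skipping blank lines."""
    counts = {"missing": 0, "empty": 0, "empty_subfield": 0, "ok": 0}
    for line in lines:
        if line.strip():
            counts[check_directors(json.loads(line))] += 1
    return counts
```

Passing an open file object to `summarise` would give per-file counts; each level returns early so that a later check never touches a field that an earlier check found to be missing.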

P.S.: The CIDB data took longer to check for bad data than the JKR or MyProcurement data. I hope I didn't miss other bad data beyond the checks above... Will start writing the script soon.

P.P.S.: Pasted the correct output. It now contains the "but empty subfield" information.

kmubiin commented 7 years ago

Heck, more unexpected bad data. I am not amused by the hours I had to spend discovering why the converted output differed from what I expected.

Take a look at the following example of a "directors" field:

  "directors": [
    {
      "idenfity_card_no": "670802025959", 
      "name": "ROSLI BIN AHMAD", 
      "nationality": "MALAYSIA", 
      "shares": "10", 
      "year_of_experience": "6"
    }, 
    {
      "idenfity_card_no": "610907025483", 
      "name": "ABDUL GHANI BIN ISA RULHAK KHAN", 
      "nationality": "MALAYSIA", 
      "shares": "90", 
      "year_of_experience": "11"
    }
  ],

Notice any error? There is one error in each item. I won't spoil the answer.

P.S.: The downside of "copy and paste" is that one wouldn't notice any bad data.

P.P.S.: It is more of an uncorrected typo than a typical error.
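One possible reading of the hint, offered as a guess rather than as the author's stated answer: the key is spelled "idenfity_card_no" in every item, so a lookup using the correctly spelled "identity_card_no" silently returns nothing. A small sketch of a lookup that tolerates either spelling (the helper name `get_identity_card_no` is hypothetical):

```python
director = {
    "idenfity_card_no": "670802025959",
    "name": "ROSLI BIN AHMAD",
    "nationality": "MALAYSIA",
    "shares": "10",
    "year_of_experience": "6",
}


def get_identity_card_no(item):
    """Try the spelling used in the source first, then the corrected spelling."""
    for key in ("idenfity_card_no", "identity_card_no"):
        value = item.get(key)
        if value:
            return value
    return None


# The correctly spelled key is absent, so a plain .get() finds nothing.
plain_lookup = director.get("identity_card_no")
tolerant_lookup = get_identity_card_no(director)
```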

kmubiin commented 7 years ago

Preview output of early implementation:

When "directors" field is found

{
  "statementGroups": [
    {
      "beneficialOwnershipStatements": [
        {
          "#comment": "Firm details here", 
          "entity": {}, 
          "id": "f3a09fe220e0436f88dbd7a315c41331", 
          "interestedParty": {}, 
          "interests": [], 
          "statementDate": "2017-09-29"
        }, 
        {
          "#comment": "Director details here", 
          "entity": {}, 
          "id": "614421c4be53454893b4aa9ace8db2c2", 
          "interestedParty": {}, 
          "interests": [], 
          "statementDate": "2017-09-29"
        }, 
        {
          "#comment": "Director details here", 
          "entity": {}, 
          "id": "b53df34637d247889d96e53895d90cd3", 
          "interestedParty": {}, 
          "interests": [], 
          "statementDate": "2017-09-29"
        }
      ], 
      "id": "cea43e9ff4e74c70b174d94d50cedcb9-meta-70175"
    }
  ]
}

When "directors" field is missing or empty

{
  "statementGroups": [
    {
      "beneficialOwnershipStatements": [
        {
          "#comment": "Firm details here", 
          "entity": {}, 
          "id": "8c8caec6b1bc41baa1ee9f16edec2754", 
          "interestedParty": {}, 
          "interests": [], 
          "statementDate": "2017-09-29"
        }, 
        {
          "#comment": "No director details", 
          "entity": {}, 
          "id": "1f724069492c4324843542e3c28f09aa", 
          "interestedParty": {}, 
          "interests": [], 
          "statementDate": "2017-09-29"
        }
      ], 
      "id": "856435d4e7ac4d5f90fc6995a29a6fac-meta-89375"
    }
  ]
}

The above is the output after rewriting the script for the third time (after hours of confusion caused by bad data). The nested objects are not ready yet, due to inconsistent sub-field values in the CIDB data.

Note that the "#comment" fields indicate what data will be inserted and are not part of the BODS schema (although "#summary" is used similarly in the given examples).

Will push more commits in stages.
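A skeleton like the previews above could be produced along these lines. This is only a sketch under assumptions: the 32-character ids are assumed to be `uuid.uuid4().hex` values, the group id suffix is assumed to be the record's "meta" id, and the function names are hypothetical.

```python
import uuid
from datetime import date


def empty_statement(comment, statement_date):
    """Build one placeholder statement; nested objects are left empty for now."""
    return {
        "#comment": comment,  # not part of the BODS schema, only a marker
        "entity": {},
        "id": uuid.uuid4().hex,  # assumption: ids are random UUID hex strings
        "interestedParty": {},
        "interests": [],
        "statementDate": statement_date,
    }


def build_statement_group(meta_id, directors, statement_date=None):
    """Build one firm statement plus one statement per director (or a single placeholder)."""
    statement_date = statement_date or date.today().isoformat()
    statements = [empty_statement("Firm details here", statement_date)]
    if directors:
        for _ in directors:
            statements.append(empty_statement("Director details here", statement_date))
    else:
        # Covers both a missing and an empty "directors" field.
        statements.append(empty_statement("No director details", statement_date))
    group_id = "{}-meta-{}".format(uuid.uuid4().hex, meta_id)  # assumed id format
    return {"statementGroups": [{"beneficialOwnershipStatements": statements, "id": group_id}]}
```

Calling `build_statement_group(record["meta"]["id"], record.get("directors"))` handles the missing-field case as well, because `record.get` returns `None` when the key is absent.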

kmubiin commented 7 years ago

Again and again, I need to remind myself that bad data exists in the source. Yet I overlooked the fact that almost-blank records may sit between valid ones. Something like this:

{
  "Alamat Berdaftar seperti Didalam Sijil SSM": {
    "Alamat": "", 
    "Alamat 1": "", 
    "Alamat 2": "", 
    "Bandar": "", 
    "Emel": "", 
    "Fax": "", 
    "Negeri": "", 
    "Poskod": "", 
    "Telefon": ""
  }, 
  "Alamat Surat Menyurat": {
    "Alamat": "", 
    "Alamat 1": "", 
    "Alamat 2": "", 
    "Bandar": "", 
    "Emel": "", 
    "Fax": "", 
    "Negeri": "", 
    "Poskod": "", 
    "Telefon": ""
  }, 
  "Profil": {
    "Gred Kontraktor": "", 
    "Jenis Syarikat": "", 
    "Lain-lain Lesen": "-", 
    "Lesen Perdagangan": "-", 
    "Lesen Perniagaan": "-", 
    "Nombor Pendaftaran": "-", 
    "Nombor Pendaftaran Lain": "-", 
    "ROB": "-", 
    "ROC": "-", 
    "Tarikh Luput Sijil Pendaftaran CIDB": "-"
  }, 
  "directors": [], 
  "meta": {
    "id": "180144", 
    "status": ""
  }, 
  "name": "", 
  "projects": []
}

Bad data is not going to get any worse than this, I hope... I need to generate the other half of the BODS-CIDB data, but am held back by overlooked issues of this kind.

P.S.: I cannot take shortcuts when writing the script. It must consider all kinds of bad data.
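An almost-blank record like the one above could be detected with a recursive check such as the following sketch. It assumes that "" and "-" are the only placeholder values and that the "meta" block should be ignored because its "id" is always filled in; `is_blank` and `is_blank_record` are hypothetical names.

```python
PLACEHOLDERS = {"", "-"}  # assumption: the only values used for "no data"


def is_blank(value):
    """True if a value carries no data: a placeholder string, None, or an empty/blank container."""
    if isinstance(value, dict):
        return all(is_blank(v) for v in value.values())
    if isinstance(value, (list, tuple)):
        return all(is_blank(v) for v in value)
    if value is None:
        return True
    return str(value).strip() in PLACEHOLDERS


def is_blank_record(record):
    """True if every field except "meta" is blank."""
    return all(is_blank(v) for k, v in record.items() if k != "meta")
```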

kmubiin commented 7 years ago

Preview output of implementation in progress:

When both firm and director details are missing

{
  "statementGroups": [
    {
      "beneficialOwnershipStatements": [
        {
          "#comment": "no firm details", 
          "entity": {}, 
          "id": "eb472ea501594069ae3bf3b7304b6f87", 
          "interestedParty": {}, 
          "interests": [
            {
              "annotations": [
                {
                  "description": "no firm details"
                }
              ], 
              "interestLevel": "unknown", 
              "share": {
                "exact": 0.0
              }, 
              "type": "shareholding"
            }
          ], 
          "statementDate": "2017-09-29"
        }, 
        {
          "#comment": "no director details", 
          "entity": {}, 
          "id": "05adbfabc0cc4b09aaac609397ebed2c", 
          "interestedParty": {}, 
          "interests": [
            {
              "annotations": [
                {
                  "description": "no shares or empty"
                }
              ], 
              "interestLevel": "unknown", 
              "share": {
                "exact": 0.0
              }, 
              "type": "shareholding"
            }
          ], 
          "statementDate": "2017-09-29"
        }
      ], 
      "id": "853bb23ed87d4e9bb666ce70494daa4c-meta-176063"
    }
  ]
}

So far, this would be the worst-case scenario. Nested objects will be added in later commits.

P.S.: Updated the preview output. Even a few extra characters increase the script's run time and result in larger files. Use short, concise comments whenever possible.

kmubiin commented 7 years ago

Okay... I think I found one mistake. When using a null statement, the "interests" field should be an empty list (in the existing implementation it is not empty).

I will fix this first, before posting a new issue for manual validation.
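The fix described here might look like the sketch below, which clears "interests" whenever a statement is flagged as a null statement. How the real script decides that a statement is null is not shown in this thread, so the `is_null` flag and the name `finalise_statement` are assumptions.

```python
def finalise_statement(statement, is_null):
    """Return a copy of the statement; a null statement gets an empty "interests" list."""
    result = dict(statement)  # shallow copy so the caller's dict is left untouched
    if is_null:
        result["interests"] = []
    return result
```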