Sinar / openownership-scripts

MIT License

Manual validation for BODS-CIDB data #2

Closed kmubiin closed 2 years ago

kmubiin commented 7 years ago

This issue is now marked as "invalid" and can be closed by the original author.

NOTE by the original author: Update the README (fix deprecated text and links), then close this issue with a commit whose description uses GitHub's special closing keywords.

The BODS format has changed, starting with the 0.1 release in 2019 (archived). As with its OCDS counterpart, a BODS data review tool has been available since 2020 (archived). Because of the changes made from the 0.1 release onward, anything written against the 0.1 draft is deprecated and now considered invalid. As such, the BODS-CIDB data converted four years ago will fail validation, as expected.

Whether the script and data should be rewritten or updated by future contributors is a separate matter from this issue. As of today, there is no need for manual validation; use the online tool instead.

ORIGINAL ISSUE (2017)

This follows issue #1, which converts CIDB data to BODS format.

Unlike the Open Contracting Data Standard (OCDS), which has a stable schema version and a dedicated web site for validation, the Beneficial Ownership Data Standard (BODS) has neither as of this posting date.

Therefore, more eyes are needed to validate the Malaysian data, i.e. the BODS-CIDB data.

How to validate

The only way to validate BODS at the moment is to follow these steps:

  1. Read the docs for the specification;
  2. Look closely at the examples given in the GitHub repo (the relevant ones are "joint ownership" and "null statement");
  3. Validate manually (paste one line at a time into a JSON file, open the JSON file in a web browser, then cross-check the fields by eye; anything else?)

Firefox has a "Filter JSON" search box at the upper-right corner when viewing a JSON file. This is useful for quickly checking a particular ID string that is supposed to be shared between multiple statements.

Similar checks could be done with grep and other command-line tools. But the easiest way is to open the JSON in Firefox and check it in the pretty-print or collapsible-objects layout.
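
The shared-ID check described above could also be scripted. What follows is only a minimal sketch, not part of the repository's script: it assumes each line of the .jsonl file is one JSON object with a "statementGroups" list, as in the examples pasted in the comments, and the function names `shared_identifiers` and `check_file` are hypothetical.

```python
import json
from collections import defaultdict


def shared_identifiers(group):
    """Map each identifier id to the statement ids that mention it.

    Only identifiers that occur in more than one statement are returned.
    """
    seen = defaultdict(set)
    for statement in group.get("beneficialOwnershipStatements", []):
        for side in ("entity", "interestedParty"):
            party = statement.get(side) or {}
            for identifier in party.get("identifiers", []):
                # Skip empty ids, which appear in the known bad data.
                if identifier.get("id"):
                    seen[identifier["id"]].add(statement.get("id"))
    return {key: sorted(ids) for key, ids in seen.items() if len(ids) > 1}


def check_file(path):
    """Print the shared identifiers found on each line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue
            document = json.loads(line)
            for group in document.get("statementGroups", []):
                print(number, shared_identifiers(group))
```

For a joint-ownership group, the identifier of the "Joint shareholding" arrangement would be expected to show up in every statement of the group; an empty result for such a group would be worth a closer look by eye.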

Known bad data

Several issues identified in the Malaysian data, i.e. the BODS-CIDB data:

The workarounds noted above are already implemented in the script.

Known bad data at worst

There is no workaround for the worst of the bad data, because the source data itself is bad.

Known empty fields

Example JSON

A few examples of the result data are pasted as JSON in separate comments below.

There are three kinds of result data:

  1. For both existing firm and director
  2. For existing firm but empty director
  3. For both empty firm and director (bad data)

The third kind (bad data) probably does not need validating, but it is pasted anyway as an example.
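
To sort a large file into these three kinds before checking by eye, something like the following could be used. It is a rough sketch under assumptions: it relies only on the field values visible in the example outputs in this thread ("type": "unknown" on the interested party, "type": "unknownEntity" on the entity), and the function name `classify_group` and the returned labels are made up for illustration.

```python
def classify_group(group):
    """Return which of the three kinds of result data a statement group is."""
    statements = group.get("beneficialOwnershipStatements", [])
    if not statements:
        return "no statements"
    # The firm is always described in the first statement of a group.
    first = statements[0]
    entity_type = (first.get("entity") or {}).get("type")
    party_type = (first.get("interestedParty") or {}).get("type")
    if party_type != "unknown":
        return "existing firm and director"
    if entity_type == "unknownEntity":
        return "empty firm and director"
    return "existing firm but empty director"
```

Groups labelled "empty firm and director" could then be set aside, leaving the first two kinds for actual validation.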

Update 2017.10.02 Separated the sections for "Known bad data" and "Known empty fields" because these are two different things.

Update 2017.10.03 Reviewed this issue, fixed typos, added text, and updated the comments for the new output. Added a description and example data for each known item.

kmubiin commented 7 years ago

For both existing firm and director

{
  "statementGroups": [
    {
      "beneficialOwnershipStatements": [
        {
          "entity": {
            "addresses": [
              {
                "address": "BLOK B-7-20, PPR KG. BARU AIR PANAS, JALAN USAHAWAN 6, KUALA LUMPUR, WILAYAH PERSEKUTUAN",
                "country": "MY",
                "postCode": "53200",
                "type": "residence"
              },
              {
                "address": "BLOK B-7-20, PPR KG. BARU AIR PANAS, JALAN USAHAWAN 6, KUALA LUMPUR, WILAYAH PERSEKUTUAN",
                "country": "MY",
                "postCode": "53200",
                "type": "registered"
              }
            ],
            "foundingDate": "",
            "id": "d4da3ada2883490d9c8aecc07d3bd289",
            "identifiers": [
              {
                "id": "0120001010-WP060310",
                "schema": "CIDB-registered"
              }
            ],
            "jurisdiction": "MY",
            "name": "RENONGAN EMAS ENTERPRISE",
            "statementDate": "2017-10-03",
            "type": "registeredEntity"
          },
          "id": "7c9f62bc919646219a52ed1543399a5b",
          "interestedParty": {
            "id": "0c045fb351db4f7bba18a6b810c98dbc",
            "identifiers": [
              {
                "id": "4ad2fb1abdc847e5894d3332914fb65a",
                "schema": "UUID-HEX"
              }
            ],
            "name": "Joint shareholding",
            "statementDate": "2017-10-03",
            "type": "arrangement"
          },
          "interests": [
            {
              "interestLevel": "direct",
              "share": {
                "exact": 100
              },
              "type": "shareholding"
            }
          ],
          "statementDate": "2017-10-03"
        },
        {
          "entity": {
            "foundingDate": "",
            "id": "9da1f9d9006c4e778013d3050eadf5ee",
            "identifiers": [
              {
                "id": "4ad2fb1abdc847e5894d3332914fb65a",
                "schema": "UUID-HEX"
              }
            ],
            "jurisdiction": "MY",
            "name": "Joint shareholding",
            "statementDate": "2017-10-03",
            "type": "arrangement"
          },
          "id": "14d164627c2d4e778c820bad9471a7fb",
          "interestedParty": {
            "id": "f0ef4de0a0db499cb4c5886242e620f2",
            "identifiers": [
              {
                "id": "MYS-IDCARD-670802025959",
                "schema": "id-card"
              }
            ],
            "name": "ROSLI BIN AHMAD",
            "nationalities": [
              "MY"
            ],
            "statementDate": "2017-10-03",
            "type": "naturalPerson"
          },
          "interests": [
            {
              "interestLevel": "direct",
              "share": {
                "exact": 10
              },
              "type": "shareholding"
            }
          ],
          "statementDate": "2017-10-03"
        },
        {
          "entity": {
            "foundingDate": "",
            "id": "1f088778816c431499017f433b1d31bb",
            "identifiers": [
              {
                "id": "4ad2fb1abdc847e5894d3332914fb65a",
                "schema": "UUID-HEX"
              }
            ],
            "jurisdiction": "MY",
            "name": "Joint shareholding",
            "statementDate": "2017-10-03",
            "type": "arrangement"
          },
          "id": "cc3bcc43f8a2405fabb95d817ffc4f23",
          "interestedParty": {
            "id": "65099a60e561434db5f11d6ac0dc0005",
            "identifiers": [
              {
                "id": "MYS-IDCARD-610907025483",
                "schema": "id-card"
              }
            ],
            "name": "ABDUL GHANI BIN ISA RULHAK KHAN",
            "nationalities": [
              "MY"
            ],
            "statementDate": "2017-10-03",
            "type": "naturalPerson"
          },
          "interests": [
            {
              "interestLevel": "direct",
              "share": {
                "exact": 90
              },
              "type": "shareholding"
            }
          ],
          "statementDate": "2017-10-03"
        }
      ],
      "id": "7ac9d1b8b1bb4470a2ecfbcbfd9585c8-meta-70175"
    }
  ]
}

From line 1 of ./data/bods-contractors201509220.jsonl

Remarks

Pasted the new output above. Except for the "id" values, which are newly generated at runtime, there are no changes for this case. This also indicates that the pushed commits introduce the changes correctly.

Changes should be seen only in the cases "for existing firm but empty director" and "for both empty firm and director (bad data)". See the updated comments below.
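
A comparison of old and new output that ignores the regenerated ids could be done roughly as follows. This is a sketch under one assumption that is not confirmed by the script itself: that every runtime-generated id is a 32-character lowercase hex string (as in the examples here), possibly embedded in a longer value such as the "-meta-" group ids. The names `mask_generated_ids` and `same_apart_from_ids` are hypothetical.

```python
import re

# Assumption: generated ids look like UUIDs in hex form (32 lowercase hex digits).
HEX_ID = re.compile(r"[0-9a-f]{32}")


def mask_generated_ids(value):
    """Return a copy of a parsed JSON value with generated ids masked out."""
    if isinstance(value, dict):
        return {key: mask_generated_ids(item) for key, item in value.items()}
    if isinstance(value, list):
        return [mask_generated_ids(item) for item in value]
    if isinstance(value, str):
        return HEX_ID.sub("<generated>", value)
    return value


def same_apart_from_ids(old, new):
    """True when two parsed outputs differ only in their generated ids."""
    return mask_generated_ids(old) == mask_generated_ids(new)
```

Note that masking also hides whether an id is still shared between the same statements, so this only complements the shared-ID check; it does not replace it.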

kmubiin commented 7 years ago

For existing firm but empty director

{
  "statementGroups": [
    {
      "beneficialOwnershipStatements": [
        {
          "entity": {
            "addresses": [
              {
                "address": "PETI SURAT 36, MEMBAKUT, SABAH", 
                "country": "MY", 
                "postCode": "89727", 
                "type": "residence"
              }, 
              {
                "address": "PETI SURAT 36, MEMBAKUT, SABAH", 
                "country": "MY", 
                "postCode": "89727", 
                "type": "registered"
              }
            ], 
            "foundingDate": "", 
            "id": "5c67cd53b0ae4614a791ef217bd28024", 
            "identifiers": [
              {
                "id": "0120021209-SB078274", 
                "schema": "CIDB-registered"
              }
            ], 
            "jurisdiction": "MY", 
            "name": "B & J ENTERPRISE", 
            "statementDate": "2017-10-03", 
            "type": "registeredEntity"
          }, 
          "id": "852e3f93595a43b28d4e3f66c78daf80", 
          "interestedParty": {
            "description": "no beneficial owner in source", 
            "type": "unknown"
          }, 
          "interests": [], 
          "statementDate": "2017-10-03"
        }
      ], 
      "id": "538c9515727748b7b3304af3d4fb7c67-meta-89375"
    }
  ]
}

From line 7 of ./data/bods-contractors201509220.jsonl

Remarks

The current implementation may be wrong. When a beneficial owner is not found, there should be only one statement instead of two (now fixed; the new output is pasted above).

kmubiin commented 7 years ago

For both empty firm and director

{
  "statementGroups": [
    {
      "beneficialOwnershipStatements": [
        {
          "entity": {
            "addresses": [], 
            "foundingDate": "", 
            "id": "423e9f12e0fa4ee89a46aa83d684f59a", 
            "identifiers": [
              {
                "id": "", 
                "schema": ""
              }
            ], 
            "jurisdiction": "MY", 
            "name": "Joint shareholding", 
            "statementDate": "2017-10-03", 
            "type": "unknownEntity"
          }, 
          "id": "b12c42c3c1c042f6af702ce8fccda50b", 
          "interestedParty": {
            "description": "no beneficial owner in source", 
            "type": "unknown"
          }, 
          "interests": [], 
          "statementDate": "2017-10-03"
        }
      ], 
      "id": "811116e767394728bcb780832578eab4-meta-176063"
    }
  ]
}

From line 83 of ./data/bods-contractors201509220.jsonl

Remarks

This is an example of a result that probably does not need validating and is probably safe to ignore, since the source has invalid entries of this kind scattered among the other entries (the output above has been updated).

Invalid entry for the same line in source:

{
  "Alamat Berdaftar seperti Didalam Sijil SSM": {
    "Alamat": "", 
    "Alamat 1": "", 
    "Alamat 2": "", 
    "Bandar": "", 
    "Emel": "", 
    "Fax": "", 
    "Negeri": "", 
    "Poskod": "", 
    "Telefon": ""
  }, 
  "Alamat Surat Menyurat": {
    "Alamat": "", 
    "Alamat 1": "", 
    "Alamat 2": "", 
    "Bandar": "", 
    "Emel": "", 
    "Fax": "", 
    "Negeri": "", 
    "Poskod": "", 
    "Telefon": ""
  }, 
  "Profil": {
    "Gred Kontraktor": "", 
    "Jenis Syarikat": "", 
    "Lain-lain Lesen": "-", 
    "Lesen Perdagangan": "-", 
    "Lesen Perniagaan": "-", 
    "Nombor Pendaftaran": "-", 
    "Nombor Pendaftaran Lain": "-", 
    "ROB": "-", 
    "ROC": "-", 
    "Tarikh Luput Sijil Pendaftaran CIDB": "-"
  }, 
  "directors": [], 
  "meta": {
    "id": "176063", 
    "status": ""
  }, 
  "name": "", 
  "projects": []
}

In other words, there is nothing to validate in this example of bad data, unlike the previous example, which has an "existing firm but empty director".

kmubiin commented 7 years ago

First self-validation

According to the null statement example in the GitHub repo, when no beneficial owner is located, there is no statement for a person at all, and the only statement uses the NullParty component from the BODS docs.

In comparison, the script's current implementation still generates a statement for the person and uses the NullParty component in that statement (instead of in the statement for the firm).

So the existing script might be wrong. In other words, there should be only one statement instead of two when no beneficial owner is found.
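
The "only one statement" rule could be checked automatically along these lines. This is a hedged sketch, not the fix itself: it assumes a NullParty shows up as an interested party with "type": "unknown", as in the example output in this thread, and `null_party_problems` is a hypothetical helper name.

```python
def null_party_problems(group):
    """List problems with how a statement group uses the NullParty component."""
    statements = group.get("beneficialOwnershipStatements", [])
    # Assumption: a NullParty is an interested party whose type is "unknown".
    null_statements = [
        statement for statement in statements
        if (statement.get("interestedParty") or {}).get("type") == "unknown"
    ]
    problems = []
    if null_statements and len(statements) != 1:
        problems.append(
            "group %s has a NullParty but %d statements (expected 1)"
            % (group.get("id"), len(statements))
        )
    return problems
```

An empty list means the group either has a known interested party or consists of a single statement carrying the NullParty.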

I will fix this soon.

P.S.: Reviewed the fix in the commit referenced below this comment. The actual fix was done in stages, across several commits before the referenced one.