Sinar / telus

Joined Up Data transparency project for PEPs, OCDS & Beneficial Ownership
GNU General Public License v3.0
0 stars 0 forks source link

Store data into database or record file #4

Closed kmubiin closed 7 years ago

kmubiin commented 7 years ago

Store contracting record that complies to Open Contracting Data Standard (OCDS) from JSON file into SQLite database or MongoDB record file. The following would be the steps to get started.

  1. Mock data shall be prepared in JSON file format using random data generator website. JSON file format must be used to include nested elements of record, which is found in contracting records.

  2. Read mock data from JSON file into SQLite database or MongoDB record file.

  3. Confirm the mock data have been saved in SQLite database or MongoDB record file.

For the last step, confirm by using print command to see output in terminal emulator. Subsequent actions shall be created as a new issue for each.

REMARKS Don't use random data generator website i.e. Mockaroo, generatedata.com, unless one could clarify that the data structure is indeed correct. Either use real sample data, if available, or prepare small set of dummy data manually.

IMPORTANT Always commit code to the 'testing' branch to test the methods.

kmubiin commented 7 years ago

I shouldn't have relied on the random data generator website to prepare the dummy data. The resulting JSON data that contain nested elements had wrong data structure. Ended up confusing myself for many hours between 'list' and 'dictionary' types.

Finally got things right. All three steps above have been achieved on my local machine. Working example code with dummy data will be uploaded by today.

kmubiin commented 7 years ago

Note that dummy data is only to test storing nested elements of JSON data into SQLite database. Therefore, the first working code is not intended to be run against actual data.

Some modifications are required to parse the actual data that complies to OCDS.

REMARKS Use actual nested elements of JSON data from Popit API, then try to save all information of several persons in the SQLite database file. First working code doesn't demonstrate this yet.

kmubiin commented 7 years ago

This commit https://github.com/Sinar/telus/commit/02880c14cdcd8158114d7e1f62b51c60d8c7254b has included 'test4.py', which shows how to read JSON data, save/read using SQLite database, then output as JSON data again.

In this attempt, only tried using dummy record. However, the dummy record uses compatible fields of actual nested elements of JSON data from Popit API. So far this attempt is closest to real.

Next is to try out other variations:

  1. using JSON file (easy)
  2. query actual nested elements of JSON data from Popit API (relatively easy but deceptive)
  3. combine data from both 1 and 2 and return as combined JSON literal (huh)

The last variation is logical and somewhat makes sense, but yet to try with code.

UPDATE Wait a minute... There is no reason to combine data from both 1 and 2, since the result from both are using compatible fields (duplicate fields with different values). Then, must use a different dummy record that uses different fields that are not used by JSON data from Popit API.

REMARKS First two variations shall be uploaded as 'test4-1.py' and 'test4-2.py'. Last variation shall be tested as different code i.e. 'test5-N.py' and follow up as different issue.

kmubiin commented 7 years ago

As 2017.08.07, data is either to be stored in MongoDB record or SQLite database. The main choice is MongoDB record, which stores data as BSON (not JSON).

Have to redo at least 'test4.py' using MongoDB.

UPDATE Next is to use JSON data that contain several nested dictionaries and figure out how to store one by one or all at once to MongoDB record, then output again those several nested dictionaries. Huh.

kmubiin commented 7 years ago

In this issue, there are two things to concern about:

  1. Whether data has been contained in valid JSON literal
  2. Whether data should be stored "as it is" or filter out some information.

Regarding point 1, imagine you have downloaded the data as JSON file that weigh 6MB. Then, you simply wanted to view JSON in web browser. But instead you saw this error:

SyntaxError: JSON.parse: expected ':' after property name in object at line 40042 column 53 of the JSON data

UPDATE The SyntaxError above was apparently a bug in Firefox 54.x release on Linux. The same JSON file is now rendered without error in Firefox 55.0 release on Linux.

There is a need to include code to do basic validation for data provided by JSON files. That is, to throw exception whenever JSON file is found to be improper.

kmubiin commented 7 years ago

As 2017.08.17, I am still stucked at this issue. The fact is, "obvious" things that beginning users should know are just not well-explained on the web. I finally realized after I did a lot of trials and errors.

These are two "obvious" things:

  1. Don't use mock data.

    Use real sample data, if you must. However, such small set of data is difficult to be found readily. The simplicity of mock data will mislead you to use simple approach that likely won't work with actual data.

    As a result, you might end up using using conventional approach i.e. nested loops to access the objects, that flawlessly works for given the small set of data.

    But nested loops approach is not meant to deal with larger set of data. Instead, use database capabilities i.e. database API to filter and access the data.

  2. Examine JSON file beforehand.

    A JSON file, even it contain nested list of dictionaries, is always treated as whole when importing into document database i.e. MongoDB. The following content of JSON file is treated as one object.

    {
        "field_a": "",
        "field_b": [ {"object_1":{}}, {"object_2":{}}, {"object_3":{}} ],
        "field_c": "",
        "field_d": ""
    }

    Despite JSON file contains several objects under "field_b" element, because of these top-level elements, the entire file including the nested list of objects would be seen as one object.

    The following content of JSON file is treated as several objects.

    [ { "object_1":{} }, { "object_2":{} }, { "object_3":{} } ]

    The nested list of objects are now exposed at top-level, which will allow each objects to be imported as separate objects into the database.

P.S.: I should do a short write-up about this, using real sample data and commands that I had tried.

P.S.S.: Or no write-up even this time. If I can't reproduce the understanding by myself, there is no point of write-up. Things are more complicated than I thought initially.

kmubiin commented 7 years ago

The de-facto standard file format to store files in ./data directory is JSONL.

JSONL files contain multiple JSON objects in multiple lines, which places one JSON object per line. Despite arranged like a list, JSONL files use only newlines to separate the JSON objects.

{"string_1":  "one"}
{"string_2":  "two"}
{"string_3":  "three"}

In contrast, regular JSON files contain JSON objects that are either nested in a top-level list (Sample JSON file 1) or in a top-level object that contains a list in the value (Sample JSON file 2).

Sample JSON file 1:

[
    {"string_1":  "one"},
    {"string_2":  "two"},
    {"string_3":  "three"}
]

Sample JSON file 2:

{
    "data": [
        {"string_1":  "one"},
        {"string_2":  "two"},
        {"string_3":  "three"}
    ]
}

JSONL files by itself are not valid JSON files. Sample JSON file 1 and file 2 are valid JSON files, which can be rendered directly as JSON in a Web browser.

If the data to be stored were found as other file format i.e. CSV, then write a separate script to convert those files into JSONL file format.

The script may not be part of modules and main script because of the one-off usage to convert the files. Instead, upload script to the repository in ./scripts directory (not created yet to this date).

kmubiin commented 7 years ago

This issue was originally created based on Trello card given to me. Now deprecated.