MobilityNet / mobilitynet.github.io

BSD 3-Clause "New" or "Revised" License

Figure out the JSON file structure to export the data and make it "cloud optimized" #18

Open shankari opened 3 years ago

shankari commented 3 years ago

We want to host this in OEDI, which will use AWS resources. So they basically need an easy-to-access, machine-readable format.

We discussed two formats:

e-mission can already export data from mongodb in JSON format

However, our scripts currently don't load one day's worth of data at a time. Instead, we download the spec, and then use the spec to pull chunks of data across multiple days (that represent multiple repetitions of the timeline) into one notebook for comparison.

We need to:

shankari commented 3 years ago

For CSV, we would need to split out the entries for a particular day into multiple csvs, one for each type of data. We would also need to flatten the object structure. Examples of flattening the object structures are here: https://github.com/e-mission/e-mission-server/blob/master/emission/storage/timeseries/builtin_timeseries.py#L269 which calls https://github.com/e-mission/e-mission-server/blob/master/emission/storage/timeseries/builtin_timeseries.py#L149

This is definitely lower priority than getting the JSON working end to end, but I wanted to document it for the record.
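
To illustrate the kind of flattening involved, here is a minimal sketch using pandas; the entry fields are invented, and this is not the builtin_timeseries code linked above:

import pandas as pd

# hypothetical nested entry, loosely shaped like an e-mission timeseries entry
entry = {
    "metadata": {"key": "background/battery", "write_ts": 1564026015},
    "data": {"battery_level_pct": 87, "ts": 1564026010}
}

# json_normalize flattens the nested objects into dotted column names
# ("metadata.key", "data.ts", ...), which can then be written out as one CSV
# per data type
df = pd.json_normalize([entry])
df.to_csv("background~battery.csv", index=False)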

shankari commented 3 years ago

one potential solution for the JSON is to concatenate the input parameters for every retrieval call to form the filename that has the data.

So if we retrieve data for:

then we create a JSON file called data_manual-evaluation_transition_111111_22222

If we go through and do this for all the API calls that we currently execute, we can then rename retrieve_data_from_server to read_data_from_file and replace the implementation from a REST API call to a file read.
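
A minimal sketch of what the renamed call could look like, assuming the filename is built exactly as in the example above (with the "/" in the key turned into a "-"); the real implementation may differ:

import json

def read_data_from_file(key, start_ts, end_ts):
    # same parameters as retrieve_data_from_server, but instead of a REST API
    # call, build the filename from the parameters and read it from disk,
    # e.g. data_manual-evaluation_transition_111111_22222
    filename = "data_%s_%d_%d" % (key.replace("/", "-"), start_ts, end_ts)
    with open(filename) as fp:
        return json.load(fp)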

shankari commented 3 years ago

we can then move the retrieve_data_from_server code to a script that pulls data from a server and creates files

So two main changes:

shankari commented 3 years ago

the script can be in bin - I think there may be a similar script that pulls data and stores it to a local mongodb. we would want to store to a file instead.

singhish commented 3 years ago

Design Document - File-based Data Retrieval

(bin/dump_data_to_file.py)

Input

Required flags:

Optional flags:

Output

The output of the server call as specified by the input. This will be an output file in the format data_{endpoint}_{start timestamp}_{end timestamp}.json that contains the JSON data returned from the server call.

E.g. if your input flags are --user shankari@eecs.berkeley.edu and --endpoint config/evaluation_spec, the output file will be named data_config~evaluation-spec_0_16593840.json.

Valid Endpoints

Design Choices

It is worth noting that the original data retrieval mechanism in spec_details.py makes use of an object to maintain information about the spec ID and the datastore URL. The spec ID is not needed to retrieve data from the server, so it is omitted from this script.
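
As a sketch of the output naming convention described under Output above (the "/" -> "~" and "_" -> "-" substitutions are inferred from the single example, so treat them as assumptions):

def format_output_filename(endpoint, start_ts, end_ts):
    # assumed mapping: "/" -> "~" and "_" -> "-", so that the endpoint name
    # cannot collide with the "_" separators in the filename itself
    sanitized = endpoint.replace("/", "~").replace("_", "-")
    return "data_%s_%d_%d.json" % (sanitized, start_ts, end_ts)

# format_output_filename("config/evaluation_spec", 0, 16593840)
# -> 'data_config~evaluation-spec_0_16593840.json'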

shankari commented 3 years ago

I would be a little careful about terminology. manual/evaluation_transitions is not really an endpoint in the classic REST API sense (we don't call GET /foo/bar/endpoint), it is a timeseries key that, along with the start and end timestamps, is a parameter to the POST command.

Also, as we discussed today, please have a --all argument that retrieves all the data associated with a spec in the way the current code does.

singhish commented 3 years ago

Spent some time today thinking through this -- is this updated design doc satisfactory? I realized having an --all flag complicates things; we can just control the behavior of the script depending on the parameters that are passed in:


Design Document - File-based Data Retrieval

(bin/dump_data_to_file.py)

Parameters

dump_data_to_file.py can accept optional inputs as command line parameters to control the data being dumped to JSON files. They are classified into independent parameters and co-dependent parameters, the latter of which must all be specified if one is specified.

Independent parameters:

Co-dependent parameters:

Output

dump_data_to_file.py outputs JSON files with the following naming format:

data_{spec id}_{time series key}_{start timestamp}_{end timestamp}.json

The number of JSON files in this format is dependent on the parameters that are passed in. Specifically, a JSON file is output for each combination of spec ID and time series key, in conjunction with the timestamp ranges that are explored at each step of the process. The algorithms to accomplish this are the same as those used in the SpecDetails and PhoneView classes.

shankari commented 3 years ago

I think this is enough to be going on with. have you already started implementing in parallel?

singhish commented 3 years ago

Notes regarding the decision to parse command line arguments using the following pattern:

import argparse
import sys

parser = argparse.ArgumentParser()
if any(arg in sys.argv for arg in ["--key", "--user", "--start-ts", "--end-ts"]):
    parser.add_argument("--key", type=str, required=True)
    parser.add_argument("--user", type=str, required=True)
    parser.add_argument("--start-ts", type=float, required=True)
    parser.add_argument("--end-ts", type=float, required=True)

argparse does provide support for subparsers, as discussed last week. However, the issue with subparsers is that an extra flag must typically be specified for them to work (as seen here: https://docs.python.org/3/library/argparse.html). They are better suited as switches to separate different pieces of functionality within a program, rather than as a means of specifying a subset of intended functionality, as is being done here. This is why I am choosing to implement the additional parameters as done above.

This approach also eliminates the need for an --all flag, as not specifying any of the above four flags would imply "--all" functionality.

singhish commented 3 years ago

A few updates regarding the current implementation of the script:

1) Added an object called Spec to handle the information returned by the initial config/evaluation_spec REST API call. This will probably evolve to some semblance of the original SpecDetails object as I wrap up the phone_view-related calls tomorrow. The important thing to note, however, is that this Spec object decouples API calls from the flow of data, as SpecDetails by design handled both.

2) It is worth noting that E-Mission stores every version of a spec that is uploaded to it. Thus, to handle specs with the same spec_id, I am selecting the most recently uploaded spec based on the value of ["metadata"]["write_ts"].
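
A rough sketch of that selection (spec_entries is a hypothetical list of spec documents returned for a given spec_id):

def get_current_spec(spec_entries):
    # every upload of the spec is retained, so pick the entry with the
    # largest ["metadata"]["write_ts"], i.e. the most recently uploaded one
    return max(spec_entries, key=lambda e: e["metadata"]["write_ts"])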

shankari commented 3 years ago

It would be good to refactor the existing spec details into two very similar classes: 1) one that retrieves data from the server and stores it to files, and 2) one that reads data from files into memory for the notebooks to use.

both of these will have very similar functions, including:

However, I expect that the first script can stop at the evaluation_range level. The second script will go down to the eval_section_range level.

singhish commented 3 years ago

Refactored the script significantly today to accommodate unit testing. Much of this has involved reorganizing the script into functions that represent individual chunks of the pipeline parallel to the function calls occurring in PhoneView. Will keep at adding functionality to it tomorrow.

shankari commented 3 years ago

My suggestion is to heavily re-use the existing phoneview class. Concretely, phoneview already fills in data from the spec details class. The spec details class has retrieval code and a bunch of other code to find and match ground truth.

So we can create an abstract class for SpecDetails which has one function, retrieve_data.
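
A rough sketch of that split (only retrieve_data and the class names come from this discussion; the constructor arguments, endpoint path, payload, and filename scheme below are illustrative assumptions):

import abc
import json
import requests

class SpecDetails(abc.ABC):
    # common spec/ground-truth handling stays here; only retrieval differs
    @abc.abstractmethod
    def retrieve_data(self, user, key_list, start_ts, end_ts):
        ...

class ServerSpecDetails(SpecDetails):
    def __init__(self, datastore_url):
        self.datastore_url = datastore_url

    def retrieve_data(self, user, key_list, start_ts, end_ts):
        # hypothetical endpoint/payload, shown only to indicate where the
        # REST call (and the read_until_done batching) would live
        resp = requests.post(self.datastore_url + "/datastreams/find_entries/timestamp",
                             json={"user": user, "key_list": key_list,
                                   "start_time": start_ts, "end_time": end_ts})
        resp.raise_for_status()
        return resp.json()

class FileSpecDetails(SpecDetails):
    def retrieve_data(self, user, key_list, start_ts, end_ts):
        # read the previously dumped file instead of calling the server;
        # the filename scheme is illustrative
        fn = "data_%s_%s_%d_%d.json" % (user, "-".join(k.replace("/", "-") for k in key_list),
                                        start_ts, end_ts)
        with open(fn) as fp:
            return json.load(fp)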

In the script, you can then use PhoneView to retrieve all the data and then just dump the entries into files. Small aside: there is a rudimentary batching solution in PhoneView (_read_until_done) to work around https://github.com/psf/requests/issues/4937, which should get moved into ServerSpecDetails as part of the refactor.

What am I missing here?

shankari commented 3 years ago

there are two ways to dump the files in the script:

e.g. in fill_battery_df, the entries are stored as battery_entries

    def fill_battery_df(self, storage_key):
        for phoneOS, phone_map in self.phone_view_map.items():
            print("Processing data for %s phones" % phoneOS)
            for phone_label in phone_map:
                curr_calibration_ranges = phone_map[phone_label]["{}_ranges".format(storage_key)]
                for r in curr_calibration_ranges:
                    battery_entries = self.spec_details.retrieve_data_from_server(phone_label, ["background/battery"], r["start_ts"], r["end_ts"])
                    ...
                    r["battery_entries"] = battery_entries
                    ...

You should be able to walk the tree and just dump out the XXX_entries lists. This has the minor advantage that we leave open the option of using ServerSpecDetails in the notebook if we ever go back to making the data available through a server.

shankari commented 3 years ago

Concretely:

Is there a problem with this?

    "for phone_os, phone_map in pv.map().items():\n",
    "    print(15 * \"=*\")\n",
    "    print(phone_os, phone_map.keys())\n",
    "    for phone_label, phone_detail_map in phone_map.items():\n",
    "        print(4 * ' ', 15 * \"-*\")\n",
    "        print(4 * ' ', phone_label, phone_detail_map.keys())\n",
    "        # this spec does not have any calibration ranges, but evaluation ranges are actually cooler\n",
    "        for r in phone_detail_map[\"evaluation_ranges\"]:\n",
    "            print(8 * ' ', 30 * \"=\")\n",
    "            print(8 * ' ',r.keys())\n",
    "            print(8 * ' ',r[\"trip_id\"], r[\"eval_common_trip_id\"], r[\"eval_role\"], len(r[\"evaluation_trip_ranges\"]), arrow.get(r[\"start_ts\"]))\n",

Basically, add a save_json_file line just below the print statement.
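
For example, with a hypothetical save_json_file helper (the filename pattern is illustrative):

import json

def save_json_file(obj, filename):
    # minimal helper; default=str covers values that are not natively
    # JSON-serializable
    with open(filename, "w") as fp:
        json.dump(obj, fp, default=str)

# inside the innermost loop above, just below the existing print statement:
#     save_json_file(r, "data_%s_%s_%s.json" % (phone_os, phone_label, r["trip_id"]))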

shankari commented 3 years ago

@singhish I also don't see "Small aside: there is a rudimentary batching solution" (from https://github.com/MobilityNet/mobilitynet.github.io/issues/18#issuecomment-801174719) addressed in the current PR

shankari commented 3 years ago

at least for transitions_to_ranges, we want to test:

shankari commented 3 years ago

I volunteered to set up the scaffolding for the test harness that @singhish can use for his code. The e-mission server code uses unittest and it has worked pretty well so far.

In the e-mission-server repo, my structure is:

emission
 - tests
 - .....

I don't remember why the tests are not in a separate top-level directory. I double checked the repo, and this has been true since the initial push https://github.com/e-mission/e-mission-server/commit/118e5f4d9c2679965770ecb7b624c38d7a7df4ad

Let's keep the tests in a separate top-level directory this time, and use it as a template to go back and fix the old code.

shankari commented 3 years ago

Doesn't look too hard - just need to set the PYTHONPATH correctly https://stackoverflow.com/questions/1896918/running-unittest-with-typical-test-directory-structure

Couple of design thoughts:

Going to stick with unittest for now
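
A minimal version of that scaffolding, with the tests in a top-level directory (the file name is a placeholder):

tests
 - test_transitions_to_ranges.py

# tests/test_transitions_to_ranges.py (placeholder name)
import unittest

class TestTransitionsToRanges(unittest.TestCase):
    def test_placeholder(self):
        # replace with real assertions over transitions_to_ranges output
        self.assertTrue(True)

if __name__ == "__main__":
    unittest.main()

Run from the repo root with python -m unittest discover tests; an empty tests/__init__.py keeps discovery happy on older Python versions, and if you invoke the tests from elsewhere, set PYTHONPATH to the repo root as described in the StackOverflow link above.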

singhish commented 3 years ago

As mentioned over Teams -- note to work through design decision regarding how to name the key folders that are dumped while traversing a PhoneView map

shankari commented 3 years ago

ok, can you repeat the challenges here for context? Not everybody will have access to the discussion over Teams.

singhish commented 3 years ago

Yes -- so as things currently stand, the key directory that gets dumped during the PhoneView map traversal section of the script is a key of the form XXX_entries. The issue is that these keys get passed into FileSpecDetails's retrieve_data function, which PhoneView calls to gather the requisite data; this was originally done with the old retrieve_data_from_server function in the original SpecDetails class. However, the key_lists that were originally being passed in corresponded to time series keys, which are passed as an argument in POST calls to the e-mission server. Thus, a design decision needs to be made regarding the naming of the key folders that are dumped.

shankari commented 3 years ago

the solution is fairly simple. You just need to map the phoneview keys to the REST API keys.

The obvious way to do this is to change the keys in the phone view so that the mapping can be determined programmatically - e.g. instead of location_entries, have the key be background_location_entries

Then you can have the file name during the load/save be something like "/".join(pv_key.split("_")[:-1])

>>> "/".join("background_location_entries".split("_")[:-1])
'background/location'

An alternative is to have a map between the PhoneView entry names and the REST API keys - e.g. location_entries -> background/location - and have the PhoneView (through a function) return the correct key depending on the type of the SpecDetails. This is suboptimal since we would need to maintain that map.

A third alternative is indeed to dump the data in the retrieve_data_from_server call. This is also suboptimal because it is not modular - we retrieve and dump as part of the same call.

shankari commented 3 years ago

The chosen solution was the first one, with the caveat that the keys would be listed as constants at the top of the file. @singhish can you confirm?

Note that you can also add the dump code into retrieve_data_from_server as a stopgap measure to ensure that the phoneview based dump actually does dump all the entries properly.

Please remove the stopgap before checking in.

singhish commented 3 years ago

@shankari yes. constants still need to be added but the first solution has been implemented. the stopgap has now been removed.

singhish commented 3 years ago

FileSpecDetails and PhoneView mostly work now, but I am running into problems with the read_until_done function -- after double-checking that both evaluation_ranges and calibration_ranges were being dumped by the script, data for start_ts=1564026015 and end_ts=1564026017 for key=background/location and user=ucb-sdb-android-1 is showing up as nonexistent for train_bus_ebike_mtv_ucb. Will address this tomorrow in our meeting @shankari

shankari commented 3 years ago

This is because the current retrieve_data_from_server retrieves one batch of data. The server limits responses to 10,000 entries. So at least for the accuracy controls, we may need to pull multiple batches from the server.

This is not a consideration for the file, since we dump all the entries (after batching) to one file and the file read does not have/require a batching mechanism. Note that this is another advantage of using the principled approach for dumping data, because otherwise we would have ended up with multiple batches of files dumped, which would have been more confusing to the user.

To fix this, you need to call read_until_done only for the ServerSpecDetails as outlined in https://github.com/MobilityNet/mobilitynet-analysis-scripts/pull/54#discussion_r598304353
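
A sketch of that batching pattern (the 10,000-entry limit comes from the comment above; the field used to resume, and how duplicates at the boundary are handled, are assumptions rather than the actual read_until_done code):

def read_until_done(retrieve_batch, start_ts, end_ts, batch_limit=10000):
    # retrieve_batch(start_ts, end_ts) stands in for one server call that
    # returns entries sorted by timestamp, capped at batch_limit entries
    all_entries = []
    curr_start = start_ts
    while True:
        batch = retrieve_batch(curr_start, end_ts)
        all_entries.extend(batch)
        if len(batch) < batch_limit:
            return all_entries
        # more data remains: resume from the last entry we received; the real
        # implementation also has to avoid duplicating entries that share
        # this boundary timestamp
        curr_start = batch[-1]["data"]["ts"]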