MobilityNet / mobilitynet.github.io

BSD 3-Clause "New" or "Revised" License

Flatten data from JSON to a relational format #26

Open singhish opened 3 years ago

singhish commented 3 years ago

As it currently stands, the data is spread out across multiple folders in JSON format in the following structure:

    out_dir
    └── user
        └── spec_id
            └── key
                └── {start_ts}_{end_ts}.json
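A minimal sketch of walking this layout with `pathlib`, assuming the timestamps in the filename contain no underscores (the field names in the yielded record are hypothetical):

```python
from pathlib import Path

def list_dump_files(out_dir):
    """Yield one record per dump file in the out_dir/user/spec_id/key
    layout above.  Assumes filenames look like {start_ts}_{end_ts}.json
    and that start_ts itself contains no underscore."""
    for path in Path(out_dir).glob("*/*/*/*.json"):
        user, spec_id, key = path.parts[-4:-1]
        start_ts, end_ts = path.stem.split("_", 1)
        yield {"user": user, "spec_id": spec_id, "key": key,
               "start_ts": float(start_ts), "end_ts": float(end_ts),
               "path": str(path)}
```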

Note that all of the PhoneView-related data is in the following format:

    {
        "_id": {
            "$oid": "..."
        },
        "metadata": {
            "key": "...",
            "platform": "...",
            "read_ts": ...,
            "time_zone": "...",
            "type": "...",
            "write_ts": ...
        },
        "user_id": {
            "$uuid": "..."
        },
        "data": {
            <keys are dependent on key specified in metadata>
        }
    }

Thus, I propose a series of Pandas DataFrames with the following columns:

1) A DataFrame consisting of metadata

2) A DataFrame for each key. The fields here correspond to the fields in the data sub-object of the PhoneView data files. For instance, a DataFrame for the background/battery key for android devices can have these columns:

Given the amount of data involved, though, Pandas DataFrames alone might prove insufficient and clunky. A SQLite database might be more appropriate, since features such as primary/foreign keys (a role _id can fill) could be used to better organize the data as a whole.
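A minimal sketch of how one entry could be flattened into a row of such a dataframe, assuming the entry format above (the metadata_ prefix and the example data field battery_level_pct are hypothetical choices, not the actual schema):

```python
import pandas as pd

# Hypothetical sketch: one flat row per PhoneView entry, with metadata
# fields prefixed so they cannot collide with the key-dependent fields
# under "data".
def entry_to_row(entry):
    row = {"_id": entry["_id"]["$oid"], "user_id": entry["user_id"]["$uuid"]}
    row.update({"metadata_" + k: v for k, v in entry["metadata"].items()})
    row.update(entry["data"])  # key-dependent fields, e.g. battery_level_pct
    return row

# Building a per-key dataframe is then a one-liner:
# battery_df = pd.DataFrame(entry_to_row(e) for e in battery_entries)
```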

When all is said and done, we would end up with these tables:

that would encompass all our data.

Thoughts? @shankari

shankari commented 3 years ago

There are multiple JSON files; are you proposing a single dataframe or multiple dataframes for each type? e.g. for battery, I count 8 JSON files

$ find  ~/e-mission/mobilitynet-analysis-scripts/bin/data/ -name \*battery\* | wc -l
       8

Will there be 8 dataframes or one? And if there will be one, how do you propose to capture the differences between the different JSON files?

Concretely, why didn't we just save all the JSON files to one giant battery.json? If we needed multiple JSON files, why don't we need multiple csv files/dataframes?

Note also that creating csv files is only the first part of the puzzle. As I indicated, you also need to be able to load and navigate the csv files - e.g. change the examples in the data exploration template to work with the new structure.
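One way to answer the "8 files vs. one dataframe" question is to build one small dataframe per JSON file, tag it with that file's position in the tree, and concatenate; the per-file differences then survive as ordinary columns. A sketch, with a hypothetical context dict:

```python
import pandas as pd

# Sketch: tag each per-file dataframe with its provenance (user,
# spec_id, time range, ...) before concatenating, so a single
# battery_df can stand in for the eight separate files.
def combine(frames_with_context):
    tagged = [df.assign(**ctx) for df, ctx in frames_with_context]
    return pd.concat(tagged, ignore_index=True)
```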

shankari commented 3 years ago

Just to clarify, I think that your overall direction is correct. But you need to think through the details. The trajectory evaluation notebook can actually give you a hint

shankari commented 3 years ago

creating csv representations of the files is good, but I am not sure it adds that much to the ease of exploration because people still have to navigate the tree to access individual data frames. In general, data scientists don't seem to like to work with trees - they don't know what the keys are, etc.

They want to load everything into a dataframe and then just work with dataframes forever. So I think that the flattening requirement is actually driven by the need to load a dataframe at the beginning of the notebook and then use it forever.

singhish commented 3 years ago

My initial understanding of this task was to re-represent the data in tabular format, as this would make it easier for users to deal with than raw JSON files. I initially thought the reason you mentioned using Pandas DataFrames was that they are inherently tabular.

This would for sure change how the data is currently accessed, but I believe that having the data tabularized would make it easier to navigate. For example, you would be able to replace the tree traversals with simple queries.

shankari commented 3 years ago

For example, you would be able to replace the tree traversals with simple queries.

This is exactly what I had in mind, but I am not sure your proposed solution actually fully supports this. Concretely, in the data exploration template, "Work with a single trip" section, we traverse the tree to display the battery drain comparison for a single trip. How do you propose we do that with the battery_df or battery.csv structure?

If you want hints on how I did this before myself, look at the trajectory evaluation notebooks, in which I did create a dataframe of the locations, but with a few small additions.
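As an illustration of what such a query could look like if the flattened battery_df carried the tree path as columns: all column names below are assumptions for the sketch, not the actual schema.

```python
import pandas as pd

# Toy flattened battery_df with hypothetical path columns.
battery_df = pd.DataFrame({
    "phone_os": ["android", "android", "ios"],
    "phone_label": ["ucb-sdb-android-3", "ucb-sdb-android-3",
                    "ucb-sdb-ios-4"],
    "range_index": [2, 1, 2],
    "battery_level_pct": [90, 88, 95],
})

# "Work with a single trip" becomes a single query instead of a tree walk.
single_trip = battery_df.query(
    "phone_os == 'android' and phone_label == 'ucb-sdb-android-3' "
    "and range_index == 2"
)
```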

singhish commented 3 years ago

I spent today studying Data_exploration_template and trajectory_evaluation to get an idea of how to answer this question. The general way data is accessed for trips is by indexing directly into a portion of the phone view map. For instance, this is how battery_df is obtained in Data_exploration_template:

curr_range = phone_detail_map["evaluation_ranges"][2]
...
battery_df = curr_range["battery_df"]

and likewise, in trajectory_evaluation, this is done directly in the test cases by picking out a section:

"section": pv_ucb.map()["android"]["ucb-sdb-android-3"]["evaluation_ranges"][0]["evaluation_trip_ranges"][0]["evaluation_section_ranges"][2]

Also, a (probably obvious) observation is that the dfs are stored in the phone view maps and are referenced directly to build up the spatial error dfs (e.g. the location_df key is referenced directly in get_spatial_errors => sr['location_df'])

Not sure if this is on the right track, but maybe each of these dfs can be appended upon each other during the tree traversals to directly return data in .csv/tabular format? That approach is a lot more straightforward and works better with the current tree traversals than the original approach I suggested.
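A sketch of the appending idea: during the existing tree traversal, collect each per-range dataframe together with the path that led to it, then concatenate once at the end (repeatedly concatenating inside the loop is quadratic). The map shape and key names below mirror the snippets above but are placeholders, not the real PhoneView API.

```python
import pandas as pd

def collect_battery(phone_view_map):
    """Walk a PhoneView-like map and stack every battery_df, stamping
    each with the path (phone_os, phone_label, range_index) to it."""
    pieces = []
    for phone_os, phones in phone_view_map.items():
        for phone_label, detail in phones.items():
            for i, rng in enumerate(detail["evaluation_ranges"]):
                pieces.append(rng["battery_df"].assign(
                    phone_os=phone_os, phone_label=phone_label,
                    range_index=i))
    return pd.concat(pieces, ignore_index=True)
```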

shankari commented 3 years ago

Not sure if this is on the right track, but maybe each of these dfs can be appended upon each other during the tree traversals to directly return data in .csv/tabular format? That approach is a lot more straightforward and works better with the current tree traversals than the original approach I suggested.

May be on the right track, but could you give some additional details? pseudocode is fine, PR is also fine if you can quickly backtrack from an approach that doesn't work.

Concretely, if you append the dfs to each other, what will the columns look like? how will you implement the "Work with a single trip"? Will you still traverse a tree?

singhish commented 3 years ago

Of course, we would need to add the path to a specific entry as column values in each df to uniquely identify records in the dataframes.

shankari commented 3 years ago

I would suggest the following:

In the upcoming days, we can replace phoneview in the notebooks with the new class. And then we are done.

shankari commented 3 years ago

@singhish One challenge you may run into is that the file directory structure does not currently have all the information required to add the additional columns - e.g.

bin/data//ucb-sdb-ios-4/train_bus_ebike_mtv_ucb/background~battery

which implies that users can't, for example, load only the battery dataframe and work with it.

singhish commented 3 years ago

Added a notebook here that successfully does what we discussed: https://github.com/singhish/mobilitynet-analysis-scripts/blob/flatten_json/bin/Flatten_JSON_to_tabular.ipynb. Let me know your thoughts! I can turn this into a script tomorrow. @shankari

shankari commented 3 years ago

This looks fine to me overall. I assume it won't be a script at the end, but rather, a module (similar to phoneview) that will basically embed the df_map, right?

But I still don't see how you can accomplish the original goal, the "Work with a single trip" section from the data exploration template. Can you elaborate?

shankari commented 3 years ago

One other thing (not sure it is important) is that people may want to load one CSV manually (e.g. only battery or only location), but I guess they can just do that in pandas directly. Let's revisit this after getting this iteration to work fully.

shankari commented 3 years ago

Concretely, the requirement is "I want to pick the third repetition of the berkeley timeline and visualize it (either battery or location)"

As an example of how this works with the trajectory (that I posted to the chat yesterday):

st_errors_df.query("phone_os == 'android' & (quality == 4) & section_id == 'light_rail_below_above_ground'").boxplot(column="error", by="run")

That gives me the android HAMFDC (quality == 4) data for the "light_rail_below_above_ground" leg.

shankari commented 3 years ago

In other words, you have got to the point where you can concatenate dataframes, but then you are just printing them. I want to see how you query the dataframes so that you can replicate the behavior in the data exploration template, where we visualize one trip or one section at a time.

singhish commented 3 years ago

Realized that the way to accomplish this was to append the values for everything in phone_detail_map["evaluation_ranges"]. Working on a solution for this at the moment and will have it ready to be pushed up tomorrow.

shankari commented 3 years ago

To recap what you said: in order to query subsets of data from the dataframe, the dataframe needs columns that represent the edges/nodes along the path that is currently traversed through the PhoneView tree to access the dataframe.
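A generic sketch of that recap: walk the nested map, and whenever a leaf dataframe is found, stamp it with the keys seen along the way. The path_0/path_1/... column names are placeholders.

```python
import pandas as pd

def flatten_tree(node, path=()):
    """Yield every leaf dataframe in a nested dict/list structure,
    with one path_<depth> column per edge on the way down."""
    if isinstance(node, pd.DataFrame):
        stamped = node.copy()
        for depth, key in enumerate(path):
            stamped["path_%d" % depth] = key
        yield stamped
    elif isinstance(node, dict):
        for key, child in node.items():
            yield from flatten_tree(child, path + (key,))
    elif isinstance(node, list):
        for i, child in enumerate(node):
            yield from flatten_tree(child, path + (i,))
```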

shankari commented 3 years ago

I did look into python based tree libraries and found pytree, which would have allowed people to access the dataframes using something like ("/android/ucb_trip/2/HAMFDC/light_rail_underground"), so basically specifying a path through the tree. But it didn't look like that's what data scientists want; what they want is a dataframe that they can query based on those keys instead.

@singhish will look to see if somebody has already come up with a library/sample code to flatten this way

singhish commented 3 years ago

Found a way to recursively generate the primary key. Prepending the key as columns to the dataframes is tougher than expected though. I'll discuss with @shankari further on Monday.

This is what the primary key is, however:

{'$oid',
 '$uuid',
 'device_manufacturer',
 'device_model',
 'device_version',
 'duration',
 'end_ts',
 'eval_common_trip_id',
 'eval_role',
 'eval_role_base',
 'eval_role_run',
 'key',
 'platform',
 'plugin',
 'read_ts',
 'role',
 'spec_id',
 'start_ts',
 'time_zone',
 'trip_id',
 'trip_id_base',
 'trip_run',
 'type'}

shankari commented 3 years ago

I don't think you need the metadata information, in particular:

{'$oid',
 '$uuid',
 'read_ts',
 'time_zone',
 'type'}

The metadata is primarily for queries that don't want or need to know the structure of the data.

shankari commented 3 years ago

@singhish my understanding is that you are currently using the PhoneView to create the dataframes. Is this the long-term plan, or will we eventually deprecate/remove the existing PhoneView?

Hint: I prefer the latter, so am wondering what the long-term strategy for creating the dataframes would be.

shankari commented 3 years ago

@singhish the current version of the code looks fine. Can you please

  1. put it into a module in emeval/input and
  2. change the data exploration template to use it so that we can verify that the generated dataframe and the module interface are consistent with the current exploration?

Once that is done, we can change all the notebooks to use dataframes directly.

We can keep the tree and dataframe access methods around for a while, although if we have time at the end as a stretch goal, we can save the dataframes to csv files for upload as well.

While it would be ideal to remove the PhoneView, as long as the boilerplate at the top of the notebook generates a dataframe we are probably good enough for now. It is way more important to ensure that we can use dataframes consistently in the notebooks so people can understand them better.
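The stretch goal of saving the dataframes out as CSVs could be a short loop; the df_map keyed by (OS, key) and the file naming scheme below are assumptions about the eventual module interface.

```python
import pandas as pd
from pathlib import Path

def save_df_map(df_map, out_dir):
    """Write each per-(phone_os, key) dataframe to <out_dir>/<os>_<key>.csv
    so individual CSVs can be uploaded or loaded on their own."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for (phone_os, key), df in df_map.items():
        df.to_csv(out / ("%s_%s.csv" % (phone_os, key)), index=False)
```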

singhish commented 3 years ago

Dataframes are now successfully being created from a PhoneView map, as demonstrated in the updated Flatten_JSON_to_tabular notebook. I realized my initial recursion-based approach wasn't working after I exported and serialized a full pv.map() output from a train_bus_ebike_mtv_ucb-based PhoneView object to JSON and viewed it in Dadroit, a JSON viewer specifically optimized for massive JSON files.

My issues with the recursion-based method:

  1. The primary key generation algorithm I was using was overkill. The only places *_df keys are found are the calibration_ranges and evaluation_ranges sub-trees, which are 3 levels into a pv.map() output (<phone_os> -> <phone_name> -> <*_ranges>). As such, the primary key only needs to be the following:
    {"role",
    "range_type" (either "calibration" or "evaluation"),
    <each key in an element of a *_ranges list that isn't a dict, list, or pd.DataFrame>}
  2. The primary key generation was picking up keys that were stored as outputs from calls to the E-Mission Server instance made during the data dumping process, such as the output from manual/evaluation_transition. It is difficult to fold these into the already existing *_df objects, as there is a mismatch between the number of elements in the _ranges subfields and the number of records returned from these calls. [screenshot: Screen Shot 2021-04-18 at 4 40 30 PM]

    It might be useful to store separate dataframes for the evaluation_transitions and calibration_transitions, though (@shankari let me know your thoughts). In any case, it would have been impossible to "intelligently" build the dataframes on the fly recursively, due to the size/depth discrepancies deeper into the PhoneView map.

    I plan on wrapping the functionality in the Flatten_JSON_to_tabular notebook into a module located under emeval.input and integrating it into the Data_exploration_template by EOD Monday. This method is more reliant on the PhoneView class than originally intended (I tried recursion first to make the method more general) but is fast and functional. Using this method, 10 dataframes are returned in a dictionary -- 5 iOS (battery, location, filtered_location, motion_activity, transition) and 5 Android (same as iOS). I split the dataframes up by phone OS due to the differences in the columns between the OSes.

Let me know your thoughts @shankari.

shankari commented 3 years ago

To clarify, the phoneview tree has dataframes in multiple levels. I think we have talked about this earlier while indicating that you need to "work with a single trip" or "walk through the tree to find a single leaf node".

In terms of options:

shankari commented 3 years ago

@singhish If I have a trip that lasts from start_ts=t1 to end_ts=t100, and there is an entry (e.g. location or battery or...) with ts=t25, will the row for t25 have the column for trip_range_id filled in, or will it be NA? It should have the trip_range_id.

So you should be able to check this by querying start_ts < ts < end_ts for any trip with (start_ts, end_ts) and confirming that all of the matching rows have a non-NA trip_range_id column.
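A sketch of that check: for a trip spanning (trip_start, trip_end), every sensed row inside that window should already carry a non-NA trip_range_id. The column names here are assumptions about the flattened schema.

```python
import pandas as pd

def rows_missing_trip_range(df, trip_start, trip_end):
    """Return the rows inside the trip's time window whose
    trip_range_id was left NA; an empty result means the check passes."""
    in_trip = df[(df["ts"] > trip_start) & (df["ts"] < trip_end)]
    return in_trip[in_trip["trip_range_id"].isna()]
```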

shankari commented 3 years ago

As a follow-on, if that is true, then if you want to find the ground truth for a trip, you don't want to use all entries whose trip_range_id is not NA.

Concretely, in the example above, if I have entries every second, the dataframe will have ~ 98 entries that all have the same, non-NA, trip_range_id and the same trip_range_start_ts and the trip_range_end_ts. So you will have 98 copies of the ground truth, which is not what we want.

This may be why you are seeing multiple lines, a simple unique() should fix it in that case.
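A sketch of that dedup: every per-second row repeats the same trip-level ground truth, so collapse to one row per trip before using it. The toy dataframe and column names below are assumptions.

```python
import pandas as pd

# ~3 sensed rows all carrying copies of the same trip-level ground truth.
battery_df = pd.DataFrame({
    "ts": [1, 2, 3],
    "trip_range_id": ["trip_0"] * 3,
    "trip_range_start_ts": [1.0] * 3,
    "trip_range_end_ts": [100.0] * 3,
})

# One row per trip instead of one per sensed entry.
trip_gt = battery_df[
    ["trip_range_id", "trip_range_start_ts", "trip_range_end_ts"]
].drop_duplicates()
```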

shankari commented 3 years ago

  1. actually represent all the fields from the original phone view display. This will ensure that we have all the phone view data in the dataframe and will also make the output meaningful instead of looking like a bunch of copied values
  2. figure out why the walk_university shows up at the end instead of in its proper place between the trip to berkeley and the trip back
  3. use get_geojson_for_loc_df instead of get_geojson_for_section to actually display the maps

shankari commented 1 year ago

The PR related to this has been outstanding for over a year now so I have merged it. But it still has a bunch of outstanding cleanup.

PR related to this with lots of unaddressed comments: https://github.com/MobilityNet/mobilitynet-analysis-scripts/pull/57