shankari opened this issue 3 years ago
For CSV, we would need to split out the entries for a particular day into multiple csvs, one for each type of data. We would also need to flatten the object structure. Examples of flattening the object structures are here: https://github.com/e-mission/e-mission-server/blob/master/emission/storage/timeseries/builtin_timeseries.py#L269 which calls https://github.com/e-mission/e-mission-server/blob/master/emission/storage/timeseries/builtin_timeseries.py#L149
This is definitely lower priority than getting the JSON working end to end, but I wanted to document it for the record.
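For illustration, here is a minimal sketch of the kind of flattening involved -- it mirrors the idea of the linked `builtin_timeseries` code rather than its exact implementation, and the sample entry structure is illustrative:

```python
# Flatten a nested entry into dot-separated, CSV-friendly columns.
def flatten(obj, prefix=""):
    flat = {}
    for k, v in obj.items():
        key = "%s.%s" % (prefix, k) if prefix else k
        if isinstance(v, dict):
            flat.update(flatten(v, key))
        else:
            flat[key] = v
    return flat

# flatten({"data": {"ts": 123, "loc": {"lat": 37.4, "lon": -122.1}}})
# -> {'data.ts': 123, 'data.loc.lat': 37.4, 'data.loc.lon': -122.1}
```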
one potential solution for the JSON is: concatenate the input parameters for every retrieval call to form the filename that holds the data. So if we retrieve data for `manual/evaluation_transition`, then we create a JSON file called `data_manual-evaluation_transition_111111_22222`.

- we go through and do this for all API calls that we currently execute
- we can then rename `retrieve_data_from_server` to `read_data_from_file` and replace the implementation from a REST API call to a file read
- we can then move the `retrieve_data_from_server` code to a script that pulls data from a server and creates files
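A minimal sketch of that naming scheme, assuming the slash-to-dash translation implied by the example filename above (the helper name is hypothetical):

```python
# Build the dump filename from the retrieval parameters.
def dump_filename(key, start_ts, end_ts):
    flat_key = key.replace("/", "-")  # "manual/evaluation_transition" -> "manual-evaluation_transition"
    return "data_%s_%s_%s.json" % (flat_key, start_ts, end_ts)

# dump_filename("manual/evaluation_transition", 111111, 22222)
# -> 'data_manual-evaluation_transition_111111_22222.json'
```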
So two main changes:

- the script can be in bin
- I think there may be a similar script that pulls data and stores it to a local mongodb. we would want to store to a file instead.

Script: `bin/dump_data_to_file.py`
Required flags:

- `--user`: the user string with which to retrieve data. This could be either shankari@eecs.berkeley.edu or a device name, depending on the endpoint.
- `--endpoint`: the endpoint to retrieve data from, e.g. `manual/evaluation_transitions`.

Optional flags:

- `--datastore-url`: the location where E-Mission server is running. This defaults to a local instance running at http://localhost:8080, though this could also be a public URL.
- `--start-ts`: the starting timestamp from which data is queried. The default value here is 0.
- `--end-ts`: the ending timestamp to which data is queried. The default value is the result of `arrow.get().timestamp`.

The output is the result of the server call specified by the inputs. This will be an output file in the format `data_{endpoint}_{start timestamp}_{end timestamp}.json` that contains the JSON data returned from the server call. E.g. if your input flags are `--user shankari@eecs.berkeley.edu` and `--endpoint config/evaluation_spec`, the output file will be named `data_config~evaluation-spec_0_16593840.json`.
The timeseries keys currently retrieved are:

- `manual/evaluation_transition`
- `background/battery`
- `background/motion_activity`
- `background/location`
- `background/filtered_location`
- `statemachine/transition`
It is worth noting that the original data retrieval mechanism in `spec_details.py` makes use of an object to maintain information about the spec ID and the datastore URL. The spec ID is not needed to retrieve data from the server, so it is omitted from this script.
I would be a little careful about terminology. `manual/evaluation_transitions` is not really an endpoint in the classic REST API sense (we don't call `GET /foo/bar/endpoint`); it is a timeseries key that, along with the start and end timestamps, is a parameter to the POST command.

Also, as we discussed today, please have a `--all` argument that retrieves all the data associated with a spec in the way the current code does.
Spent some time today thinking through this -- is this updated design doc satisfactory? I realized having an `--all` flag complicates things; we can just control the behavior of the script depending on the parameters that are passed in:

Script: `bin/dump_data_to_file.py`

`dump_data_to_file.py` can accept optional inputs as command line parameters to control the data being dumped to JSON files. They are classified into independent parameters and co-dependent parameters, the latter of which must all be specified if one is specified.
Independent parameters:

- `--datastore-url`: the location where E-Mission Server is running. This defaults to a local instance running at http://localhost:8080, though this could also be a public URL.
- `--spec-id`: the spec ID to pull data for. If not specified, data for every spec ID associated with the E-Mission Server instance being used is retrieved.

Co-dependent parameters:

- `--key`: the specific time series key to pull data from via E-Mission Server's REST API interface. If not specified, the script will pull data from every single time series key, as done during the initialization of a `PhoneView` object.
- `--user`: the user string with which to pull data. This parameter is dependent on the time series key, and consequently must be specified if `--key` is specified.
- `--start-ts`: the starting timestamp from which data is queried.
- `--end-ts`: the ending timestamp to which data is queried.

`dump_data_to_file.py` outputs JSON files with the following naming format:

`data_{spec id}_{time series key}_{start timestamp}_{end timestamp}.json`

The number of JSON files in this format depends on the parameters that are passed in. Specifically, a JSON file is output for each combination of spec ID and time series key, in conjunction with the timestamp ranges that are explored at each step of the process. The algorithms to accomplish this are the same as those used in the `SpecDetails` and `PhoneView` classes.
I think this is enough to be going on with. Have you already started implementing in parallel?
Notes regarding the decision to parse command line arguments using the following pattern:

```python
import argparse
import sys

parser = argparse.ArgumentParser()
# only require the co-dependent flags if any one of them was specified
if any(arg in sys.argv for arg in ["--key", "--user", "--start-ts", "--end-ts"]):
    parser.add_argument("--key", type=str, required=True)
    parser.add_argument("--user", type=str, required=True)
    parser.add_argument("--start-ts", type=float, required=True)
    parser.add_argument("--end-ts", type=float, required=True)
```
`argparse` does provide support for subparsers, as discussed last week. However, the issue with subparsers is that an extra positional argument must typically be specified for them to work (as seen here: https://docs.python.org/3/library/argparse.html). They are better suited as switches to separate different pieces of functionality within a program, rather than as a means of specifying a subset of intended functionality, as is being done here. This is why I am choosing to implement the additional parameters as done above.

This approach also eliminates the need for an `--all` flag, as not specifying any of the above four keys would assume `--all` functionality.
A few updates regarding the current implementation of the script:

1) Added an object called `Spec` to handle the information returned by the initial `config/evaluation_spec` REST API call. This will probably evolve into some semblance of the original `SpecDetails` object as I wrap up the `phone_view`-related calls tomorrow. The important thing to note, however, is that this `Spec` object decouples API calls from the flow of data, as `SpecDetails` by design handled both.

2) It is worth noting that E-Mission stores every version of a spec that is uploaded to it. Thus, to handle specs with the same `spec_id`, I am selecting the most recently uploaded spec based on the value of `["metadata"]["write_ts"]`.
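For the record, a minimal sketch of that selection -- the `spec_entries` name is illustrative, the field path is the one named above:

```python
# Pick the most recently uploaded spec among entries sharing a spec_id,
# using the metadata write timestamp.
latest_spec = max(spec_entries, key=lambda e: e["metadata"]["write_ts"])
```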
It is good to refactor the existing spec details into two very similar classes:

1) retrieves data from server and stores to files
2) retrieves data from files into memory for notebooks to read from

both of these will have very similar functions, including:

However, I expect that the first script can stop at the `evaluation_range` level. The second script will go down to the `eval_section_range` level.
Refactored the script significantly today to accommodate unit testing. Much of this has involved reorganizing the script into functions that represent individual chunks of the pipeline, parallel to the function calls occurring in `PhoneView`. Will keep adding functionality to it tomorrow.
My suggestion is to heavily re-use the existing phoneview class. Concretely, phoneview already fills in data from the spec details class. The spec details class has retrieval code and a bunch of other code to find and match ground truth.

So we can create an abstract class for SpecDetails which has one function, `retrieve_data`. In the script, you can then use PhoneView to retrieve all the data and then just dump the entries into files (see the sketch below).
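A minimal sketch of that split, assuming the signature used elsewhere in this thread (user, key list, timestamps); everything other than the class names and `retrieve_data` is illustrative:

```python
from abc import ABC, abstractmethod

class SpecDetails(ABC):
    """Shared spec parsing / ground-truth matching code lives here."""

    @abstractmethod
    def retrieve_data(self, user, key_list, start_ts, end_ts):
        """Return the list of entries for the given user/keys/time range."""

class ServerSpecDetails(SpecDetails):
    def retrieve_data(self, user, key_list, start_ts, end_ts):
        ...  # POST to the datastore, as the current retrieve_data_from_server does

class FileSpecDetails(SpecDetails):
    def retrieve_data(self, user, key_list, start_ts, end_ts):
        ...  # read previously dumped JSON files instead of hitting the server
```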
Small aside: there is a rudimentary batching solution in PhoneView (`_read_until_done`) to work around https://github.com/psf/requests/issues/4937 which should get moved into ServerSpecDetails as part of the refactor.

What am I missing here?
there are two ways to dump the files in the script:

- have `ServerSpecDetails` dump the files at the end of every call. This is not terrible, since we do not anticipate anything other than the script using `ServerSpecDetails.retrieve_data_from_server`.
- use the consistent structure of `fillXXX` and `XXX_entries` in phone view -- e.g. in `fill_battery_df`, the entries are stored as `battery_entries`:
```python
def fill_battery_df(self, storage_key):
    for phoneOS, phone_map in self.phone_view_map.items():
        print("Processing data for %s phones" % phoneOS)
        for phone_label in phone_map:
            curr_calibration_ranges = phone_map[phone_label]["{}_ranges".format(storage_key)]
            for r in curr_calibration_ranges:
                battery_entries = self.spec_details.retrieve_data_from_server(phone_label, ["background/battery"], r["start_ts"], r["end_ts"])
                ...
                r["battery_entries"] = battery_entries
                ...
```
you should be able to walk the tree and just dump out the `XXX_entries` lists.

This has the minor advantage that we leave open the option of using `ServerSpecDetails` in the notebook if we ever go back to making the data available through a server.
Concretely:

- each `evaluation_range` or `calibration_range` will have `XXX_entries` (e.g. `battery_entries`, `location_entries`, etc.) as fields in them.
- while walking the tree, you will have the `phone_os`, `phone_label`, and `evaluation_range` details (including `id`, `role`, `start_ts` and `end_ts`) available.

Is there a problem with this?
"for phone_os, phone_map in pv.map().items():\n",
" print(15 * \"=*\")\n",
" print(phone_os, phone_map.keys())\n",
" for phone_label, phone_detail_map in phone_map.items():\n",
" print(4 * ' ', 15 * \"-*\")\n",
" print(4 * ' ', phone_label, phone_detail_map.keys())\n",
" # this spec does not have any calibration ranges, but evaluation ranges are actually cooler\n",
" for r in phone_detail_map[\"evaluation_ranges\"]:\n",
" print(8 * ' ', 30 * \"=\")\n",
" print(8 * ' ',r.keys())\n",
" print(8 * ' ',r[\"trip_id\"], r[\"eval_common_trip_id\"], r[\"eval_role\"], len(r[\"evaluation_trip_ranges\"]), arrow.get(r[\"start_ts\"]))\n",
Basically, add a `save_json_file` line just below the `print` statement.
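For instance, a hedged sketch of that addition to the inner loop above -- `save_json_file` is named in the comment, but its signature and the filename pattern here are assumptions:

```python
# the same inner loop as above, with the dump added
for r in phone_detail_map["evaluation_ranges"]:
    print(8 * ' ', 30 * "=")
    print(8 * ' ', r.keys())
    print(8 * ' ', r["trip_id"], r["eval_common_trip_id"], r["eval_role"], len(r["evaluation_trip_ranges"]), arrow.get(r["start_ts"]))
    for field, entries in r.items():
        if field.endswith("_entries"):  # dump every XXX_entries list in this range
            save_json_file("data_%s_%s_%s_%s.json" % (phone_label, field, r["start_ts"], r["end_ts"]), entries)
```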
@singhish I also don't see "Small aside: there is a rudimentary batching solution" (from https://github.com/MobilityNet/mobilitynet.github.io/issues/18#issuecomment-801174719) addressed in the current PR
at least for `transitions_to_ranges`, we want to test:
I volunteered to set up the scaffolding for the test harness that @singhish can use for his code.
The e-mission server code uses `unittest` and it has worked pretty well so far.
In the e-mission-server repo, my structure is:

```
emission
 - tests
 - .....
```
I don't remember why the tests are not in a separate top-level directory. I double checked the repo, and this has been true since the initial push https://github.com/e-mission/e-mission-server/commit/118e5f4d9c2679965770ecb7b624c38d7a7df4ad
Let's keep the tests in a separate top-level directory this time, and use it as a template to go back and fix the old code.
Doesn't look too hard - just need to set the `PYTHONPATH` correctly:
https://stackoverflow.com/questions/1896918/running-unittest-with-typical-test-directory-structure
Couple of design thoughts:

- `python -m unittest` will just work in `python3`, which seems pretty straightforward

Going to stick with unittest for now.
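A minimal sketch of what a test module in a top-level tests directory could look like, assuming the layout from the linked StackOverflow answer; the file, class, and test names are illustrative:

```python
# tests/test_transitions_to_ranges.py
# Run from the repo root with: PYTHONPATH=. python -m unittest discover tests
import unittest

class TestTransitionsToRanges(unittest.TestCase):
    def test_pairs_up_start_and_end_transitions(self):
        # placeholder assertion; real tests would exercise transitions_to_ranges
        self.assertEqual(1 + 1, 2)

if __name__ == "__main__":
    unittest.main()
```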
As mentioned over Teams -- note to work through the design decision regarding how to name the `key` folders that are dumped while traversing a `PhoneView` map.
ok, can you repeat the challenges here for context? Not everybody will have access to the discussion over Teams.
Yes -- so as things currently stand, the `key` directory that gets dumped during the `PhoneView` map traversal section of the script is a key in the form `XXX_entries`. The issue here is that these keys get passed into `FileSpecDetails`'s `retrieve_data` function, which is called by `PhoneView` to gather the requisite data, as was originally done with the old `retrieve_data_from_server` function in the original SpecDetails class. However, the `key_list`s that were originally being passed in corresponded to time series keys that are passed in as an argument during POST calls to emission server. Thus, a design decision needs to be made regarding the naming of the `key` folders that are dumped.
the solution is fairly simple. You just need to map the phoneview keys to the REST API keys.

The obvious way to do this is to change the keys in the phone view so that the mapping can be determined programmatically -- e.g. instead of `location_entries`, have the key be `background_location_entries`. Then you can have the file name during the load/save be something like `"/".join(pv_key.split("_")[:-1])`:

```python
>>> "/".join("background_location_entries".split("_")[:-1])
'background/location'
```
An alternative is to have a map between the PhoneView entry names and the REST API keys -- e.g. `location_entries -> background/location` -- and have the PhoneView (through a function) return the correct key depending on the type of the SpecDetails. This is suboptimal since we would need to maintain that map.

A third alternative is indeed to dump the data in the `retrieve_data_from_server` call. This is also suboptimal because it is not modular -- we retrieve and dump as part of the same call.
The chosen solution was the first one, with the caveat that the keys would be listed as constants at the top of the file. @singhish can you confirm?
Note that you can also add the dump code into `retrieve_data_from_server` as a stopgap measure to ensure that the phoneview based dump actually does dump all the entries properly. Please remove the stopgap before checking in.
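A hedged sketch of the chosen approach with the keys as top-level constants; the constant names and the helper are illustrative, and the derivation is the `split`/`join` shown above:

```python
# phoneview keys chosen so that the REST API key can be derived programmatically
BATTERY_ENTRIES_KEY = "background_battery_entries"          # -> background/battery
LOCATION_ENTRIES_KEY = "background_location_entries"        # -> background/location
TRANSITION_ENTRIES_KEY = "statemachine_transition_entries"  # -> statemachine/transition

def pv_key_to_rest_key(pv_key):
    # drop the trailing "entries" component and rejoin the rest with "/"
    return "/".join(pv_key.split("_")[:-1])
```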
@shankari yes. Constants still need to be added, but the first solution has been implemented. The stopgap has now been removed.
`FileSpecDetails` and `PhoneView` mostly work now, but running into problems with the `read_until_done` function -- after double checking that both `evaluation_ranges` and `calibration_ranges` were being dumped by the script, data for `start_ts=1564026015` and `end_ts=1564026017` for `key=background/location` and `user=ucb-sdb-android-1` is showing up as nonexistent for `train_bus_ebike_mtv_ucb`. Will address this tomorrow in our meeting @shankari
This is because the current `retrieve_data_from_server` retrieves one batch of data. The server limits responses to 10,000 entries. So at least for the accuracy controls, we may need to pull multiple batches from the server.
This is not a consideration for the file, since we dump all the entries (after batching) to one file and the file read does not have/require a batching mechanism. Note that this is another advantage of using the principled approach for dumping data, because otherwise we would have ended up with multiple batches of files dumped, which would have been more confusing to the user.
To fix this, you need to call `read_until_done` only for the `ServerSpecDetails`, as outlined in https://github.com/MobilityNet/mobilitynet-analysis-scripts/pull/54#discussion_r598304353:

- rename `retrieve_data` of `ServerSpecDetails` to `retrieve_one_batch`
- `read_until_done` would then become `retrieve_data` and would wrap `retrieve_one_batch`
- remove the `read_until_done` calls from the phoneview
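A hedged sketch of that renaming, assuming the signature used elsewhere in the thread; the batch-limit check and the resume field are assumptions based on the 10,000-entry limit mentioned above:

```python
class ServerSpecDetails(SpecDetails):
    BATCH_LIMIT = 10000  # the server caps each response at 10,000 entries

    def retrieve_one_batch(self, user, key_list, start_ts, end_ts):
        ...  # single POST to the datastore (the old retrieve_data)

    def retrieve_data(self, user, key_list, start_ts, end_ts):
        # the old read_until_done: keep pulling batches until one comes back
        # smaller than the server limit
        all_entries = []
        curr_start_ts = start_ts
        while True:
            batch = self.retrieve_one_batch(user, key_list, curr_start_ts, end_ts)
            all_entries.extend(batch)
            if len(batch) < self.BATCH_LIMIT:
                return all_entries
            # resume just past the last entry received (field name is an assumption)
            curr_start_ts = batch[-1]["metadata"]["write_ts"]
```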
We want to host this in OEDI, which will use AWS resources. So they basically need an easy-to-access, machine-readable format.
We discussed two formats:
e-mission can already export data from mongodb in JSON format
However, our scripts currently don't load one day's worth of data at a time. Instead, we download the spec, and then use the spec to pull chunks of data across multiple days (that represent multiple repetitions of the timeline) into one notebook for comparison.
We need to: