Current implementation:
My current implementation of this was very basic and intended solely to avoid having to requery directions links to retrieve SVG path data. A path to a `feature_df.csv` from a previous run (containing the SVG path data and any initial processing for any links) can be provided and loaded in place of reprocessing any features available in that `feature_df.csv`. This saves a lot of time with the current implementation of querying data for directions links (see #1).
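For reference, the logic is roughly the following. This is a minimal sketch, assuming pandas and a hypothetical `unique_id` column; `process_features` stands in for the actual directions-query step and is not the real function name:

```python
from typing import Optional

import pandas as pd


def process_features(df: pd.DataFrame) -> pd.DataFrame:
    """Placeholder for the expensive step: querying each directions link
    and extracting its SVG path data."""
    out = df.copy()
    out["svg_path"] = None  # would be populated by the directions query
    return out


def load_or_process(features_df: pd.DataFrame,
                    cache_path: Optional[str] = None) -> pd.DataFrame:
    """Reuse rows from a previous run's feature_df.csv where possible."""
    if cache_path is None:
        return process_features(features_df)
    cached = pd.read_csv(cache_path)
    # Features whose unique ID appears in the cached run are reused as-is;
    # everything else goes through the normal processing path.
    hit = features_df["unique_id"].isin(cached["unique_id"])
    fresh = process_features(features_df[~hit])
    reused = cached[cached["unique_id"].isin(features_df["unique_id"])]
    return pd.concat([reused, fresh], ignore_index=True)
```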
Issue:
The current implementation would fail if the input data source changes (since that changes the unique IDs assigned to project-link or feature combinations). It is also limited to data from a single previous run.
Possible solutions:
This really depends on how far we want to go to deal with this. If querying directions links were faster, I would likely suggest we forgo this issue and just process the data freshly each build. That said, I could imagine cases where accessing cached data would be useful (e.g., OSM features changed and we want to use a specific version from an old build).
One extreme would be to build out a full caching system keyed on the unique build, input data, TUFF ID, link, etc. I am not sure how we would want to go about specifying which cached data to use and which not to. This would likely be substantial over-engineering for this application.
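If we ever did go that route, the key would presumably hash everything that determines a row's output, so stale entries invalidate themselves when any input changes. A minimal sketch, with all names illustrative:

```python
import hashlib


def cache_key(input_data_version: str, tuff_id: str, link: str) -> str:
    """Content-addressed cache key: stays stable across builds as long as
    the inputs that produced the row are unchanged, and changes (and thus
    misses) as soon as any of them do."""
    payload = "|".join([input_data_version, tuff_id, link])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```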
Another approach I considered during the initial implementation was to create a separate script that just merges any number of previous builds. This would be more hands-on and require someone to know what subset of data was processed in each build. But it would also be a convenient way of adding new data to an existing dataset without reprocessing the old data.
For example: You could run a full build of `input_data_01.csv`, then the underlying data is updated to include new projects as `input_data_02.csv`. The existing projects could be filtered from `input_data_02.csv` and only the new projects processed, then the results from build 1 and build 2 are merged.
This use case would get more complicated if subsets of existing data were updated, rather than new data simply added.
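A rough sketch of what that merge script might look like, using the same hypothetical column names as above; `keep="last"` is one possible policy for handling the updated-subset case (later builds win on overlapping IDs):

```python
import pandas as pd


def filter_new_projects(new_input: pd.DataFrame,
                        previous_build: pd.DataFrame) -> pd.DataFrame:
    """Keep only projects that were not processed in a previous build."""
    is_new = ~new_input["project_id"].isin(previous_build["project_id"])
    return new_input[is_new]


def merge_builds(*feature_dfs: pd.DataFrame) -> pd.DataFrame:
    """Stack any number of previous builds; on overlapping unique IDs,
    the row from the later build is kept."""
    merged = pd.concat(feature_dfs, ignore_index=True)
    return merged.drop_duplicates(subset="unique_id", keep="last")
```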
Ultimately I will likely leave this until the next update is needed and see what will be useful in practice based on data update patterns.