Closed cczhu closed 4 years ago
Had a rethink of this task, and now realize why Arman was reluctant to start mucking around with data imputation of permanent count stations, especially if he already had PECOUNT working to fill in missing ~2-3 months of data.
`daily_traffic * D_ijd` for all available data in that year, given that's what CountMatch suggests doing, or just using the MADT patterns from the closest year).
Revised class design:
- `DerivedVals`: base class that stores methods for calculating derived properties with no imputation. Since imputation is likely to be a strategy and not a single step in a data reduction pipeline, more sophisticated derived value calculations could subclass this base class.
- `GrowthFactor`: class that stores methods for calculating the growth factor, and results of the particular fit.
- `PermCount`: class that stores both processed data and results/diagnostics of `DerivedVals` and `GrowthFactor` instances within it. If it turns out we don't need to store anything from `DerivedVals` or `GrowthFactor` other than the results, we can mount them in `PermCountProcessor` instead of here (in which case `PermCount` will only be a useful indicator that the data structure is different than a regular count).
- `Imputed` column is enough (we could also use a masked DataFrame, but storing the mask would be more expensive than storing the column, and it would look different than for the derived properties). For derived properties that are averaged over multiple days, store the number of days, and `-1` if the value was imputed.
- `PermCountProcessor`: class that cycles through every count, first determining if it meets the requirements for being a PTC, then running the derived value and growth factor calculations. If it turns out we don't need to store anything from `DerivedVals` or `GrowthFactor` other than the results, we can mount them here rather than in `PermCount`.
Revised work plan:
- `locations x years` could be made into permanent count location-years if we started relaxing the criteria to be a permanent station. If there are tons of locations, or tons of years in current permanent count locations, that would be considered permanent with a small relaxing of criteria, it would justify further work on imputation.
Outstanding question: if `DerivedVals` and `GrowthFactor` classes have lots of method-specific flags, how should we control for these in `config.py`? Nested dicts?
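One possible answer to the nested-dict question, sketched below purely as an illustration: give each class its own sub-dict holding a method name and that method's flags. Every key, flag name, method name, and the `REGISTRY`/`build` helpers are assumptions made up for this example, not the project's actual `config.py`.

```python
# Hypothetical config.py layout: one sub-dict per class, so method-specific
# flags for DerivedVals and GrowthFactor cannot collide with each other.
CONFIG = {
    "derived_vals": {"method": "Standard", "settings": {}},
    "growth_factor": {"method": "Exponential", "settings": {"min_years": 2}},
}


class DerivedVals:
    """Base class: derived properties with no imputation."""

    def __init__(self, **settings):
        self.settings = settings


class GrowthFactor:
    """Growth factor methods plus the results of a particular fit."""

    def __init__(self, **settings):
        self.settings = settings


# Hypothetical lookup from (kind, method name) to the implementing class.
REGISTRY = {
    "derived_vals": {"Standard": DerivedVals},
    "growth_factor": {"Exponential": GrowthFactor},
}


def build(kind, config=CONFIG):
    """Instantiate the class named in the config with its own flags."""
    entry = config[kind]
    cls = REGISTRY[kind][entry["method"]]
    return cls(**entry["settings"])


dv = build("derived_vals")
gf = build("growth_factor")
```

With this layout, adding a new imputation strategy would mean registering one more subclass and adding its flags under `settings`, without touching the flags of other methods.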
As I'm refactoring to allow for imputation and other multi-year preprocessing of permanent counts, I've created a design where all data from a count location across multiple years are stored in the same object. The amount of data taken at a location can drastically change from year to year, and the permanent count criteria are defined in terms of a single year's data, so currently each location's data is checked year by year using `PermCountProcessor.partition_years`. The output of this function is `perm_years`, a list of years that meet the permanent count criteria. Once a `PermCount` object is created for the count location, `perm_years` is stored within it, and is used in a number of subsequent methods for calculating derived values and growth factors.
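The year-by-year check described above can be sketched roughly as follows. The actual permanent count criteria are not given here, so the "minimum number of days with data" rule, the `MIN_DAYS_PER_YEAR` value, and the input format are all assumptions chosen only to make the example runnable.

```python
from collections import defaultdict
from datetime import date

# Hypothetical single-year criterion: a year qualifies if it has at least
# this many distinct days of data. The real criteria are not shown here.
MIN_DAYS_PER_YEAR = 3


def partition_years(daily_counts, min_days=MIN_DAYS_PER_YEAR):
    """Return the sorted years that satisfy the per-year criterion.

    daily_counts maps datetime.date -> daily traffic count for one location.
    """
    days_by_year = defaultdict(set)
    for day in daily_counts:
        days_by_year[day.year].add(day)
    return sorted(
        year for year, days in days_by_year.items() if len(days) >= min_days
    )


counts = {
    date(2010, 1, 4): 1200,
    date(2010, 1, 5): 1250,
    date(2010, 1, 6): 1190,
    date(2011, 3, 1): 1300,  # only one day of data in 2011
}
perm_years = partition_years(counts)
```

Because the check is applied to each year separately, a location can qualify in some years and not in others, which is why `perm_years` is a list rather than a single flag.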
There are a number of issues with this design:
- `perm_years` can freely be modified by the user, but it's a property governed by the general criteria they set.
- `partition_years` is in `PermCountProcessor` because that class handles partitioning counts into STTCs and PTCs, but I'm not a fan of a class instance's fundamental properties being calculated by other classes.
- `perm_years` changes due to iterative imputation.
Alternative design:
- `perm_years` a (lazy?) property of `PermCount`, and `partition_years` a method.
- `perm_years` should be calculated in `__init__`, and raise an error if `perm_years` has zero length. `from_count_object` should then initialize an instance using `self = cls(...)` in the middle.
- `PermCountProcessor` should try to initialize a permanent count instance for every single count location in a try/except block, and create a short term count if the initialization fails.
_py.
Decided to adopt points 1 and 3, but not 2. `perm_years` is now an instance property, but it is still initialized by `PermCountProcessor`. This is because practically everything else is initialized by `PermCountProcessor` (all derived values and growth factors, for example), so it would be strange to use an entirely different pattern to attempt initialization of `perm_years`, especially since it needs to communicate back to its parent whether it succeeded in making a permanent count. We can always refactor later.
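A minimal sketch of the arrangement described above, where `perm_years` is exposed as a property on `PermCount` but is still computed by `PermCountProcessor`. The constructor signatures, the `get_count` method, the `min_days` criterion, and the use of `None` to signal a short-term count are all assumptions for illustration, not the code merged in the repository.

```python
class PermCount:
    """Multi-year data for one location, plus the years that qualify."""

    def __init__(self, count_id, data, perm_years):
        self.count_id = count_id
        self.data = data
        self._perm_years = tuple(sorted(perm_years))

    @property
    def perm_years(self):
        # No setter is defined, so assigning to perm_years raises
        # AttributeError instead of silently diverging from the criteria.
        return list(self._perm_years)


class PermCountProcessor:
    """Decides which years qualify and builds PermCount instances."""

    def __init__(self, min_days=2):
        self.min_days = min_days  # hypothetical single-year criterion

    def partition_years(self, data):
        # data maps year -> list of daily counts for that year.
        return sorted(
            year for year, days in data.items() if len(days) >= self.min_days
        )

    def get_count(self, count_id, data):
        """Return a PermCount, or None to signal a short-term count."""
        perm_years = self.partition_years(data)
        if not perm_years:
            return None
        return PermCount(count_id, data, perm_years)


processor = PermCountProcessor(min_days=2)
perm = processor.get_count(101, {2010: [1200, 1250], 2011: [1300]})
short_term = processor.get_count(102, {2010: [900]})
```

If iterative imputation later needs to update the qualifying years, a method on the processor that rebuilds the `PermCount` (or an explicit setter with validation) would be one way to keep that change deliberate.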
Resolved by #37
Based on comments in #25 and discussions with @aharpalaniTO, we'll be revising how we treat permanent counts. To do this, we can no longer identify and process permanent count locations while reading in data year-by-year, since data imputation and outlier detection for permanent count stations require data from other years.
To resolve this:
- `AnnualCount`. These will eventually go into
- `AnnualCount` to `RawAnnualCount`, since it's only used by `Reader` and gets discarded as soon as multiple years are combined into single multi-index tables.
- `Reader` so that it no longer distinguishes between PTCs and STTCs - all locations will be handled the same way.
- `growthfactor.py` into `permcount.py`, which will handle all permanent count processing, including outlier detection. We'll move methods into their own files in a subsequent issue if `permanentcount.py` gets too long. This will also solve one point in #26.