LenkaV / CIF

Composite Indicators Framework for Business Cycle Analysis
GNU General Public License v3.0

pipelineEvaluation() - parameters for specific countries? #23

Status: Closed (closed by kuritzen 2 years ago)

kuritzen commented 2 years ago

Hi,

Thanks for a really nice package. I have been working with it quite a lot, and I wonder whether the parameters differ between countries when calculating the CLI and choosing which series it should be based on.

pipelineEvaluation(), for example, has a parameter named weights, which I suppose could differ between countries?

Thanks in advance

LenkaV commented 2 years ago

Hello,

Yes, you are correct: the weights can differ between countries. I have added the full text of my thesis, so you can check how I worked with the weights in CIF.

The OECD theory is described in chapter 2.3.2 Evaluation and selection of component series on page 47:

'The number of selected series may differ across the countries and depends on the expert judgement of the data analyst. OECD (2010b) and other authors define the selection criteria very vaguely, see chapter 4 for more information. None of the available sources, to my best knowledge, provides the clear set of instructions on prioritizing between often contradictory goals: e.g., minimizing the number of missed and extra turning points while maximizing the leading time and cross correlation coefficient of the newly constructed CLI. This part of the construction therefore requires substantial individual choices and expert knowledge from the researchers.'

Please check especially the section 'Criteria prioritization' from chapter 4.5.4 Selecting component indicators on page 92:

'As the instructions published in OECD papers are not clear or unambiguous, the algorithm lets users set weights in order to prioritize between contradictory goals: e.g., to prefer series with minimum missing turning points and maximal mean lead, while paying less attention to the number of extra signals. The default values of these weights are chosen (...), so the users can get initial results without any unnecessary interventions. However, the system is fully parametrized and the weights can be easily changed if the obtained results are of insufficient quality.'

And chapter 6.3 CLI performance on page 119:

'The function pipelineEvaluation() takes parameter weights as an input. The following analyses work with the default set of weights, which usually gives the most similar results to the OECD CLIs. The default weights are as follows: number of missing turning points = 0.25, number of early missing turning points = 0.05, number of extra turning points = 0.15, mean lead time = 0.15, standard deviation of lead time = 0.00, coefficient of variation of lead time = 0.10, cross-check = 0.15, maximum of correlation coefficient = 0.15. However, the default settings aren’t optimal for some of the analysed countries (Japan, Mexico and South Africa), so altered weights are used. Such indicators are marked as weights adjusted in the tables. The new settings are discussed in more detail in subsection 6.3.2.'
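As a quick check of the default weights quoted above, here is a minimal sketch that pairs each value with its criterion and confirms that they sum to 1. The dictionary keys are illustrative labels chosen for this example, not identifiers from CIF. The thesis's "cross-check" appears to correspond to the "sanity check" entry in the function description, judging by its position and value.

```python
import math

# Default weights as quoted from the thesis, in the positional order
# used by the `weights` argument of pipelineEvaluation().
# The keys are illustrative labels, not CIF identifiers.
DEFAULT_WEIGHTS = {
    "missing_turning_points": 0.25,
    "missing_early_turning_points": 0.05,
    "extra_turning_points": 0.15,
    "mean_lead_time": 0.15,
    "std_lead_time": 0.00,
    "cv_lead_time": 0.10,
    "sanity_check": 0.15,  # called "cross-check" in the thesis quote
    "max_correlation": 0.15,
}

# Dictionaries keep insertion order, so this list is in positional order.
weights = list(DEFAULT_WEIGHTS.values())

# The defaults sum to 1 (up to floating-point rounding).
total = sum(weights)
print(f"{len(weights)} weights, total = {total:.2f}")
```

With these defaults, missing turning points carry the largest single weight, and the standard deviation of lead time does not contribute to the ranking at all.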

And here you can check the description of the pipelineEvaluation() function:

def pipelineEvaluation(df1, df2, missing, missingEarly, extra, time, checkCorr = True, maxInd = None, evalOnly = False, weights = [0.25, 0.05, 0.15, 0.15, 0.00, 0.10, 0.15, 0.15]):

Pipeline to choose the best individual series for composite leading indicator (computing
number of missing turning points (regular and early), number of extra turning points,
mean lead time, median lead time, standard deviation of lead time, coefficient of variation
of lead time, maximum of correlation coefficient, position of maximum of correlation
coefficient, sanity check (= difference between position of maximum of correlation
coefficient and median lead time)). With evalOnly = False, the weights are applied
to eight of these criteria to rank the individual series and select the best.

Parameters
-----
df1: pandas.DataFrame
    pandas DataFrame (with one column), values of reference series (gold standard)
df2: pandas.DataFrame
    pandas DataFrame, individual indicators to be compared with reference series
missing: pandas.DataFrame
    pandas DataFrame, missing turning points indicators (result of matchTurningPoints())
missingEarly: pandas.DataFrame
    pandas DataFrame, missing early turning points indicators (result of matchTurningPoints())
extra: pandas.DataFrame
    pandas DataFrame, extra turning points indicators (result of matchTurningPoints())
time: pandas.DataFrame
    pandas DataFrame, time of the turning points indicators (result of matchTurningPoints())
maxInd: int or None
    how many indicators should be returned at most (default None returns all that pass the conditions)?
checkCorr: bool
    should the highly correlated individual series be ignored (default True)?
evalOnly: bool
    if True, return only evaluation matrix; if False (default), return evaluation matrix
    with added total column (total rank), evaluation matrix of selected indicators and
    vector of selected columns
weights: list
    weights of 8 criteria, in this order:

    - number of missing turning points
    - number of missing early turning points
    - number of extra turning points
    - mean lead time
    - standard deviation of lead time
    - coefficient of variation of lead time
    - sanity check
    - maximum of correlation coefficient

    the sum of these weights should be equal to 1 for easier interpretation of the results,
    but this is not necessary; weights parameter is ignored when evalOnly = True

Returns
-----
totalEval: pandas.DataFrame
    dataframe with evaluation metrics of all series
selectedEval: pandas.DataFrame
    dataframe with evaluation metrics of the selected series; returned only if
    evalOnly = False
selectedCol: pandas.Index
    names of the selected series; returned only if evalOnly = False
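Since weights is a plain positional list, one way to keep country-specific settings readable is to store only the overrides by name and build the list from them. The following is a minimal sketch under stated assumptions: build_weights(), the criterion labels, and the "country_A" values are all hypothetical and made up for illustration. They are not part of CIF and are not the adjusted weights used in the thesis.

```python
import math

# Criterion labels in the positional order documented for `weights`.
# The labels themselves are hypothetical; only the order comes from the docstring.
CRITERIA = (
    "missing_turning_points",
    "missing_early_turning_points",
    "extra_turning_points",
    "mean_lead_time",
    "std_lead_time",
    "cv_lead_time",
    "sanity_check",
    "max_correlation",
)
DEFAULTS = [0.25, 0.05, 0.15, 0.15, 0.00, 0.10, 0.15, 0.15]


def build_weights(overrides=None, normalize=True):
    """Return the 8 weights as a list, replacing defaults with named overrides."""
    overrides = overrides or {}
    unknown = set(overrides) - set(CRITERIA)
    if unknown:
        raise KeyError(f"Unknown criteria: {sorted(unknown)}")
    merged = dict(zip(CRITERIA, DEFAULTS))
    merged.update(overrides)
    values = [merged[name] for name in CRITERIA]
    if any(v < 0 for v in values):
        raise ValueError("Weights must be non-negative")
    if normalize:
        # Summing to 1 is optional per the docstring; it only eases interpretation.
        total = sum(values)
        if total == 0:
            raise ValueError("At least one weight must be positive")
        values = [v / total for v in values]
    return values


# Made-up example: care less about extra signals, more about mean lead time.
COUNTRY_OVERRIDES = {
    "country_A": {"extra_turning_points": 0.05, "mean_lead_time": 0.25},
}
weights_a = build_weights(COUNTRY_OVERRIDES["country_A"])

# Intended use (not executed here; requires CIF and its input DataFrames):
# pipelineEvaluation(df1, df2, missing, missingEarly, extra, time, weights=weights_a)
```

Keeping the overrides in a dictionary makes it easy to see which criteria were changed for which country, while the rest fall back to the defaults.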

I hope this helps. If you have more questions, let me know.