QuickPay-Operational-Performance / Data-and-code

Data and code for econometric analysis
0 stars 0 forks source link

Create Project_Delay column #6

Closed vob2 closed 4 years ago

vob2 commented 4 years ago

The logic should be to take the last entry for period_of_performance_current_end_date variable and subtract the first entry for this variable.

But the question is if the last entry corresponds to the actual completion of the project. How do we know if and when project has been completed?

vibhuti6 commented 4 years ago

I have created a python function to obtain project delays for any given transactions file.

For details, see _quickpaydatacleaning.py in https://github.com/QuickPay-Operational-Performance/Data-and-code

I think we will have to rely on some ad-hoc measures to determine if the project has been completed. For example, if there are no further transactions for a project that started in 2010, it may be reasonable to assume that it is complete. Please also see my response to #4

vob2 commented 4 years ago

Here is a copy of the code (for the record) def calculate_delays(df):#,path_to_dictionary_csv): df=df.copy(deep=True)#copy so that the input dataframe is not altered ###########################

Clean transactions file

###########################
df=df.drop_duplicates()
df=clean_text_columns(df)
df=convert_to_date_time(df)
if not 'contract_award_unique_key' in set(df.columns): # if column does not exist
    df=create_combined_prime_id(df)    
    id_name='combined_id_prime_contract'
else: 
    id_name='contract_award_unique_key'
##############
# Get Delays #
##############
df_subset_earliest=df.sort_values(by='action_date').drop_duplicates(subset=id_name)
# get columns corresponding to earliest action date for each contract
df_subset_earliest=df_subset_earliest.rename(columns={'period_of_performance_current_end_date':'initial_end_date'})
# rename completion date column to denote the initial end date of the project
df_subset_earliest=df_subset_earliest[[id_name,'initial_end_date']]

df_subset_latest=df.sort_values(by='action_date',ascending=False).drop_duplicates(subset=id_name)
# get columns corresponding to latest action date for each contract
df_subset_latest=df_subset_latest.rename(columns={'period_of_performance_current_end_date':'eventual_end_date'})
# rename completion date column to denote the eventual end date of the project
df_subset_latest=df_subset_latest[[id_name,'eventual_end_date']]

delays_df=pd.merge(df_subset_latest,df_subset_earliest,on=id_name)
delays_df["days_of_change_in_deadline_overall"]=(delays_df.eventual_end_date-delays_df.initial_end_date).dt.days

return delays_df