Closed hugo-e-pigott closed 2 years ago
- Will share the CSV that is immediately downloaded from the Customs system for April. Will delete the identifiable data. Will have to write code to potentially delete these columns and rows. Potentially do this at the end and produce two CSVs: one that is anonymised and another with the individual information
Is this the SEC_PROC_ASY_AnnoymisedRawDataAndReferenceTables_30-04-21.csv file you shared in Skype? We'll also need the formatted tables up until March 2021 - sorry if you have already sent this and I missed it! 😊
- Converting SITC to character (line 57), SITC Code 8431.2 isn't in a convertible format. Anna checked this in the HS Code tables that have corresponding SITC codes, and in fact the SITC code should be 0744.92. Have gone back into the raw data and changed it. Now running with no issues
Great! 👍
- Explore extent of missing data (line 84), checking for large amounts of missing data (>0.1). Can we also look at missing data if only a few cells, maybe create data-frames for each column with missing data (similar to the PRF data-frame)?
The `numberMissing` and `proportionMissing` vectors provide the number of missing values and proportion of missing values in each column. Here's the `numberMissing` variable:

```
Office Reg.Ref Reg..Date Type CP4 CP3 ...
     0       0         0    0   0   0 ...
```
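For reference, those vectors can be calculated roughly as follows (a sketch with a toy data.frame standing in for the trades data; the real script may compute them differently):

```r
# Toy stand-in for the trades data (the real data.frame has many more columns)
tradeStatsCommodities <- data.frame(Office = c("VILA", "SANTO", NA, "VILA"),
                                    SITC   = c("0744.92", NA, NA, "0342.1"))

# Count and proportion of missing values in each column
numberMissing <- colSums(is.na(tradeStatsCommodities))
proportionMissing <- numberMissing / nrow(tradeStatsCommodities)
```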
- Extract 3 and 1 digit SITC Codes (lines 100 & 101), two null or NA values in the SITC column skipped (HS Codes 71189000 & 99050000). Checked the HS Tables: the document states that the SITC corresponding to HS Code 71189000 is null (will have to confirm with Customs), but the SITC corresponding to HS Code 99050000 is 0990.15 - have gone back into the raw data and changed it. Now only one remaining issue.
Yes, I think these are carried through from `NA` values in the SITC column.
- NAs present in any of the above columns (line 104), the code `View(tradeStatsCommodities[is.na(tradeStatsCommodities$column_name), ])` may not be working
This is an example piece of code - you'll need to replace `column_name` with the name of the column that you're interested in. I've added an explanation into the comment.
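For example, to pull out the rows with a missing SITC value (using `SITC` here as a stand-in; any column of interest works the same way):

```r
# Toy data.frame; in the real script this would be the trades data
tradeStatsCommodities <- data.frame(SITC = c("0744.92", NA, "0342.1"),
                                    PRF  = c(NA, "MSG", NA))

# Subset to the rows where SITC is missing
rowsMissingSITC <- tradeStatsCommodities[is.na(tradeStatsCommodities$SITC), ]
# View(rowsMissingSITC) would then open this subset in the data viewer
```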
- Been through the checkingMergingColumnsForClassificationTables() function - would it be possible to quickly explain how this works? Also not sure of the format of the data-frame produced, "missingValuesHSCode_2" - could you explain this also?
The `checkingMergingColumnsForClassificationTables()` function asks two questions: are there values in the merge column of the trades data that aren't in the classification table, and are there values in the classification table that aren't in the trades data? Based on these questions the function will warn the user if any values are missing (in either direction), and it will also return the missing values in a list structure (`missingValuesHSCode_2` for the check on the mode of transport merge):
```
$NotPresentInClassificationTable
character(0)

$NotPresentInTradesData
character(0)
```
In the above example no missing values were identified, so the `NotPresentInClassificationTable` and `NotPresentInTradesData` slots of the list are empty (`character(0)`). Because missing values were identified in the merge column of the trades data when compared to the HS 2 classification table, the `missingValuesHSCode_2` looks like this:
```
$NotPresentInClassificationTable
character(0)

$NotPresentInTradesData
[1] "13" "26" "43" "47" "50" "53" "55" "75" "77" "80" "81" "93"
```
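The logic behind those two slots can be sketched as follows (a hypothetical simplification - the real function also prints the warnings described above):

```r
# Simplified sketch of the two-directional comparison
checkMergeColumn <- function(tradesValues, classificationValues) {
  list("NotPresentInClassificationTable" = setdiff(tradesValues, classificationValues),
       "NotPresentInTradesData"          = setdiff(classificationValues, tradesValues))
}

# Example: code "13" appears in the trades data but not the classification table
missingValues <- checkMergeColumn(c("01", "02", "13"), c("01", "02"))
```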
- Merging SITC with cleaned data (line 158). Warning message: 1 code in the SITC cleaned data that is not in the classifications. Again not sure of the format of the data-frame "missingValuesSITC_1" (is this similar to the other merges?)
The merge on SITC for me reports 2 missing values, so `missingValuesSITC_1` has the following content:

```
$NotPresentInClassificationTable
[1] NA NA

$NotPresentInTradesData
integer(0)
```

telling us that there are two values (both `NA`s) that aren't in the SITC classification table. We could update the `checkMergingColumnsForClassificationTables()` function to ignore `NA` values and also only report unique values?
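That tweak could look something like this (an assumed change, applied to the merge-column values before the comparison):

```r
# Drop NAs and duplicates from the merge-column values before comparing
tradesValues <- c("0711", NA, "0711", NA, "0744")
valuesToCheck <- unique(tradesValues[!is.na(tradesValues)])
```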
- Merging Import with cleaned data (line 171). Warning message "185 codes in the "CO" column of the Import country classification table are not in the "CO" column of the trades statistics table"- very possible.
Yes, we can discuss these warning messages - I think it's good to have them so we're always checking the classification tables, but in many cases missing data are expected because classification tables often represent a subset of the data.
- Merging export with cleaned data (line 184). Warning message "194 codes in the "CE.CD" column of the Export country classification table are not in the "CE.CD" column of the trades statistics table"- very possible
As above.
- Merging principle commodities with cleaned data (line 198). Warning message: "378 codes in the "HS.Code" column of the trades statistics table are not in the "HS.Code" of the Principle Commodities classification table". Need to investigate further and add the missing information to the classifications table. Also, how is this possible when the change in observations from the previous merge is only about 5? Another warning message: "3759 codes in the "HS.Code" column of the Principle Commodities classification table are not in the "HS.Code" column of the trades statistics table" - very possible
As above.
- Merging BEC values with cleaned data (line 211). Warning message: "611 codes in the "HS.Code_6" column of the trades statistics table are not in the "HS.Code_6" of the BEC descriptions classification table". Need to investigate further and add the missing information to the classifications table. Another warning message: "3769 codes in the "HS.Code_6" column of the BEC descriptions classification table are not in the "HS.Code_6" column of the trades statistics table" - very possible
As above.
- For merging principle commodities and BEC, have to think through a strategy for checking missing values against the UN outline
We should revisit this at our next catch up - if there's a UN BEC table available a quick check should be pretty straightforward.
- Check number of rows with NAs present (line 218), not sure what this code is doing - is it a follow-on from line 104, highlighting the PRF NAs?
This code is looking at rows instead of columns and ignoring the PRF column because almost all rows have an `NA` in this column. The `rowsWithNAValues` data.frame is a subset of the trades data containing the rows that have `NA` values in any column except the `PRF` column. Not sure if this is useful - if not we could remove it?
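Roughly, that subsetting works like this (a sketch on toy data; the real script uses the full trades data.frame):

```r
# Toy data: PRF is nearly all NA, so it's excluded from the check
trades <- data.frame(HS.Code = c("0101", NA, "0303"),
                     Value   = c(10, 20, NA),
                     PRF     = c(NA, NA, NA))

# Keep rows with an NA in any column except PRF
columnsToCheck <- setdiff(colnames(trades), "PRF")
rowsWithNAValues <- trades[rowSums(is.na(trades[, columnsToCheck])) > 0, ]
```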
- Tried to run code (line 235) checking commodity values against expected values and get error "Error in match.names(clabs, names(xi)) : names do not match previous names"
This is likely caused by a change in a column name - I've added some more checks in the code that will hopefully send a more informative error message and help us figure out what's changed.
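The checks are along these lines (a hypothetical sketch - the exact column names and message here are just examples, not the script's actual code):

```r
# Fail early with a clear message if expected columns are missing
tradeStats <- data.frame(HS.Code = "0101", Stat..Value = 100)
expectedColumns <- c("HS.Code", "Stat..Value")

missingColumns <- setdiff(expectedColumns, colnames(tradeStats))
if (length(missingColumns) > 0) {
  stop("Column name(s) not found in the trade statistics data: ",
       paste(missingColumns, collapse = ", "))
}
```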
Code is working well. Need to develop a Table 11, and write code for Table 14. Additionally, quality check all outputs against the previous methodology.
Great, and those are good next steps 👍
- Loading packages, getting an error with "flextable" (line 14). Have used this Stack Overflow link (https://stackoverflow.com/questions/48199737/cant-install-flextable) and installed Homebrew but am still getting the error below - maybe a Mac issue that was highlighted in one of the comments
```
Error: package or namespace load failed for ‘flextable’ in dyn.load(file, DLLpath = DLLpath, ...):
 unable to load shared object '/Library/Frameworks/R.framework/Versions/3.6/Resources/library/systemfonts/libs/systemfonts.so':
 dlopen(/Library/Frameworks/R.framework/Versions/3.6/Resources/library/systemfonts/libs/systemfonts.so, 6): Library not loaded: /opt/X11/lib/libfreetype.6.dylib
 Referenced from: /Library/Frameworks/R.framework/Versions/3.6/Resources/library/systemfonts/libs/systemfonts.so
 Reason: image not found
```
From the above it looks like you were able to install the flextable library without an error, and the error is coming when you try to load it? It looks like this isn't an error that is specific to flextable (it also occurs for kableExtra). You could try the following, if you haven't already: reinstalling the `systemfonts` package.
- NSDP Indicators: have taken out ECO 1.4.1 - no longer think it can be used as a proxy. The indicators we have are ECO 1.5.2, 1.5.3, 1.6.2 and 1.7.1. Also think we could add ENV 1.3.1 and ECO 4.3.1 and 4.3.2. However, for now will focus on ECO 1.5.2-1.7.1. Link to the NSDP M&E Framework: https://www.gov.vu/images/publications/NSDP%20M&E%20Framework.pdf
Pasting a screenshot of these here so we can see them:
- Indicator 1.5.3, code from lines 122-125 - is this needed? Not quite sure what it is doing
This code (shown below) is removing the ": Exports" suffix from the column names of the `balanceOfTradeByMajorPartnerCountries` data.frame. I've simplified it to the following:

```r
colnames(balanceOfTradeByMajorPartnerCountries) <- gsub(pattern = ": Exports", replacement = "",
                                                        colnames(balanceOfTradeByMajorPartnerCountries))
```
- Indicator 1.6.2, used hard coding in line 139, how do I adapt this?
I've added the following code:

```r
colnames(timeSeriesGrossDomesticProduct) <- c("Year", "Value")
timeSeriesGrossDomesticProduct <- timeSeriesGrossDomesticProduct[order(timeSeriesGrossDomesticProduct$Year), ]

# Proxy indicator for NSDP ECO 1.6.2
mostRecentGrossDomesticProduct <- timeSeriesGrossDomesticProduct[nrow(timeSeriesGrossDomesticProduct), "Value"] # identify most recent GDP
```

The above forces reasonable column names, orders the rows by year, and selects the value in the second column of the last row. If you only ever want 2018, change the last line to:

```r
mostRecentGrossDomesticProduct <- timeSeriesGrossDomesticProduct[timeSeriesGrossDomesticProduct$Year == 2018, "Value"] # identify 2018 GDP
```
- Calculate Target 1.5.2 (line 170). Using MSG, however in the 2019 data PICTA is also used - have to add MSG and PICTA to all calculations of Trade Agreements (line 173). Hard coding again used in line 188 - trying to compare the current month being analysed to historical months to estimate whether the Target is being achieved
Not sure I understand this one - are you always wanting to compare to January 2019, or is January the current month for the report and you've hard coded it to work for January? If it's the latter you could use the built-in `month.name` or `month.abb` variables in R to identify what month number you want to use:

```r
myMonth <- "February"
myMonthAsNumber <- which(month.name == myMonth)

myAbbreviatedMonth <- "Feb"
myMonthAsNumber <- which(month.abb == myAbbreviatedMonth)
```
- Calculate Target 1.5.3 (line 192). Have created baseline datasets using the historical data "imports_HS_summaryStats_02-10-20.csv" and "exports_HS_summaryStats_02-10-20.csv", subset to 2016. Not sure how to calculate the trade balance for major partner countries per month.
I think this is a good one to have a chat about at our catch up, so we can work through the steps together.
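As a starting point for that chat, here's one rough way a monthly balance per partner country could be calculated (toy data; the column names are assumptions, not the real trades data):

```r
# Toy trades data: partner country, month, flow type and statistical value
trades <- data.frame(Country = c("Fiji", "Fiji", "Australia", "Fiji"),
                     Month   = c(1, 1, 1, 2),
                     Type    = c("Export", "Import", "Export", "Export"),
                     Value   = c(100, 60, 30, 15))

# Treat imports as negative, then sum by country and month to get the balance
trades$Signed <- ifelse(trades$Type == "Import", -trades$Value, trades$Value)
balances <- aggregate(Signed ~ Country + Month, data = trades, FUN = sum)
```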
- Calculate Target 1.6.2 (line 203). Not sure how to calculate the value of trade agreements as a proportion of GDP per month, using 2016 GDP to create a baseline to compare the current data to.
I think this would be a good one to chat about at our catch up - just so we're on the same page about the steps involved.
- Calculate Target 1.7.1 (line 220). Similar issue to Target 1.5.3. Not sure how to calculate the major commodities per month.
I think this would be a good one to chat about at our catch up - just so we're on the same page about the steps involved.
- Trade Balance by Major Partner Countries: the data frame is created, not sure how to visualise it with a map. Most likely need to add a table too, to highlight exports and imports
For plotting on a map, you'll need:

- country polygons - I've used the `rnaturalearth` package for this, but you can download a shape file from anywhere (for the leaflet map we used this one)
- the `sf` spatial package
- `ggplot()`
I've added some example code:

```r
library(ggplot2)

# Get the polygons for the world
worldmap <- rnaturalearth::ne_download(scale = 110,
                                       type = "countries",
                                       category = "cultural",
                                       destdir = tempdir(),
                                       load = TRUE,
                                       returnclass = "sp")

# Convert to an sf spatial feature
world <- sf::st_as_sf(worldmap)

# Generate some example data
countryRecords <- data.frame("ISO3" = sample(world$ADM0_A3, size = 50),
                             "RandomValue" = runif(50))

# Merge the example data in
world <- merge(world, countryRecords, by.x = "ADM0_A3", by.y = "ISO3", all.x = TRUE)

# Plot with ggplot
ggplot(data = world) +
  geom_sf(aes(fill = RandomValue))
```
- Trade Balance of Pacific islands: hard coding used again at lines 396 & 426. Have calculated the top 5 imports and exports from Pacific countries. Need to exclude countries with 0, and create a bar chart for all others - not sure how to do this
Here's some rough code that combines Imports, Exports and Balance into a single data.frame and feeds it into a bar chart - I've added it to the report script:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Combine Imports and Exports into a single table
totalPacificCountryImportsTransposed <- as.data.frame(t(totalPacificCountryImports))
totalPacificCountryImportsTransposed$Country <- gsub(": Exports|: Imports", "", rownames(totalPacificCountryImportsTransposed))
totalPacificCountryExportsTransposed <- as.data.frame(t(totalPacificCountryExports))
totalPacificCountryExportsTransposed$Country <- gsub(": Exports|: Imports", "", rownames(totalPacificCountryExportsTransposed))
pacificCountryTotals <- merge(totalPacificCountryImportsTransposed, totalPacificCountryExportsTransposed,
                              by = "Country", all = TRUE, suffixes = c("Imports", "Exports"))
colnames(pacificCountryTotals) <- gsub("V1", "", colnames(pacificCountryTotals))

# Calculate the trade balance
pacificCountryTotals$Balance <- pacificCountryTotals$Exports - pacificCountryTotals$Imports

# Remove the total row and pivot into the format for a bar chart
pacificCountryTotalsLong <- pacificCountryTotals %>%
  filter(Country != "Total") %>%
  pivot_longer(cols = c("Imports", "Exports", "Balance"),
               names_to = "Type", values_to = "Value")

# Create a bar chart
ggplot(pacificCountryTotalsLong, aes(x = Country, y = Value, fill = Type)) +
  geom_bar(stat = "identity", color = "black", position = position_dodge()) +
  theme_minimal() +
  scale_fill_brewer(palette = "Blues") +
  theme(legend.title = element_blank(),
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
```
- Trade Balance of New Emerging Markets: Is it possible to talk me through this code, not following it correctly?
Yes, best to do this during our catch up I think.
- Trade by Trade Agreement: Is it possible to talk me through this code, not following it correctly?
Yes, best to do this during our catch up I think.
*Script 1
Create API to connect to Customs System. This was the response from Stanley, who leads on their database.
Line 29: “Delete header row from raw data”- this code is not working. When read in data, the top row is blank- not the column headers. This must be deleted and then row below made into headers. Have tried a few different ways to write the code but none working.
Line 87: After “Set the plotting margins” get a blank pop up that says “Quartz”. Not sure what this is unless supposed to be the bar chart visualisation of missing data…. But as stated the pop up is blank
Merging
• Principle Commodities: currently 378 codes missing from the classification table - have to create code to take all rows of this data and extract to CSV, so Anna can review before integrating into the classification table. Also, after merging, the number of observations changes from 12,539 to 12,544 - why is this?
• BEC Classifications: got classifications from the UN, in the region of 50,000, however only realised after merging that a number of the HS.Code_6 values are duplicates. After using the new classifications in the merge, went down from 628 missing codes to 233. However, the number of observations jumped from 12,544 to 175,995! Two questions: how do I delete duplicates from a specific column (HS.Code_6)? And, same as above, create code to take all rows of data that are not in the classification table, create a data-frame and extract to CSV
Line 245: “Check the commodity values against expected values based on historic data” this is still giving an error.
*Script 2
Table 11
• Line 658 "Get the trade stats for the exports aligned to the MSG agreement". The PRF column records imports aligned to the MSG agreement but not exports. So have merged the MSG agreement commodities into the dataset, and then used the column "Not.Included.in.MSG" to filter out the right information. The code I have written here seems extremely long winded - is there a way to shorten it?
• Lines 670 & 686: checking value differences here between using group_by() and creating a data-frame to look at the top ten exporting commodities. The figures are different and I wasn't sure why. Maybe because group_by() includes exports and re-exports (1000, 3071) and the data-frame only exports (1000)?
• Create a table and extract to the Excel xlsx output sheet 11, to show the top ten exports highlighted in the data-frame (simple table - top ten exports aligned to MSG and their statistical value)
• Similar issues for imports as the previous two points (figures not matching; create a table to extract and add to Excel)
Table 14
• Have to sort out the classifications, but think the code looks good to create the data-frame. Just have to format into a table and extract into Excel
*Script 3 (overall comment: possible to clean/reduce the code to be more efficient?)
NSDP Indicators (Policy, Indicator, Target)
• NSDP Indicators 1.5.2, 1.5.3, 1.6.2 and 1.7.1 seem ok. Would it be possible to review the code and clean it up?
• NSDP Indicator Targets: there is hard coding when choosing the current month that we need to get rid of. Target 1.7.1 is not properly working (maybe the merge is not functioning properly)
• Still have to adapt the table and connect it to the indicators & targets
Trade Balance by Major Partner Countries
• Need to take the data-frame and visualise it on the map. If there's space, maybe worth adding a small table next to the map - is this possible, with specific values for countries?
Trade Balance of Pacific Countries • Exclude those with zero exports and imports
Trade Balance of new emerging markets
• Need to add values above or below the bars, as currently it's difficult to see exports because their values are small compared to imports
Trade by Trade Agreement (statistical value)
• Create a cross tab table of the top 5 exports to the specific Pacific islands that are part of the MSG (rows to have exports, column headers country - Fiji, New Caledonia, Papua New Guinea, Solomon Islands)
• Create a cross tab table of the top 5 imports from the specific Pacific islands that are part of the MSG (rows to have imports, column headers country - Fiji, New Caledonia, Papua New Guinea, Solomon Islands)
Principle Exports & Imports
• Have to take the current monthly calculations "majorExportsCurrentMonth" & "majorImportsCurrentMonth" and add them to the historical data-frame. Then visualise the monthly values
Top 5 new Major Exports and Imports
• "Select top 5 new exports" & "Select top 5 new imports". Tried to select the top five rows using the piping function but it's not working.
Imports of Products related to Health • Still have to do
Also, as a reminder, it looks as though some products such as Kava roots are not coded under kava - I found them by searching the "Goods.Comm..Dsc" column of the processedTradeStats data frame.
Is it possible to search for words in R?
Create API to connect to Customs System. This was Stanley’s response who leads on their database.
We can chat about this, but I think the next step is to start building a URL string and using the functions in the httr and jsonlite packages to run some queries - something like what is being done on page 2 of this tutorial.
Line 29: “Delete header row from raw data”- this code is not working. When read in data, the top row is blank- not the column headers. This must be deleted and then row below made into headers. Have tried a few different ways to write the code but none working.
Best thing to do in this case is to use the `skip` parameter of the `read.csv()` function. Setting `skip = 1` when calling `read.csv()` means R skips the first line and starts reading the data frame from the second row.
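A quick demonstration of the effect (using an inline text CSV with a blank first line, standing in for the raw Customs file):

```r
# The first line is blank junk; skip = 1 makes the second line become the header
csvText <- ",,\nOffice,Type,Value\nVILA,IM4,100"
tradeStats <- read.csv(text = csvText, skip = 1, header = TRUE)
```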
Line 87: After “Set the plotting margins” get a blank pop up that says “Quartz”. Not sure what this is unless supposed to be the bar chart visualisation of missing data…. But as stated the pop up is blank
Weird one - "Quartz" is the name of R's plotting window on macOS, but not sure why it would pop up blank. `par(mar = c(10, 4, 4, 1))` is just setting the plotting margins, and I think the plot has now been removed so this can go too.
Principle Commodities: Currently 378 codes missing from classification table- have to create code to take all rows of this data and extract to CSV, Anna can then review before integrating into the classification table. Also, after merging the number of observations changes from 12,539 to 12,544, why is this?
Added a TODO here for getting the missing data and quality issues flagged in a csv for Anna. On the increasing number of rows: this is caused by duplicate values in the `HS.Code` column of the principle commodities table.
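A tiny demonstration of why duplicate keys inflate the row count on merge:

```r
trades <- data.frame(HS.Code = c("0101", "0202"), Value = c(10, 20))

# "0101" appears twice in the classification table
classifications <- data.frame(HS.Code = c("0101", "0101", "0202"),
                              Description = c("a", "b", "c"))

# Each trades row matches every duplicate key, so rows multiply
merged <- merge(trades, classifications, by = "HS.Code", all.x = TRUE)
```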
• BEC Classifications: Got classifications from UN in region of 50,000 however only realised after merging that a number of the HS.Code_6 are duplicates. After using new classifications in merge went down from 628 missing codes from classification table to 233. However, number of observations jumped from 12,544 to 175,995! Two questions: How do I delete duplicates from a specific column (HS.Code_6)? Same as above create code to take all rows of data that is not in classification table, create a data-frame and extract to csv
Yes, the jump in observations is worrying - we should add a check for duplicates on the columns used for merging each of the classification tables. For removing duplicates you can use the `duplicated()` or [`distinct()`](https://www.datasciencemadesimple.com/remove-duplicate-rows-r-using-dplyr-distinct-function/#:~:text=DataScience%20Made%20Simple-,Remove%20Duplicate%20rows%20in%20R%20using%20Dplyr%20%E2%80%93%20distinct%20()%20function,variable%20or%20with%20multiple%20variable.) functions.
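For example, with `duplicated()` in base R, keeping the first occurrence of each HS.Code_6 (toy data):

```r
becClassifications <- data.frame(HS.Code_6 = c("010121", "010121", "010129"),
                                 BEC = c("41", "41", "51"),
                                 stringsAsFactors = FALSE)

# Keep only the first row for each HS.Code_6 value
becDeduplicated <- becClassifications[!duplicated(becClassifications$HS.Code_6), ]
```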
Line 245: “Check the commodity values against expected values based on historic data” this is still giving an error.
This will be a column naming issue I think - I think we are working on different versions of the trade statistics data. Note that this could be caused by changing the headers at lines 29-32 - I'd recommend removing this and using the `skip` parameter of the `read.csv()` function.
Column names for `SEC_PROC_ASY_RawDataAndReferenceTables_31-01-20.csv`:

```
[1] "Office" "Reg.Ref" "Reg..Date" "Type" "CP4" "CP3" "Declarant"
[8] "Itm.." "HS.Code" "SITC" "PRF" "Goods.Comm..Dsc.1" "Goods.Comm..Dsc." "CO"
[15] "CE.CD" "Wgt..Gross" "Wgt..Net" "Pkg.." "Pkg..type" "Supp.Qty" "Supp"
[22] "TOD" "Stat..Value" "X" "IMD" "VAT" "IEX" "DEX"
[29] "EXD" "OED"
```
Column names for `imports_HS_summaryStats_02-10-20.csv`:

```
[1] "HS" "Value.Mean" "Value.SD" "Value.Median" "Value.Lower.2.5" "Value.Upper.97.5"
[7] "Value.Lower.1" "Value.Upper.99" "Value.Min" "Value.Max" "Value.Count" "Value.CountMissing"
[13] "UnitValue.Mean" "UnitValue.SD" "UnitValue.Median" "UnitValue.Lower.2.5" "UnitValue.Upper.97.5" "UnitValue.Lower.1"
[19] "UnitValue.Upper.99" "UnitValue.Min" "UnitValue.Max" "UnitValue.Count" "UnitValue.CountMissing"
```
Column names for `exports_HS_summaryStats_02-10-20.csv`:

```
colnames(historicExportsSummaryStats)
[1] "HS" "Value.Mean" "Value.SD" "Value.Median" "Value.Lower.2.5" "Value.Upper.97.5"
[7] "Value.Lower.1" "Value.Upper.99" "Value.Min" "Value.Max" "Value.Count" "Value.CountMissing"
[13] "UnitValue.Mean" "UnitValue.SD" "UnitValue.Median" "UnitValue.Lower.2.5" "UnitValue.Upper.97.5" "UnitValue.Lower.1"
[19] "UnitValue.Upper.99" "UnitValue.Min" "UnitValue.Max" "UnitValue.Count" "UnitValue.CountMissing"
```
Column names for the `tradeStatsCommoditiesMergedWithClassifications` variable:

```
[1] "HS.Code_6" "HS.Code" "CE.CD" "CO" "SITC_1"
[6] "HS.Code_2" "Office" "Reg.Ref" "Reg..Date" "Type"
[11] "CP4" "CP3" "Declarant" "Itm.." "SITC"
[16] "PRF" "Goods.Comm..Dsc.1" "Goods.Comm..Dsc." "Wgt..Gross" "Wgt..Net"
[21] "Pkg.." "Pkg..type" "Supp.Qty" "Supp" "TOD"
[26] "Stat..Value" "IMD" "VAT" "IEX" "DEX"
[31] "EXD" "OED" "RawDataRowID" "HS.Code_4" "SITC_3"
[36] "Year" "Month" "Day" "Mode.of.Transport" "HS.Section"
[41] "SITC.description" "ï..IMPORT.CTY.CD" "CTY_Origin" "IMPORT.COUNTRY" "IMPORT.REGION"
[46] "IMPORT.REG." "IMPORT.PARTNER.COUNTRIES" "ï..CTY_Dest" "EXPORT.CTY.CD" "EXPORT.COUNTRY"
[51] "EXPORT.REGION" "EXPORT.REG." "EXPORT.PARTNER.COUNTRIES" "TAR_ALL" "TAR_DSC"
[56] "Principle.Exports" "Principle.ReExports" "Principle.Imports" "Classifications.Combined" "Unit.Qty"
[61] "Import.Duty.Rate" "HS92" "HS96" "HS02" "SITC2"
[66] "HS07" "BEC4" "BEC.Descriptions" "Retained.Goods" "SITC1"
[71] "HS12" "SITC3a" "SITC3" "SITC4"
```
• Line 658 “Get the trade stats for the exports aligned to the MSG agreement”. The PRF column records imports aligned to the MSG agreement but not exports. So have merged MSG agreement commodities into dataset, and then used column “Not.Included.in.MSG” to filter out right information. The code here I have written seems extremely long winded, is there way to shorten it?
A slightly simpler approach that avoids merging:
```r
msgAgreementCommoditiesExcludeFile <- file.path(openDataFolder, "OPN_FINAL_ASY_MSGClassifications_31-01-20.csv")
msgAgreementCommoditiesExclude <- read.csv(msgAgreementCommoditiesExcludeFile, header = TRUE,
                                           na.strings = c("", "NA", "NULL", "null"))
hsCodesToExclude <- sapply(msgAgreementCommoditiesExclude$HS.Code_6, FUN = padWithZeros, "HS", 6)
tradeStatsForMSGExports <- processedTradeStats[processedTradeStats$HS.Code_6 %in% hsCodesToExclude == FALSE, ]
```
• Line 670 & 686: Checking here value differences between using group_by() and creating data-frame to look at top ten exporting commodities. Here figures are different, wasn’t sure why this is. Maybe because group_by() includes exports and reexports (1000, 3071) and data-frame only exports (1000)? • Create table and extract to excel xlsx output sheet 11. To show Top ten exports highlighted in the data-frame (simple table- top ten exports aligned to MSG and their statistical value) • Similar issues from imports from previous two points (figures not matching, create table to extract and add to excel)
Lines 670-674 not matching what's calculated at 684-703 - possibly because 684-703 explicitly filters by CP4 (1000).
Not sure I quite understand these - added a TODO below to discuss tomorrow.
Table 14 • Have to sort out classifications, but think code looks good to create data-frame. Just have to format into table and extract into excel
NSDP Indicators 1.5.2, 1.5.3, 1.6.2, 1.7.1 seem ok. Would it be possible to review code and clean up? • NSDP Indicator Targets: there is hard coding when choosing current month, need to get rid of. Target 1.7.1 not properly working (merge not functioning properly maybe) • Still have to adapt table and connect to indicators & targets
Probably worth doing a quick run through of these tomorrow so I know where to look. Definitely scope to clean up code in all scripts 👍.
Comparing to the same month in the 2016 baseline data. The month is currently hard coded as 1 (for January) at line 292, under `# Calculate value of Trade Agreements of GDP for baseline year for current month`, but would need to be updated with every new month.
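To avoid updating it by hand each month, the month number could be derived from the report date (a sketch; `reportDate` is an assumed variable, not one from the script):

```r
# Derive the current month number from the reporting date
reportDate <- as.Date("2021-04-30")
currentMonthNumber <- as.numeric(format(reportDate, "%m"))
```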
Trade Balance by Major Partner Countries • Need to take data-frame and visualise it on the map. If space maybe worth adding small table next to map, is this possible with specific values for countries?
Should be possible but the table would end up being quite small so I'm not sure if it's worth it. First step is to get values reflected on map and add legend.
Trade Balance of new emerging markets • Need to add values above or below bars, as current difficult to see exports as values are small compared to imports
Yes, definitely possible, and I agree it would be helpful. Here's quite a nice answer on StackOverflow about how to add labels above bars with ggplot.
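The usual approach is `geom_text()` on top of the bars - a minimal sketch (toy data, not the report's actual data.frame):

```r
library(ggplot2)

df <- data.frame(Country = c("Fiji", "Vanuatu"), Value = c(100, 5))

# vjust = -0.5 nudges each label just above its bar
barChartWithLabels <- ggplot(df, aes(x = Country, y = Value)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = Value), vjust = -0.5)
```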
Also as a reminder it looks as though some products such as Kava roots are not coded under kava and I found them when searching data frame processedTradeStats and searching "Goods.Comm..Dsc" column.
Is it possible to search for words in R?
Yes definitely - R has a suite of functions for searching for patterns in text. You could take a look at the `grep()` and `str_detect()` functions.
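For example, a case-insensitive search for "kava" in a description column (toy data; in the real data this would be the Goods.Comm..Dsc column):

```r
# Toy description column
descriptions <- c("Kava roots, dried", "Copra", "KAVA powder", "Coconut oil")

# grep() returns the indices of the matching elements
kavaRows <- grep("kava", descriptions, ignore.case = TRUE)
```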
Hi @hugo-e-pigott,
As you'll see above, I've added some comments to some of the tasks and questions you've noted. I've also added TODO checkboxes where there's a particular thing for both of us to consider.
One general note is that this issue has become a little unwieldy and lengthy - it might be worthwhile splitting the tasks across multiple issues so it's a little clearer to navigate through the conversations.
I hope the above comments are helpful and looking forward to discussing tomorrow.
Joe