Script 1- Processing

Will share the CSV that is immediately downloaded from the Customs system for April. Will delete the identifiable data. Will have to write code to potentially delete these columns and rows. Potentially do this at the end and produce two csv's, one that is anonymised and another with individual information
Converting SITC to character (line 57), SITC Code 8431.2 isn't in convertible format. Anna checked this in HS Code tables that have corresponding SITC codes, and in fact SITC code should be 0744.92. Have gone back into raw data and changed. Now running no issues
Explore extent of missing data (line 84), checking for large amounts of missing data (>0.1). Can we also look at missing data if only a few cells, maybe create data-frames for each column with missing data 9similar to PRF data-frame)?
Extract 3 and 1 digit SITC Codes (line 100 & 101), two null or NA values in SITC column skipped (HS Codes- 71189000 & 99050000). Checked HS Tables, document states HS Code 71189000 corresponding SITC is null (will have to confirm with Customs), but HS Code 99050000 corresponding SITC is 0990.15- have gone back into raw data and changed. Now only one running issue.
NAs present in any of the above columns (line 104), code View(tradeStatsCommodities[is.na(tradeStatsCommodities$column_name), ]) may not be working
Been through checkingMergingColumnsForClassificationTables() function, would it be possible to quickly explain how this works? Also not sure of the format of the data-frame produced "missingValuesHSCode_2" could explain also?
Merging SITC with cleaned data (line 158). Warning message 1 code in SITC cleaned data that not in classifications. Again not sure of format of data-frame "missingValuesSITC_1" (this to be similar to other merges)
Merging Import with cleaned data (line 171). Warning message "185 codes in the "CO" column of the Import country classification table are not in the "CO" column of the trades statistics table"- very possible.
Merging export with cleaned data (line 184). Warning message "194 codes in the "CE.CD" column of the Export country classification table are not in the "CE.CD" column of the trades statistics table"- very possible
Merging principle commodities with cleaned data (line 198). Warning message "378 codes in the "HS.Code" column of the trades statistics table are not in the "HS.Code" of the Principle Commodities classification table". Need to invest age further and add missing information to classifications table. Also how is this possible when merge previously change in observations is only about 5? Another warning message "3759 codes in the "HS.Code" column of the Principle Commodities classification table are not in the "HS.Code" column of the trades statistics table"- very possible
Merging BEC values with cleaned data (line 211). Warning message "611 codes in the "HS.Code_6" column of the trades statistics table are not in the "HS.Code_6" of the BEC descriptions classification table". Need to invest age further and add missing information to classifications table. Another warning message "3769 codes in the "HS.Code_6" column of the BEC descriptions classification table are not in the "HS.Code_6" column of the trades statistics table"- very possible
For merging principle commodities and BEC have to think through strategy to check missing with UN outline
Check number of rows with NAs present (line 218), not sure what this code is doing- is it a follow on from line 104, highlighting PRF NAs?
Tried to run code (line 235) checking commodity values against expected values and get error "Error in match.names(clabs, names(xi)) : names do not match previous names"

Script 2 Formatted Tables

Code is working well. Need to develop a Table 11, and write code for table 14. Additionally quality check all outputs against previous mythology

Script 3 Build Monthly Report

Loading packages getting error with "flextable" (line 14). Have used stack overflow link (https://stackoverflow.com/questions/48199737/cant-install-flextable) and install Homebrew but still getting error below, maybe a Mac issue that was highlighted in one of the comments

"Error: package or namespace load failed for ‘flextable’ in dyn.load(file, DLLpath = DLLpath, ...): unable to load shared object '/Library/Frameworks/R.framework/Versions/3.6/Resources/library/systemfonts/libs/systemfonts.so': dlopen(/Library/Frameworks/R.framework/Versions/3.6/Resources/library/systemfonts/libs/systemfonts.so, 6): Library not loaded: /opt/X11/lib/libfreetype.6.dylib Referenced from: /Library/Frameworks/R.framework/Versions/3.6/Resources/library/systemfonts/libs/systemfonts.so Reason: image not found"

NSDP Indicators, have taken out ECO 1.4.1- no longer think can be used as a Proxy. The Indicators have are ECO 1.5.2; 1.5.3; 1.6.2; 1.7.1. Also think we could add ENV 1.3.1 and ECO 4.3.1; 4.3.2. However for now will focus on ECO 1.5.2-1.7.1. Below is link to NSDP M&E Framework https://www.gov.vu/images/publications/NSDP%20M&E%20Framework.pdf
Indicator 1.5.3, code from line 122-125, is this needed? Not quite sure what it is doing
Indicator 1.6.2, used hard coding in line 139, how do I adapt this?
Calculate Target 1.5.2 (line 170). Using MSG, however in 2019 data PICTA also used, have to add MSG and PICTA to all calculations of Trade Agreements (Line 173). Hard coding again used in line 188- trying to compare current month being analysed to historical months to estimate whether Target being achieved
Calculate Target 1.5.3 (line 192). Have created baseline datasets, using historical data "imports_HS_summaryStats_02-10-20.csv" and "exports_HS_summaryStats_02-10-20.csv", subset of 2016. Not sure how to calculate monthly Trade balance for major partner countries per month.
Calculate Target 1.6.2 (line 203). Not sure how to calculate value of trade agreement of GDP per month using 2016 GDP to create baseline to compare current data too.
Calculate Target 1.7.1 (line 220). Similar issue to target 1.5.3. Not sure how to calculate monthly major commodities per month.
Trade Balance by Major Partner Countries: The Data Frame is created, not sure how to visualise it with a map. Most likely need to add a table too to highlight, exports and imports
Trade Balance of Pacific islands: Hard coding used again lines 396 & 426. Have calculated top 5 imports and exports from Pacific countries. Need to exclude countries with 0, and create bar chart for all others, not sure how to do this
Trade Balance of New Emerging Markets: Is it possible to talk me through this code, not following it correctly?
Trade by Trade Agreement: Is it possible to talk me through this code, not following it correctly?

Script 1- Processing

Will share the CSV that is immediately downloaded from the Customs system for April. Will delete the identifiable data. Will have to write code to potentially delete these columns and rows. Potentially do this at the end and produce two csv's, one that is anonymised and another with individual information

Is this the SEC_PROC_ASY_AnnoymisedRawDataAndReferenceTables_30-04-21.csv file you shared in Skype? We'll also need to the formatted tables up until March 2021 - sorry if you have already sent this and I missed it! 😊

Converting SITC to character (line 57), SITC Code 8431.2 isn't in convertible format. Anna checked this in HS Code tables that have corresponding SITC codes, and in fact SITC code should be 0744.92. Have gone back into raw data and changed. Now running no issues

Great! 👍

Explore extent of missing data (line 84), checking for large amounts of missing data (>0.1). Can we also look at missing data if only a few cells, maybe create data-frames for each column with missing data 9similar to PRF data-frame)?

The numberMissing and proportionMissing vectors provide the number of missing values and proportion of missing values in each column. Here's the numberMissing variable:

   Office   Reg.Ref Reg..Date   Type       CP4       CP3 ...
    0         0         0         0         0         0  ...

Extract 3 and 1 digit SITC Codes (line 100 & 101), two null or NA values in SITC column skipped (HS Codes- 71189000 & 99050000). Checked HS Tables, document states HS Code 71189000 corresponding SITC is null (will have to confirm with Customs), but HS Code 99050000 corresponding SITC is 0990.15- have gone back into raw data and changed. Now only one running issue.

Yes, I think these are carried through from NA values in the SITC column.

NAs present in any of the above columns (line 104), code View(tradeStatsCommodities[is.na(tradeStatsCommodities$column_name), ]) may not be working

This is an example piece of code - you'll need to replace column_name with the name of column that you're interested in. I've added an explanation into the comment.

Been through checkingMergingColumnsForClassificationTables() function, would it be possible to quickly explain how this works? Also not sure of the format of the data-frame produced "missingValuesHSCode_2" could explain also?

The checkingMergingColumnsForClassificationTables() function asks two questions:

Are there any values in the merge column (the column with common values in trades and classification tables that was used for merging) of the trade statistics table that aren't present in merge column of the classification table? https://github.com/Vanuatu-National-Statistics-Office/vnso-RAP-tradeStats-materials/blob/a7b18097068f30fa8fa3837af7159b872446e7af/R/functions.R#L15-L16
Are there any values in the merge column of the classification table not present in the merge column of the trade statistics table? https://github.com/Vanuatu-National-Statistics-Office/vnso-RAP-tradeStats-materials/blob/a7b18097068f30fa8fa3837af7159b872446e7af/R/functions.R#L22-L23

Based on the questions the function will warn the user if any values missing (in each direction of the question), it will also return the missing values in a list structure (missingValuesHSCode_2 for the check on the mode of transport merge):

$NotPresentInClassificationTable
character(0)

$NotPresentInTradesData
character(0)

In the above example no missing values were identified so the NotPresentInClassificationTable and NotPresentInTradesData slots of the list are empty (character(0)).

Because missing values were identified in the merge column of the trades data when compared to the HS 2 classification table, the missingValuesHSCode_2 looks like this:

$NotPresentInClassificationTable
character(0)

 $NotPresentInTradesData
 [1] "13" "26" "43" "47" "50" "53" "55" "75" "77" "80" "81" "93"

Merging SITC with cleaned data (line 158). Warning message 1 code in SITC cleaned data that not in classifications. Again not sure of format of data-frame "missingValuesSITC_1" (this to be similar to other merges)

The merging on SITC for me reports 2 missing values so missingValuesSITC_1 has the following content:

$NotPresentInClassificationTable
[1] NA NA

$NotPresentInTradesData
integer(0)

telling us that there are two values (both NAs) that aren't in the SITC classification table. We could update the checkMergingColumnsForClassificationTables() function to ignore NA values and also only report unique values?

Merging Import with cleaned data (line 171). Warning message "185 codes in the "CO" column of the Import country classification table are not in the "CO" column of the trades statistics table"- very possible.

Yes, we can discuss these warning messages - I think it's good to have so we're always checking the classification tables but in many cases missing data are expected because classification tables often represent a subset of the data.

Merging export with cleaned data (line 184). Warning message "194 codes in the "CE.CD" column of the Export country classification table are not in the "CE.CD" column of the trades statistics table"- very possible

As above.

Merging principle commodities with cleaned data (line 198). Warning message "378 codes in the "HS.Code" column of the trades statistics table are not in the "HS.Code" of the Principle Commodities classification table". Need to invest age further and add missing information to classifications table. Also how is this possible when merge previously change in observations is only about 5? Another warning message "3759 codes in the "HS.Code" column of the Principle Commodities classification table are not in the "HS.Code" column of the trades statistics table"- very possible

As above.

Merging BEC values with cleaned data (line 211). Warning message "611 codes in the "HS.Code_6" column of the trades statistics table are not in the "HS.Code_6" of the BEC descriptions classification table". Need to invest age further and add missing information to classifications table. Another warning message "3769 codes in the "HS.Code_6" column of the BEC descriptions classification table are not in the "HS.Code_6" column of the trades statistics table"- very possible

As above.

For merging principle commodities and BEC have to think through strategy to check missing with UN outline We should revisit this at our next catch up - if there's a UN BEC table available a quick check should be pretty straightforward.

Check number of rows with NAs present (line 218), not sure what this code is doing- is it a follow on from line 104, highlighting PRF NAs?

This code is looking at rows instead of columns and ignoring the PRF column because almost all rows have an NA in this column. The rowsWithNAValues data.frame is a subset of the trades data of rows containing NA values in any column except the PRF column. Not sure if this is useful, if not we could remove?

Tried to run code (line 235) checking commodity values against expected values and get error "Error in match.names(clabs, names(xi)) : names do not match previous names"

This is likely caused by a change in a column name - I've added some more checks in the code that will hopefully send a more informative error message and help us figure out what's changed.

Script 2 Formatted Tables

Code is working well. Need to develop a Table 11, and write code for table 14. Additionally quality check all outputs against previous mythology

Great, and those are good next steps 👍

Script 3 Build Monthly Report

Loading packages getting error with "flextable" (line 14). Have used stack overflow link (https://stackoverflow.com/questions/48199737/cant-install-flextable) and install Homebrew but still getting error below, maybe a Mac issue that was highlighted in one of the comments

"Error: package or namespace load failed for ‘flextable’ in dyn.load(file, DLLpath = DLLpath, ...):
unable to load shared object 
'/Library/Frameworks/R.framework/Versions/3.6/Resources/library/systemfonts/libs/systemfonts.so':
dlopen(/Library/Frameworks/R.framework/Versions/3.6/Resources/library/systemfonts/libs/systemfonts.so, 6): Library not 
loaded: /opt/X11/lib/libfreetype.6.dylib
Referenced from: /Library/Frameworks/R.framework/Versions/3.6/Resources/library/systemfonts/libs/systemfonts.so
Reason: image not found"

From the above it looks like you were able to install the flextable library without an error? And the error is coming when you try and load it? It looks like this isn't an error that is specific to flextable (also occurs for kableExtra. You could try the following, if you haven't already:

Installing/re-installing the systemfonts package
Re-installing the felxtable package
Re-installing xquartz
Updating R version and re-installing flextable

NSDP Indicators, have taken out ECO 1.4.1- no longer think can be used as a Proxy. The Indicators have are ECO 1.5.2; 1.5.3; 1.6.2; 1.7.1. Also think we could add ENV 1.3.1 and ECO 4.3.1; 4.3.2. However for now will focus on ECO 1.5.2-1.7.1. Link to NSDP M&E Framework: https://www.gov.vu/images/publications/NSDP%20M&E%20Framework.pdf

Pasting screenbshot of these here so we can see them:

Indicator 1.5.3, code from line 122-125, is this needed? Not quite sure what it is doing

This code (shown below) is removing the ": Exports" from balanceOfTradeByMajorPartnerCountries data.frame. I've simplified to the following:

colnames(balanceOfTradeByMajorPartnerCountries) <- gsub(pattern = ": Exports", replacement = "", colnames(balanceOfTradeByMajorPartnerCountries))

Indicator 1.6.2, used hard coding in line 139, how do I adapt this?

I've added the following code:

colnames(timeSeriesGrossDomesticProduct) <- c("Year", "Value")
timeSeriesGrossDomesticProduct <- timeSeriesGrossDomesticProduct[order(timeSeriesGrossDomesticProduct$Year), ]

# Proxy indicator for NSDP ECO 1.6.2
mostRecentGrossDomesticProduct <- timeSeriesGrossDomesticProduct[nrow(timeSeriesGrossDomesticProduct), "Value"] #identify 2018 GDP

The above forces reasonable column names, orders the rows by the year, and selects the value in the second column in the last row. If you are only ever wanting 2018, change last line to:

mostRecentGrossDomesticProduct <- timeSeriesGrossDomesticProduct[timeSeriesGrossDomesticProduct$Year == 2018, "Value"] #identify 2018 GDP

Calculate Target 1.5.2 (line 170). Using MSG, however in 2019 data PICTA also used, have to add MSG and PICTA to all calculations of Trade Agreements (Line 173). Hard coding again used in line 188- trying to compare current month being analysed to historical months to estimate whether Target being achieved

Not sure I understand this one - are you always wanting to compare to January 2019 or is January the current month for the report and you've hard coded it to work for January? If it's the latter you could use the built in month.name or month.abb` variables in R to identify what month number you want to use:

myMonth <- "February"
myMonthAsNumber <- which(month.name == myMonth)

myAbbreviatedMonth <- "Feb"
myMonthAsNumber <- which(month.abb == myAbbrevaiatedMonth)

Calculate Target 1.5.3 (line 192). Have created baseline datasets, using historical data "imports_HS_summaryStats_02-10-20.csv" and "exports_HS_summaryStats_02-10-20.csv", subset of 2016. Not sure how to calculate monthly Trade balance for major partner countries per month.

I think this is a good one to have a chat about at out catch up. I think you'd want to do something like the following:

Select import records for major partner countries
Aggregate major partner imports by month
Select exports for major partner countries
Aggregate major partner exports by month
Merge Tables created by steps 2. and 4. using the country column
Create a trade balance column by taking the imports column away from the exports column

Calculate Target 1.6.2 (line 203). Not sure how to calculate value of trade agreement of GDP per month using 2016 GDP to create baseline to compare current data too.

I think this would a be good one to chat about at our catch up - just so we're on the same page about the steps involved.

Calculate Target 1.7.1 (line 220). Similar issue to target 1.5.3. Not sure how to calculate monthly major commodities per month.

I think this would a be good one to chat about at our catch up - just so we're on the same page about the steps involved.

Trade Balance by Major Partner Countries: The Data Frame is created, not sure how to visualise it with a map. Most likely need to add a table too to highlight, exports and imports

For plotting on a map, you'll need:

A set of world boundaries (currently using rnarualearth package for this but you can download a shape file from anywhere (for the leaflet map we used this one).
Load the set of boundaries using the sf spatial package
Combine this spatial data with your statistic of interest using a spatial merge on an ISO code column
Once you've combined you can plot using ggplot()

I've added some example code:

# Get the polygons for the world
worldmap <- rnaturalearth::ne_download(scale = 110,
                                       type = "countries",
                                       category = "cultural",
                                       destdir = tempdir(),
                                       load = TRUE,
                                       returnclass = "sp")

# Convert to sf spatial feature
world <- sf::st_as_sf(worldmap)

# Generate some example data
countryRecords <- data.frame("ISO3" = sample(world$ADM0_A3, size = 50),
                         "RandomValue" = runif(50))

# Merge in my example data
world <- merge(world, countryRecords, by.x = "ADM0_A3", by.y = "ISO3", all.x = TRUE)

# Plot with ggplot
ggplot(data = world) +
    geom_sf(aes(fill = RandomValue))

Trade Balance of Pacific islands: Hard coding used again lines 396 & 426. Have calculated top 5 imports and exports from Pacific countries. Need to exclude countries with 0, and create bar chart for all others, not sure how to do this

Here's some rough code that combines Import, Export and Balance into single data.frame and feeds into a bar chart that I've added to report script:

# Combine Imports and Exports into single table
totalPacificCountryImportsTransposed <- as.data.frame(t(totalPacificCountryImports))
totalPacificCountryImportsTransposed$Country <- gsub(": Exports|: Imports", "", rownames(totalPacificCountryImportsTransposed))
totalPacificCountryExportsTransposed <- as.data.frame(t(totalPacificCountryExports))
totalPacificCountryExportsTransposed$Country <- gsub(": Exports|: Imports", "", rownames(totalPacificCountryExportsTransposed))
pacificCountryTotals <- merge(totalPacificCountryImportsTransposed, totalPacificCountryExportsTransposed,
                              by = "Country", all = TRUE, suffixes = c("Imports", "Exports"))
colnames(pacificCountryTotals) <- gsub("V1", "", colnames(pacificCountryTotals))

# Calculate the trade balance
pacificCountryTotals$Balance <- pacificCountryTotals$Exports - pacificCountryTotals$Imports

# Remove the total row and pivot to be in format for bar chart
pacificCountryTotalsLong <- pacificCountryTotals %>%
  filter(Country != "Total") %>%
  pivot_longer(cols = c("Imports", "Exports", "Balance"), 
               names_to = "Type", values_to = "Value")

# Create a bar chart
ggplot(pacificCountryTotalsLong, aes(x=Country, y=Value, fill=Type)) +
  geom_bar(stat="identity", color="black", position=position_dodge()) +
  theme_minimal() + 
  scale_fill_brewer(palette="Blues") +
  theme(legend.title=element_blank(),
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Trade Balance of New Emerging Markets: Is it possible to talk me through this code, not following it correctly?

Yes, best to do this during our catch up I think.

Trade by Trade Agreement: Is it possible to talk me through this code, not following it correctly?

Yes, best to do this during our catch up I think.

*Script 1

Create API to connect to Customs System. This was Stanley’s response who leads on their database.

Line 29: “Delete header row from raw data”- this code is not working. When read in data, the top row is blank- not the column headers. This must be deleted and then row below made into headers. Have tried a few different ways to write the code but none working.

Line 87: After “Set the plotting margins” get a blank pop up that says “Quartz”. Not sure what this is unless supposed to be the bar chart visualisation of missing data…. But as stated the pop up is blank

Merging • Principle Commodities: Currently 378 codes missing from classification table- have to create code to take all rows of this data and extract to CSV, Anna can then review before integrating into the classification table. Also, after merging the number of observations changes from 12,539 to 12,544, why is this? • BEC Classifications: Got classifications from UN in region of 50,000 however only realised after merging that a number of the HS.Code_6 are duplicates. After using new classifications in merge went down from 628 missing codes from classification table to 233. However, number of observations jumped from 12,544 to 175,995! Two questions: How do I delete duplicates from a specific column (HS.Code_6)? Same as above create code to take all rows of data that is not in classification table, create a data-frame and extract to csv

Line 245: “Check the commodity values against expected values based on historic data” this is still giving an error.

*Script 2

Table 11 • Line 658 “Get the trade stats for the exports aligned to the MSG agreement”. The PRF column records imports aligned to the MSG agreement but not exports. So have merged MSG agreement commodities into dataset, and then used column “Not.Included.in.MSG” to filter out right information. The code here I have written seems extremely long winded, is there way to shorten it? • Line 670 & 686: Checking here value differences between using group_by() and creating data-frame to look at top ten exporting commodities. Here figures are different, wasn’t sure why this is. Maybe because group_by() includes exports and reexports (1000, 3071) and data-frame only exports (1000)? • Create table and extract to excel xlsx output sheet 11. To show Top ten exports highlighted in the data-frame (simple table- top ten exports aligned to MSG and their statistical value) • Similar issues from imports from previous two points (figures not matching, create table to extract and add to excel)

Table 14 • Have to sort out classifications, but think code looks good to create data-frame. Just have to format into table and extract into excel

*Script 3 (Overall comment, possible to clean/reduce code to be more efficient)

NSDP Indicators (Policy, Indicator, Target)

• NSDP Indicators 1.5.2, 1.5.3, 1.6.2, 1.71 seem ok. Would be possible to review code and clean up? • NSDP Indicator Targets: there is hard coding when choosing current month need to get rid of. Target 1.7.1 not properly working (merge not functioning properly maybe) • Still have to adapt table and connect to indicators & targets

Trade Balance by Major Partner Countries • Need to take data-frame and visualise it on the map. If space maybe worth adding small table next to map, is this possible with specific values for countries?

Trade Balance of Pacific Countries • Exclude those with zero exports and imports

Trade Balance of new emerging markets • Need to add values above or below bars, as current difficult to see exports as values are small compared to imports

Trade by Trade Agreement (statistical value) • Create cross tab table of exports top 5 to specific pacific islands part of MSG (rows to have exports, column headers country- Fiji, New Caledonia, Papa New Guinea, Solomon Islands) • Create cross tab table of imports top 5 from specific pacific islands part of MSG (rows to have imports, column headers country- Fiji, New Caledonia, Papa New Guinea, Solomon Islands)

Principle Exports & Imports • Have to take current monthly calculation “majorExportsCurrentMonth” & “majorImportsCurrentMonth”and add to historical data-frame. Then visualise monthly values

Top 5 new Major Exports and Imports • “Select top 5 new exports” & “Select top 5 new imports”. Tried to select top five rows using this piping function but not working.

Imports of Products related to Health • Still have to do

Also as a reminder it looks as though some products such as Kava roots are not coded under kava and I found them when searching data frame processedTradeStats and searching "Goods.Comm..Dsc" column.

Is it possible to search for words in R?

Create API to connect to Customs System. This was Stanley’s response who leads on their database.

We can chat about this but I think the next step is to start building a url string and using the functions in the httr and jsonlite packages to run some queries. Something like what is being done in page 2 on this tutorial.

Line 29: “Delete header row from raw data”- this code is not working. When read in data, the top row is blank- not the column headers. This must be deleted and then row below made into headers. Have tried a few different ways to write the code but none working.

Best thing to do in this case is to use the skip parameter in read.csv() function. In this case setting skip = 1 when calling read.csv() will mean R starts reading the dataframe from the second row and skips the first line.

Line 87: After “Set the plotting margins” get a blank pop up that says “Quartz”. Not sure what this is unless supposed to be the bar chart visualisation of missing data…. But as stated the pop up is blank

Weird one, not sure why "Quartz" would be popping up. par(mar=c(10, 4, 4, 1)) is just setting the plotting margins and I think the plot has now been removed so this can be to.

[ ] TODO - check with @hugo-e-pigott to replicate this one

Principle Commodities: Currently 378 codes missing from classification table- have to create code to take all rows of this data and extract to CSV, Anna can then review before integrating into the classification table. Also, after merging the number of observations changes from 12,539 to 12,544, why is this?

Added a TODO here for getting missing data and quality issues flagged in csv for Anna. On the increasing number of rows, this is caused by duplicate values in the HS.Code column of the principle commodities table.

[ ] TODO rethink flagging of missing data and quality issues - map to raw data and export into CSV
[ ] TODO only merge necessary columns - some duplicates being brought in on merging?

• BEC Classifications: Got classifications from UN in region of 50,000 however only realised after merging that a number of the HS.Code_6 are duplicates. After using new classifications in merge went down from 628 missing codes from classification table to 233. However, number of observations jumped from 12,544 to 175,995! Two questions: How do I delete duplicates from a specific column (HS.Code_6)? Same as above create code to take all rows of data that is not in classification table, create a data-frame and extract to csv

Yes the observations jumping is worrying - we should add a check for duplicates on the columns used for merging each of the classifications tables. For removing duplicates you can use the duplicated() or [distinct()](https://www.datasciencemadesimple.com/remove-duplicate-rows-r-using-dplyr-distinct-function/#:~:text=DataScience%20Made%20Simple-,Remove%20Duplicate%20rows%20in%20R%20using%20Dplyr%20%E2%80%93%20distinct%20()%20function,variable%20or%20with%20multiple%20variable.) functions.

[ ] TODO Add in approach to check for duplicate values in classification ID columns used for merging - remove where present

Line 245: “Check the commodity values against expected values based on historic data” this is still giving an error.

This will be a column naming issue I think - I think we are working on different versions of the trade statistics data. NOte that this could be caused by changing headers at lines 29-32 - I'd recommend removing this and using skip paramter in reaad.csv() function.

[ ] TODO @hugo-e-pigott to share version of trade statistics data so @JosephCrispell can replicate error
[x] TODO @JosephCrispell to send screenshot of headers for the historical imports/exports table

Column names for SEC_PROC_ASY_RawDataAndReferenceTables_31-01-20.csv:

 [1] "Office"            "Reg.Ref"           "Reg..Date"         "Type"              "CP4"               "CP3"               "Declarant"        
 [8] "Itm.."             "HS.Code"           "SITC"              "PRF"               "Goods.Comm..Dsc.1" "Goods.Comm..Dsc."  "CO"               
[15] "CE.CD"             "Wgt..Gross"        "Wgt..Net"          "Pkg.."             "Pkg..type"         "Supp.Qty"          "Supp"             
[22] "TOD"               "Stat..Value"       "X"                 "IMD"               "VAT"               "IEX"               "DEX"              
[29] "EXD"               "OED"

Column names for imports_HS_summaryStats_02-10-20.csv:

 [1] "HS"                     "Value.Mean"             "Value.SD"               "Value.Median"           "Value.Lower.2.5"        "Value.Upper.97.5"      
 [7] "Value.Lower.1"          "Value.Upper.99"         "Value.Min"              "Value.Max"              "Value.Count"            "Value.CountMissing"    
[13] "UnitValue.Mean"         "UnitValue.SD"           "UnitValue.Median"       "UnitValue.Lower.2.5"    "UnitValue.Upper.97.5"   "UnitValue.Lower.1"     
[19] "UnitValue.Upper.99"     "UnitValue.Min"          "UnitValue.Max"          "UnitValue.Count"        "UnitValue.CountMissing"

Column names for exports_HS_summaryStats_02-10-20.csv:

colnames(historicExportsSummaryStats)
 [1] "HS"                     "Value.Mean"             "Value.SD"               "Value.Median"           "Value.Lower.2.5"        "Value.Upper.97.5"      
 [7] "Value.Lower.1"          "Value.Upper.99"         "Value.Min"              "Value.Max"              "Value.Count"            "Value.CountMissing"    
[13] "UnitValue.Mean"         "UnitValue.SD"           "UnitValue.Median"       "UnitValue.Lower.2.5"    "UnitValue.Upper.97.5"   "UnitValue.Lower.1"     
[19] "UnitValue.Upper.99"     "UnitValue.Min"          "UnitValue.Max"          "UnitValue.Count"        "UnitValue.CountMissing"

Column names for tradeStatsCommoditiesMergedWithClassifications variable:

 [1] "HS.Code_6"                "HS.Code"                  "CE.CD"                    "CO"                       "SITC_1"                  
 [6] "HS.Code_2"                "Office"                   "Reg.Ref"                  "Reg..Date"                "Type"                    
[11] "CP4"                      "CP3"                      "Declarant"                "Itm.."                    "SITC"                    
[16] "PRF"                      "Goods.Comm..Dsc.1"        "Goods.Comm..Dsc."         "Wgt..Gross"               "Wgt..Net"                
[21] "Pkg.."                    "Pkg..type"                "Supp.Qty"                 "Supp"                     "TOD"                     
[26] "Stat..Value"              "IMD"                      "VAT"                      "IEX"                      "DEX"                     
[31] "EXD"                      "OED"                      "RawDataRowID"             "HS.Code_4"                "SITC_3"                  
[36] "Year"                     "Month"                    "Day"                      "Mode.of.Transport"        "HS.Section"              
[41] "SITC.description"         "ï..IMPORT.CTY.CD"         "CTY_Origin"               "IMPORT.COUNTRY"           "IMPORT.REGION"           
[46] "IMPORT.REG."              "IMPORT.PARTNER.COUNTRIES" "ï..CTY_Dest"              "EXPORT.CTY.CD"            "EXPORT.COUNTRY"          
[51] "EXPORT.REGION"            "EXPORT.REG."              "EXPORT.PARTNER.COUNTRIES" "TAR_ALL"                  "TAR_DSC"                 
[56] "Principle.Exports"        "Principle.ReExports"      "Principle.Imports"        "Classifications.Combined" "Unit.Qty"                
[61] "Import.Duty.Rate"         "HS92"                     "HS96"                     "HS02"                     "SITC2"                   
[66] "HS07"                     "BEC4"                     "BEC.Descriptions"         "Retained.Goods"           "SITC1"                   
[71] "HS12"                     "SITC3a"                   "SITC3"                    "SITC4"

• Line 658 “Get the trade stats for the exports aligned to the MSG agreement”. The PRF column records imports aligned to the MSG agreement but not exports. So have merged MSG agreement commodities into dataset, and then used column “Not.Included.in.MSG” to filter out right information. The code here I have written seems extremely long winded, is there way to shorten it?

A slightly simpler approach that avoids merging:

msgAgreementCommoditiesExcludeFile <- file.path(openDataFolder, "OPN_FINAL_ASY_MSGClassifications_31-01-20.csv")
msgAgreementCommoditiesExclude <- read.csv(msgAgreementCommoditiesExcludeFile, header=TRUE, na.strings=c("","NA", "NULL", "null"))
hsCodesToExclude <- sapply(msgAgreementCommoditiesExclude$HS.Code_6, FUN=padWithZeros, "HS", 6)
tradeStatsForMSGExports <- processedTradeStats[processedTradeStats$HS.Code_6 %in% hsCodesToExclude == FALSE, ]

• Line 670 & 686: Checking here value differences between using group_by() and creating data-frame to look at top ten exporting commodities. Here figures are different, wasn’t sure why this is. Maybe because group_by() includes exports and reexports (1000, 3071) and data-frame only exports (1000)? • Create table and extract to excel xlsx output sheet 11. To show Top ten exports highlighted in the data-frame (simple table- top ten exports aligned to MSG and their statistical value) • Similar issues from imports from previous two points (figures not matching, create table to extract and add to excel)

Lines 670-674 not matching whats calculated at 684-703 - possibly because 684-703 explicitly filters by CP4 (1000).

Not sure I quite understand these - added a TODO below to discuss tomorrow.

[ ] TODO Discuss these tasks to add clarity

Table 14 • Have to sort out classifications, but think code looks good to create data-frame. Just have to format into table and extract into excel

[ ] TODO Add code to insert formatted table into excel file

NSDP Indicators 1.5.2, 1.5.3, 1.6.2, 1.71 seem ok. Would be possible to review code and clean up? • NSDP Indicator Targets: there is hard coding when choosing current month need to get rid of. Target 1.7.1 not properly working (merge not functioning properly maybe) • Still have to adapt table and connect to indicators & targets

Probably worth doing a quick run through of these tomorrow so I know where to look. Definitely scope to clean up code in all scripts 👍. Comparing to same month back in to 2016 baseline data. Currently hard coded as 1 for January but would need to update with every new month. at line 292 under # Calculate value of Trade Agreements of GDP for baseline year for current month.

[ ] TODO Run through code for above indicators to discuss improvements

Trade Balance by Major Partner Countries • Need to take data-frame and visualise it on the map. If space maybe worth adding small table next to map, is this possible with specific values for countries?

Should be possible but the table would end up being quite small so I'm not sure if it's worth it. First step is to get values reflected on map and add legend.

[ ] TODO Update map with true values and add legend

Trade Balance of new emerging markets • Need to add values above or below bars, as current difficult to see exports as values are small compared to imports

Yes definitely possible and I agree would be helpful. Here's quite a nice answer on StackOverflow about how to add labels above bars with ggplot

Also as a reminder it looks as though some products such as Kava roots are not coded under kava and I found them when searching data frame processedTradeStats and searching "Goods.Comm..Dsc" column.

Is it possible to search for words in R?

Yes definitely R has a suite of functions for searching for patterns in text. You could take a look at the grep() and str_detect functions.

[ ] TODO @hugo-e-pigott to add a little more detail here so @JosephCrispell can point to solution using pattern based search.

Hi @hugo-e-pigott,

As you'll see above I've added some comments to some of the tasks and questions you've noted. I've also added TODO checkboxes where there's a particular thing for the both of us to consider.

One general note is that this issue has become a little unwieldy and lengthy - it might be worthwhile splitting the tasks across multiple issues so it's a little clearer to navigate through the conversations.

I hope the above comments are helpful and looking forward to discussing tomorrow.

Joe

Vanuatu-National-Statistics-Office / vnso-RAP-tradeStats-materials

Monthly Report- Whats left #94

Script 1- Processing

Script 2 Formatted Tables

Script 3 Build Monthly Report