SWS-Methodology / faoswsTrade

World trade data processing for the FAO Statistical Working System
http://www.fao.org/economic/ess/ess-home/en/
5 stars 2 forks source link

Set "endyear" to 2050 for all not updated HS->FCL correspondence links #127

Closed malexan closed 6 years ago

malexan commented 7 years ago

@chrMongeau (email message Tue, 18 Apr 2017 15:28:19 +0000):

Let's consider the year 2011. The R object "tldata_not_area_in_fcl_mapping" created by the module contains the countries that are removed from the dataset as there is no HS>FCL mapping. The list for 2011 is (on the first column the FAL code):

6 Andorra 16 Bangladesh 29 Burundi 45 Comoros 85 Greenland 95 Honduras 122 Lesotho 133 Mali 145 Micronesia (Federated States of) 147 Namibia 180 Palau 188 Saint Kitts and Nevis 191 Saint Vincent and the Grenadines 193 Sao Tome and Principe 224 Turks and Caicos Islands 237 Viet Nam 252 Unspecified Area 299 State of Palestine

Given the problem that I mentioned after here:

https://github.com/SWS-Methodology/faoswsTrade/issues/122#issuecomment-294132598

I wanted to check what happened to these countries. It turns out that in the original mapping (hsfclmap3) the only countries for which we never had a mapping are:

6 Andorra 122 Lesotho 145 Micronesia (Federated States of) 224 Turks and Caicos Islands 252 Unspecified Area

(The last one should not be a problem.). For the remaining countries we have some mapping. For instance, for Bangladesh we have:

        endyear
startyear 2004 2005 2006 2007
    2004    6    2    6 3690
    2005    0   26    0   12
    2006    0    0    4    4
    2007    0    0    0   40

Again, it seems to be and "endyear" issue. Thus, given that these countries are removed from the dataset:

tldata <- tldata %>% filter_(~reporter %in% unique(hsfclmap$area))

they will be re-introduced as "nonreporters" and the mirroring will take place. For Bangladesh, we have that the mirroring recovers around 72% of the monetary value of its trade.

I think the solution is to set the "endyear" to 2050 until we know specifically that the code changed in some specific year (and in this case, there should be a new "startyear" for the new code and a new "endyear" for the code that changed).

malexan commented 7 years ago

Abstract:

The solution is found and partly implemented. This partial implementation already gives the following improvement for TL data (number of not mapped trade flows): 2013: from 684 to 632 2014: from 7565 to 2216

No changes for ES data.

Details:

It turns out the task to set endyear to 2050 for all not-replaced codes is not trivial. Our "best before" data is linked to HS ranges. We can't simply check is a specific "outdated" HS range replaced by a new one similar range. The new HS range can differ from historical one by its boundaries, but both old and new ranges can have some common HS codes inside.

We have to move from ranges to individual HS codes and check each of them on "best before" year. Not a big problem, but a couple of other issues emerges. For example, with our current approach for moving from ranges to individual codes we get a mapping table with 45*10^9 records, what requires at least 350 Gb of RAM :) We can split the table by acceptable pieces or use a data base. Then after "best before" correction we should go back to HS ranges somehow (or work with individual HS codes, but it is also a question).

The proposed solution is the following:

  1. IMPLEMENTED. Extract from the whole mapping table hs6->fcl one-to-one mapping records suitable for all years (universal links). It decreases amount of records where we need to deal with "best before" and provides additional one-to-one hs6-fcl codes.
  2. TBD. All trade data records what can not be mapped at hs6-fcl stage are mapped with hs+ approach, where all correspondence links participate, including those what are not suitable for current year. A trade flow gets all possible links, including not suitable for this year. Then an updated function sel1FCL chooses one fcl code additionally taking into account the "best before". If there are several matches and one of them have suitable year range, it is taken. If there is no record with suitable year, other matching records are considered.
malexan commented 7 years ago

Fully implemented.

malexan commented 7 years ago

BTW, running time now is 42 minutes (parallel). Probably twice as long.

chrMongeau commented 7 years ago

Some countries are still having no HS > FCL map, though their mapping exist. The issue is again that the endyear variable is set to a number lower than the analised year. See, e.g., the results for 2015 (tldata_not_area_in_fcl_mapping is the set of countries that were dropped off the raw file because they had no mapping (while actually the reported issue applies; the only countries for which a mapping is actually missing are 7 and 276, i.e., ,Angola and Sudan, respectively):

> sapply(as.character(unique(tldata_not_area_in_fcl_mapping$reporter)), function(x) hsfclmap3 %>% filter(area == x) %$% table(startyear, endyear))
$`2`
         endyear
startyear 2011 2012
     2009    2 2942
     2010    0    2
     2012    0   10

$`7`
< table of extent 0 x 0 >

$`16`
         endyear
startyear 2004 2005 2006 2007
     2004    6    2    6 3690
     2005    0   26    0   12
     2006    0    0    4    4
     2007    0    0    0   40

$`32`
         endyear
startyear 2000 2002 2006 2008 2009 2011 2012
     1998    0    0    4    0    0   20 1546
     1999    0    0    0    0    0    0   92
     2000   12    0    2    0    0    0   32
     2001    0    0    2    0    0    0   32
     2002    0    4    0    0    0    0   24
     2003    0    0    0    0    0    0   12
     2004    0    0    0    0    0    0   12
     2005    0    0    0    0    0    0  150
     2006    0    0    2    0    0    2   32
     2007    0    0    0    2 1654    0    0
     2009    0    0    0    0    6    0    0
     2010    0    0    0    0    0    0   44
     2011    0    0    0    0    0    0   30
     2012    0    0    0    0    0    0   34

$`90`
         endyear
startyear 1995 1996 1997 2008
     1995    2    4    4 1278
     1996    0    0    0    2
     2008    0    0    0   22

$`103`
         endyear
startyear 2006 2009
     2005 1196    0
     2009    0 2536

$`113`
         endyear
startyear 1998 1999 2000 2002 2003 2004 2005 2006 2007 2008 2011 2012
     1998    1    0    1    0  966    0    4    0    0  660    0    0
     1999    0    1    0    0    0    0    0    0    0    0    0    0
     2000    0    0    0    0    1    0    0    0    0    0    0    0
     2002    0    0    0    3    0    0    0    0    0    0    0    0
     2003    0    0    0    0    1    0    0    0    0    0    0    0
     2004    0    0    0    0    0   72    2    2   22    0   38 2568
     2005    0    0    0    0    0    0 2248    0    0    0    0    8
     2006    0    0    0    0    0    0    0    0    0    0    0   46
     2007    0    0    0    0    0    0    0    0    2    0    0    2
     2008    0    0    0    0    0    0    0    0    0  678    0    0
     2010    0    0    0    0    0    0    0    0    0    0    0   50
     2011    0    0    0    0    0    0    0    0    0    0    0    6
     2012    0    0    0    0    0    0    0    0    0    0    0   50

$`191`
         endyear
startyear 2007 2010
     2002    0 4252
     2003    0   72
     2004    0   76
     2005    0   46
     2006    0   28
     2007   14   34
     2008    0   34
     2009    0  204
     2010    0   28

$`196`
         endyear
startyear 2007
     2000 1170
     2001    4
     2002    4
     2003   10
     2005    6
     2006   24
     2007    8

$`220`
         endyear
startyear 2001 2002 2004 2005 2006 2007 2008 2009 2010
     2000    0    0    0    0    0    0    0    0 1432
     2001    6    0    0    0    0    0    0    0    6
     2002    0   10    0    0    0    0    0    0   12
     2003    0    0    0    0    0    0    0    0   12
     2004    0    0    2    0    0    0    0    0  125
     2005    0    0    0    2    0    0    0    0   12
     2006    0    0    0    0    2    0    0    0    6
     2007    0    0    0    0    0    2    0    0   36
     2008    0    0    0    0    0    0    2    0    0
     2009    0    0    0    0    0    0    0    2    6
     2010    0    0    0    0    0    0    0    0    4

$`225`
         endyear
startyear 2002 2003 2007 2008
     2002    2 2796    0    0
     2003    0  354    0    0
     2005    0    0    0 1376
     2007    0    0   10   10

$`276`
< table of extent 0 x 0 >
chrMongeau commented 6 years ago

Done:

https://github.com/SWS-Methodology/faoswsTrade/blob/f6f14fc9bcf59b1073f3ce056348ed2199be6f1d/modules/complete_tf_cpc/main.R#L512-L517