Closed malexan closed 6 years ago
Abstract:
The solution is found and partly implemented. This partial implementation already gives the following improvement for TL data (number of not mapped trade flows): 2013: from 684 to 632 2014: from 7565 to 2216
No changes for ES data.
Details:
It turns out the task to set endyear
to 2050 for all not-replaced codes is not trivial. Our "best before" data is linked to HS ranges. We can't simply check is a specific "outdated" HS range replaced by a new one similar range. The new HS range can differ from historical one by its boundaries, but both old and new ranges can have some common HS codes inside.
We have to move from ranges to individual HS codes and check each of them on "best before" year. Not a big problem, but a couple of other issues emerges. For example, with our current approach for moving from ranges to individual codes we get a mapping table with 45*10^9 records, what requires at least 350 Gb of RAM :) We can split the table by acceptable pieces or use a data base. Then after "best before" correction we should go back to HS ranges somehow (or work with individual HS codes, but it is also a question).
The proposed solution is the following:
sel1FCL
chooses one fcl code additionally taking into account the "best before". If there are several matches and one of them have suitable year range, it is taken. If there is no record with suitable year, other matching records are considered. Fully implemented.
BTW, running time now is 42 minutes (parallel). Probably twice as long.
Some countries are still having no HS > FCL map, though their mapping exist. The issue is again that the endyear
variable is set to a number lower than the analised year. See, e.g., the results for 2015 (tldata_not_area_in_fcl_mapping
is the set of countries that were dropped off the raw file because they had no mapping (while actually the reported issue applies; the only countries for which a mapping is actually missing are 7
and 276
, i.e., ,Angola and Sudan, respectively):
> sapply(as.character(unique(tldata_not_area_in_fcl_mapping$reporter)), function(x) hsfclmap3 %>% filter(area == x) %$% table(startyear, endyear))
$`2`
endyear
startyear 2011 2012
2009 2 2942
2010 0 2
2012 0 10
$`7`
< table of extent 0 x 0 >
$`16`
endyear
startyear 2004 2005 2006 2007
2004 6 2 6 3690
2005 0 26 0 12
2006 0 0 4 4
2007 0 0 0 40
$`32`
endyear
startyear 2000 2002 2006 2008 2009 2011 2012
1998 0 0 4 0 0 20 1546
1999 0 0 0 0 0 0 92
2000 12 0 2 0 0 0 32
2001 0 0 2 0 0 0 32
2002 0 4 0 0 0 0 24
2003 0 0 0 0 0 0 12
2004 0 0 0 0 0 0 12
2005 0 0 0 0 0 0 150
2006 0 0 2 0 0 2 32
2007 0 0 0 2 1654 0 0
2009 0 0 0 0 6 0 0
2010 0 0 0 0 0 0 44
2011 0 0 0 0 0 0 30
2012 0 0 0 0 0 0 34
$`90`
endyear
startyear 1995 1996 1997 2008
1995 2 4 4 1278
1996 0 0 0 2
2008 0 0 0 22
$`103`
endyear
startyear 2006 2009
2005 1196 0
2009 0 2536
$`113`
endyear
startyear 1998 1999 2000 2002 2003 2004 2005 2006 2007 2008 2011 2012
1998 1 0 1 0 966 0 4 0 0 660 0 0
1999 0 1 0 0 0 0 0 0 0 0 0 0
2000 0 0 0 0 1 0 0 0 0 0 0 0
2002 0 0 0 3 0 0 0 0 0 0 0 0
2003 0 0 0 0 1 0 0 0 0 0 0 0
2004 0 0 0 0 0 72 2 2 22 0 38 2568
2005 0 0 0 0 0 0 2248 0 0 0 0 8
2006 0 0 0 0 0 0 0 0 0 0 0 46
2007 0 0 0 0 0 0 0 0 2 0 0 2
2008 0 0 0 0 0 0 0 0 0 678 0 0
2010 0 0 0 0 0 0 0 0 0 0 0 50
2011 0 0 0 0 0 0 0 0 0 0 0 6
2012 0 0 0 0 0 0 0 0 0 0 0 50
$`191`
endyear
startyear 2007 2010
2002 0 4252
2003 0 72
2004 0 76
2005 0 46
2006 0 28
2007 14 34
2008 0 34
2009 0 204
2010 0 28
$`196`
endyear
startyear 2007
2000 1170
2001 4
2002 4
2003 10
2005 6
2006 24
2007 8
$`220`
endyear
startyear 2001 2002 2004 2005 2006 2007 2008 2009 2010
2000 0 0 0 0 0 0 0 0 1432
2001 6 0 0 0 0 0 0 0 6
2002 0 10 0 0 0 0 0 0 12
2003 0 0 0 0 0 0 0 0 12
2004 0 0 2 0 0 0 0 0 125
2005 0 0 0 2 0 0 0 0 12
2006 0 0 0 0 2 0 0 0 6
2007 0 0 0 0 0 2 0 0 36
2008 0 0 0 0 0 0 2 0 0
2009 0 0 0 0 0 0 0 2 6
2010 0 0 0 0 0 0 0 0 4
$`225`
endyear
startyear 2002 2003 2007 2008
2002 2 2796 0 0
2003 0 354 0 0
2005 0 0 0 1376
2007 0 0 10 10
$`276`
< table of extent 0 x 0 >
@chrMongeau (email message Tue, 18 Apr 2017 15:28:19 +0000):