Open pstoyanov opened 10 years ago
Thanks for your interest... as I'm sure you know there are many ways to skin a cat (so to speak). Feel free to take a look at the import scripts which are found in the scripts branch of this repo and get back with any feedback you may have for us. We're constantly improving our data collection/calculation methods.
Hello again, I did look at the scripts, and they seem to deal with steps after those in my question. I do not know Python but it seems that you are importing the already processed data, and calculating the respective aggregates, RCAs, growth rates, etc. Are you drawing data from some internal database rather than the original sources you mention? E.g. the Feenstra's WTF Stata/SAS files from http://cid.econ.ucdavis.edu? I am interested in the cleaning up steps (for the SITC data, I assume this is not needed for BACI). My goal was to compare your approach to my current work in matching data from Feenstra’s World Trade Flows database to Comtrade, the EU’s Comext, PWT and World Bank’s WDI. Below are a few detailed examples & questions. Apologies for getting it so long, I will break it up as separate issues if you want me to. Since they are related, I ended up with a giant post.
[Cosmetic issue] You may want to fix the files in the database dump – oec_2014-04-30.sql.bz2
is in the ./static/db/
folder and not in ./static/db/sql
; hs_yodp.csv.bz2
is in the sql folder and not in the csv folder; and there is no csv dump for oec_2014-04-30. It would also be nice to have MD5/SHA/whatever checksums to verify whether the data has changed without re-downloading the large files.
The remarks below are based on the oec_2014-04-30.sql file. I did, however, cut it up with sed to import just the sitc yodp table into R, as I do not have a MySQL server. Even though I think it is a fairly straightforward process, I may have made a mistake.
[Country codes] The country code for Canada is “211240” in Feenstra’s World Trade Flows (WTF), which you give as a source for data prior to 2001. Your scripts seem to use directly the ISO 3-char alpha codes, with the continent prefix. The step of matching WTF country codes to ISO codes is missing (at least I could not locate it in the scripts). As you point out in the updates section for Germany and Benelux, in some cases a single ISO code may correspond to several country names, and at least in theory, vice versa (the same ISO code may be reassigned to another country after a few years). You may want to include the year as part of your key for ISO codes matching. And I assume you are aware that merging country data is a very slippery slope – (1) in most cases with former socialist countries there is little info on the trade between the countries you merge and you end up with overstating their trade flows if you view them as a whole, and (2) there are too many similar cases, over a dozen come to mind only in Eastern/Central Europe and the former USSR…
[World totals] It is not clear what “World” means in your case. Is this a ‘true’ world total, including all flows (i.e. even ones which are only reported at a higher SITC level, 3- or 2-digit), or is this calculated as the sum of the individual flows in the database? My guess is that it is actually a leftover from the original dataset, since it is only present for some years. I think you should be more explicit and consistent (either define & include World for all years, or remove it altogether). Currently the sum of the individual flows matches the World flows for 1962:1973, there are slight (less than 1%) differences over 1974:1983, and no World totals from 1984 onwards. In the WTF there are World totals for all years, calculated as the sum of the individual flows in the dataset.
[SITC codes] The WTF uses a lot of artificial codes (which include A
s and X
s in the codes), which are not present in your database. Looking at the data, it seems that you have simply dropped them – which is OK I guess, since they are not actual products. But since they do not appear uniformly throughout the years, this may be related to the world totals issue above (you are dropping codes disproportionately), and makes the below even stranger.
[Number of flows] Probably the most puzzling of all – there are way too many individual flows. Take year 2000 for example. In the WTF the grand total world trade is 6,568,385,296,000
dollars. In OEC, the sum of the individual flows is 5,599,123,372,802
; or about 14.7% lower. That said, in the WTF there are 670,973
individual flows, after excluding flows where exporter or importer is “World”. In OEC there are 2,026,932
individual flows with origin_id!=”xxwld” & dest_id!=”xxwld” & export_val>0
. The differences are very small for 1962:1983, and then balloon after 1984. Also, starting 1984, the number of trading country pairs is much higher in OEC than in WTF, after being identical over 1962:1983.
Considering WTF carries quantity info, i.e. sometimes there are several records for one exporter/importer/year/product combo (with different quantity units), and OEC does not, I cannot explain this. When I have the time I maybe will go through the data country-by-country, but maybe I am missing something very basic.
Once again, sorry for the long write-up, don't take it as a rant ;)
Please consider posting the scripts used to generate the data from the raw sources, in addition to posting the database dumps and CSV files. A few notes re the World Trade Flows dataset (Feenstra et al. 2005):
Thanks for the good work :)