alexandersimoes / oec

The website powering the Observatory of Economic Complexity
atlas.media.mit.edu
157 stars 52 forks source link

Please add the data manipulation scripts #42

Open pstoyanov opened 10 years ago

pstoyanov commented 10 years ago

Please consider posting the scripts used to generate the data from the raw sources, in addition to posting the database dumps and CSV files. A few notes re the World Trade Flows dataset (Feenstra et al. 2005):

  1. The conversion between the country codes in the WTF to ISO3 alone has separate value going well outside the scope of the Observatory.
  2. It is also not clear how you go from the WTF ???A and ???X codes to the purely numeric ones you use. (A lot of the codes in the WTF are not 'true' 4-digit SITC codes but aggregates). This is crucial to understanding the data after 2001 (e.g. when you supplement the WTF with data from COMTRADE).
  3. It seems that the totals in your dataset (calculated as sum of all flows) are rather different than the world totals as reported in the WTF (the bilateral trade flows, wtf_bilat.zip). This seems to be for 1984 onwards, where you do not have the World as a partner in the dataset. The site therefore cannot generate e.g. bilateral flows World to World.

Thanks for the good work :)

alexandersimoes commented 10 years ago

Thanks for your interest... as I'm sure you know there are many ways to skin a cat (so to speak). Feel free to take a look at the import scripts which are found in the scripts branch of this repo and get back with any feedback you may have for us. We're constantly improving our data collection/calculation methods.

pstoyanov commented 10 years ago

Hello again, I did look at the scripts, and they seem to deal with steps after those in my question. I do not know Python but it seems that you are importing the already processed data, and calculating the respective aggregates, RCAs, growth rates, etc. Are you drawing data from some internal database rather than the original sources you mention? E.g. the Feenstra's WTF Stata/SAS files from http://cid.econ.ucdavis.edu? I am interested in the cleaning up steps (for the SITC data, I assume this is not needed for BACI). My goal was to compare your approach to my current work in matching data from Feenstra’s World Trade Flows database to Comtrade, the EU’s Comext, PWT and World Bank’s WDI. Below are a few detailed examples & questions. Apologies for getting it so long, I will break it up as separate issues if you want me to. Since they are related, I ended up with a giant post.

[Cosmetic issue] You may want to fix the files in the database dump – oec_2014-04-30.sql.bz2 is in the ./static/db/ folder and not in ./static/db/sql; hs_yodp.csv.bz2 is in the sql folder and not in the csv folder; and there is no csv dump for oec_2014-04-30. It would also be nice to have MD5/SHA/whatever checksums to verify whether the data has changed without re-downloading the large files.

The remarks below are based on the oec_2014-04-30.sql file. I did, however, cut it up with sed to import just the sitc yodp table into R, as I do not have a MySQL server. Even though I think it is a fairly straightforward process, I may have made a mistake.

[Country codes] The country code for Canada is “211240” in Feenstra’s World Trade Flows (WTF), which you give as a source for data prior to 2001. Your scripts seem to use directly the ISO 3-char alpha codes, with the continent prefix. The step of matching WTF country codes to ISO codes is missing (at least I could not locate it in the scripts). As you point out in the updates section for Germany and Benelux, in some cases a single ISO code may correspond to several country names, and at least in theory, vice versa (the same ISO code may be reassigned to another country after a few years). You may want to include the year as part of your key for ISO codes matching. And I assume you are aware that merging country data is a very slippery slope – (1) in most cases with former socialist countries there is little info on the trade between the countries you merge and you end up with overstating their trade flows if you view them as a whole, and (2) there are too many similar cases, over a dozen come to mind only in Eastern/Central Europe and the former USSR…

[World totals] It is not clear what “World” means in your case. Is this a ‘true’ world total, including all flows (i.e. even ones which are only reported at a higher SITC level, 3- or 2-digit), or is this calculated as the sum of the individual flows in the database? My guess is that it is actually a leftover from the original dataset, since it is only present for some years. I think you should be more explicit and consistent (either define & include World for all years, or remove it altogether). Currently the sum of the individual flows matches the World flows for 1962:1973, there are slight (less than 1%) differences over 1974:1983, and no World totals from 1984 onwards. In the WTF there are World totals for all years, calculated as the sum of the individual flows in the dataset.

[SITC codes] The WTF uses a lot of artificial codes (which include As and Xs in the codes), which are not present in your database. Looking at the data, it seems that you have simply dropped them – which is OK I guess, since they are not actual products. But since they do not appear uniformly throughout the years, this may be related to the world totals issue above (you are dropping codes disproportionately), and makes the below even stranger.

[Number of flows] Probably the most puzzling of all – there are way too many individual flows. Take year 2000 for example. In the WTF the grand total world trade is 6,568,385,296,000 dollars. In OEC, the sum of the individual flows is 5,599,123,372,802; or about 14.7% lower. That said, in the WTF there are 670,973 individual flows, after excluding flows where exporter or importer is “World”. In OEC there are 2,026,932 individual flows with origin_id!=”xxwld” & dest_id!=”xxwld” & export_val>0. The differences are very small for 1962:1983, and then balloon after 1984. Also, starting 1984, the number of trading country pairs is much higher in OEC than in WTF, after being identical over 1962:1983. Considering WTF carries quantity info, i.e. sometimes there are several records for one exporter/importer/year/product combo (with different quantity units), and OEC does not, I cannot explain this. When I have the time I maybe will go through the data country-by-country, but maybe I am missing something very basic.

Once again, sorry for the long write-up, don't take it as a rant ;)