cid-harvard / py-ecomplexity

Python package to compute economic complexity and associated variables
MIT License

ecomplexity output does not conform with atlas dataverse results using R reticulate #21

Closed hamgamb closed 2 years ago

hamgamb commented 2 years ago

Apologies in advance for the not-so-reproducible example. I couldn't find a way around the name/email requirements of the dataverse. I am using reticulate in R to run ecomplexity.

Data published on the Harvard Economic Complexity Dataverse has pre-calculated complexity indicators. The country_hsproduct4digit_year data from https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/T4CHWJ/4RG21Y&version=3.0 has columns: location_id, product_id, year, export_value, import_value, export_rca, product_status, cog, distance, normalized_distance, normalized_cog, normalized_pci, export_rpop, is_new, hs_eci, hs_coi, pci, location_code, hs_product_code.

Using only the location_code, hs_product_code, export_value, and year columns from that data as input to ecomplexity yields different values for all of the calculated indicators. For example, the atlas data reports an hs_eci of -0.468138129 for ABW in 1995, but when I recalculate complexity indicators from that same atlas data, the eci for ABW in 1995 comes out as -0.1471911.

Is the data published on the Harvard Dataverse created using a different method?
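For reference, the first step ecomplexity performs on these columns is the Balassa revealed comparative advantage (RCA). This is a minimal pandas sketch of that step, not the package's exact code; column names follow the dataverse file:

```python
import pandas as pd

def balassa_rca(df: pd.DataFrame) -> pd.Series:
    """RCA_cp = (x_cp / x_c) / (x_p / x_world), computed within each year."""
    x_c = df.groupby(["year", "location_code"])["export_value"].transform("sum")
    x_p = df.groupby(["year", "hs_product_code"])["export_value"].transform("sum")
    x_w = df.groupby("year")["export_value"].transform("sum")
    return (df["export_value"] / x_c) / (x_p / x_w)

# tiny illustrative dataset (made-up values)
trade = pd.DataFrame({
    "year": [1995] * 4,
    "location_code": ["ABW", "ABW", "USA", "USA"],
    "hs_product_code": ["0101", "0201", "0101", "0201"],
    "export_value": [10.0, 90.0, 500.0, 500.0],
})
trade["export_rca"] = balassa_rca(trade)
print(trade)
```

An RCA above 1 then typically marks the country as an effective exporter of the product (Mcp = 1) before the complexity calculation proper.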

matuteiglesias commented 2 years ago

Hi,

Try a direct comparison on scatterplots. In particular:

- Check whether one of the methods takes z-scores while the other doesn't.
- Another possible source of differences is taking left vs. right eigenvectors when extracting the 2nd eigenvector containing the complexity values, although people have presumably taken care of this already if it was ever happening.
- Note that complexity values can be multiplied by -1 and still be "correct".
- Other than that, check for small differences in the source data, such as some small countries being removed, or some other data-cleaning step.

Best, Matias
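The z-score and sign-flip checks suggested here can be done mechanically. A minimal sketch, assuming the two ECI series are already aligned on the same country index:

```python
import numpy as np

def eci_agreement(a: np.ndarray, b: np.ndarray) -> float:
    """Z-score both series and return |correlation|.

    A value near 1.0 means the two methods agree up to
    normalization and an arbitrary overall sign flip.
    """
    za = (a - a.mean()) / a.std()
    zb = (b - b.mean()) / b.std()
    r = np.corrcoef(za, zb)[0, 1]
    return abs(r)  # a -1 sign flip is harmless, so compare |r|

# toy check: a scaled, shifted, sign-flipped copy agrees perfectly
a = np.array([0.3, -1.2, 0.8, 2.1, -0.5])
b = -2.0 * a + 5.0
print(eci_agreement(a, b))  # close to 1.0
```

If |r| is near 1 but the raw values differ, the discrepancy is just normalization or sign; if |r| is well below 1, the inputs or the algorithm genuinely differ.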


hamgamb commented 2 years ago

Thanks Matias for your reply.

The ECI calculated by this package is normalized: I can confirm that its mean is 0 and its standard deviation is 1.

The pre-calculated ECI from the dataverse data is not normalized, but it is close:

| year | mean ECI | sd ECI |
|------|----------|--------|
| 1995 | -0.0138702 | 0.9869372 |
| 1996 | -0.0172142 | 0.9702815 |
| 1997 | -0.0013751 | 1.0293460 |
| 1998 | -0.0184518 | 1.0017776 |
| 1999 | -0.0009565 | 0.9855839 |
| 2000 | 0.0250194 | 0.9951708 |
| 2001 | 0.0522257 | 0.9701210 |
| 2002 | 0.0458795 | 0.9585416 |
| 2003 | 0.0377355 | 0.9721496 |
| 2004 | 0.0383981 | 0.9639836 |
| 2005 | 0.0233821 | 0.9737128 |
| 2006 | 0.0611096 | 0.9824983 |
| 2007 | 0.0534595 | 0.9712855 |
| 2008 | 0.0498620 | 0.9636734 |
| 2009 | 0.0881478 | 0.9712107 |
| 2010 | 0.0712081 | 0.9720490 |
| 2011 | 0.0646669 | 0.9753816 |
| 2012 | 0.0383435 | 1.0044073 |
| 2013 | 0.0426753 | 1.0085014 |
| 2014 | 0.0353947 | 0.9829397 |
| 2015 | 0.0392669 | 0.9958343 |
| 2016 | 0.0422366 | 0.9964626 |
| 2017 | 0.0275077 | 0.9895860 |
| 2018 | 0.0539169 | 0.9958367 |
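Renormalizing the dataverse ECI within each year is a one-line groupby transform; a sketch with made-up values, assuming the dataverse column names:

```python
import pandas as pd

df = pd.DataFrame({
    "year":   [1995, 1995, 1995, 1996, 1996, 1996],
    "hs_eci": [-0.47, 0.10, 1.20, -0.30, 0.05, 0.90],
})

# z-score ECI within each year so that mean = 0 and sd = 1 per year
df["hs_eci_norm"] = df.groupby("year")["hs_eci"].transform(
    lambda s: (s - s.mean()) / s.std()
)
print(df)
```

One small caveat when comparing implementations: pandas `.std()` uses the sample standard deviation (ddof=1) by default, while NumPy's `.std()` uses the population one (ddof=0), which alone can make two "normalized" series differ slightly.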

After normalizing, it's close, but not identical:

![image](https://user-images.githubusercontent.com/30914420/125543074-4e73d377-7059-40bd-8b0b-368ff2d559e4.png)

Regarding any other data cleaning, I'm simply using the trade data that comes as part of the pre-calculated data from the dataverse, so I don't see how there can be any differences between the two inputs. The number of locations and products per year in the indicators calculated by this package is identical to the number of locations and products per year in the dataverse data:

| year | locs.calculated | prods.calculated | locs.source | prods.source |
|------|-----------------|------------------|-------------|--------------|
| 1995 | 231 | 1247 | 231 | 1247 |
| 1996 | 227 | 1247 | 227 | 1247 |
| 1997 | 227 | 1247 | 227 | 1247 |
| 1998 | 226 | 1247 | 226 | 1247 |
| 1999 | 226 | 1247 | 226 | 1247 |
| 2000 | 231 | 1248 | 231 | 1248 |
| 2001 | 233 | 1248 | 233 | 1248 |
| 2002 | 234 | 1248 | 234 | 1248 |
| 2003 | 233 | 1248 | 233 | 1248 |
| 2004 | 234 | 1248 | 234 | 1248 |
| 2005 | 233 | 1247 | 233 | 1247 |
| 2006 | 232 | 1248 | 232 | 1248 |
| 2007 | 233 | 1247 | 233 | 1247 |
| 2008 | 233 | 1247 | 233 | 1247 |
| 2009 | 233 | 1247 | 233 | 1247 |
| 2010 | 233 | 1245 | 233 | 1245 |
| 2011 | 235 | 1245 | 235 | 1245 |
| 2012 | 235 | 1246 | 235 | 1246 |
| 2013 | 237 | 1243 | 237 | 1243 |
| 2014 | 236 | 1242 | 236 | 1242 |
| 2015 | 235 | 1241 | 235 | 1241 |
| 2016 | 234 | 1240 | 234 | 1240 |
| 2017 | 236 | 1227 | 236 | 1227 |
| 2018 | 236 | 1225 | 236 | 1225 |
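Counts like the ones above can be reproduced with a groupby over `nunique`; a sketch on a tiny made-up frame, assuming the dataverse column names:

```python
import pandas as pd

def yearly_counts(df: pd.DataFrame) -> pd.DataFrame:
    """Distinct locations and products observed in each year."""
    return df.groupby("year").agg(
        locs=("location_code", "nunique"),
        prods=("hs_product_code", "nunique"),
    )

source = pd.DataFrame({
    "year":            [1995, 1995, 1995, 1996],
    "location_code":   ["ABW", "ABW", "USA", "USA"],
    "hs_product_code": ["0101", "0201", "0101", "0101"],
})
print(yearly_counts(source))
```

Identical counts rule out whole rows/columns being dropped, but not cell-level differences in export values, which a merge on (year, location, product) followed by a value comparison would catch.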
matuteiglesias commented 2 years ago

Hi Hamish,

Good job. I'm not sure what could be causing the differences in ECI. Still, I would make sure the input dataset is really the same: it may have the same number of rows and columns but differ in some of its values for some reason.

Finally, the only time I saw something like this was when we had a confusion between left and right eigenvectors being used in computing ECI. I would check that specific step with extra care. I know we verified that the ecomplexity module is correct on this point. Shreyas and Andres Gomes know this step very well and may be able to point you to some useful material.

Best, Matias
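The eigenvector step in question can be checked in isolation. This is a sketch of the standard eigenvector formulation of ECI on a toy binary Mcp matrix (not the package's exact code): take the eigenvector for the second-largest eigenvalue of the row-stochastic matrix M~ = D_c^-1 M D_p^-1 M^T, then z-score it and fix the arbitrary sign, e.g. by correlating with diversity. Passing `Mcc.T` to `eig` instead would give the left eigenvectors, which is exactly the kind of mix-up described above.

```python
import numpy as np

def eci_from_mcp(M: np.ndarray) -> np.ndarray:
    """ECI via the 2nd (right) eigenvector of Mcc = D_c^-1 M D_p^-1 M^T."""
    kc = M.sum(axis=1)                     # diversity of each country
    kp = M.sum(axis=0)                     # ubiquity of each product
    Mcc = (M / kc[:, None]) @ (M / kp).T   # row-stochastic country-country matrix
    vals, vecs = np.linalg.eig(Mcc)
    order = np.argsort(-vals.real)         # largest eigenvalue (=1) is trivial
    v = vecs[:, order[1]].real             # 2nd eigenvector carries complexity
    eci = (v - v.mean()) / v.std()         # z-score: mean 0, sd 1
    if np.corrcoef(eci, kc)[0, 1] < 0:     # eigenvector sign is arbitrary:
        eci = -eci                         # fix it against diversity
    return eci

# toy Mcp: 3 countries x 3 products
M = np.array([[1., 1., 1.],
              [1., 1., 0.],
              [0., 0., 1.]])
print(eci_from_mcp(M))
```

Swapping left for right eigenvectors changes the resulting ranking, which would produce exactly the "close but not identical" pattern after z-scoring.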


hamgamb commented 2 years ago

I'll just add that the ECI calculated using the R package referenced in #11 does agree with the ECI calculated using this Python package. So perhaps something different is being done to the data on the dataverse?

shreyasgm commented 2 years ago

Sorry for the super-late response @hamgamb, but if anyone else is looking for answers here: the short and possibly unsatisfying answer is that more data pre-processing goes into the dataverse. The underlying algorithms used to generate the PCI/ECI values are the same, and the differences you rightly call out are a result of that pre-processing. If you reach out to the team that manages the data uploaded to the dataverse (atlas.cid.harvard.edu), they may be able to give you the exact details of the pre-processing.