Closed hamgamb closed 2 years ago
Hi, Try to do a direct comparison on scatterplots. In particular, check whether one of the methods takes z-scores while the other one doesn't. Another possible source of differences is taking left vs. right eigenvectors when looking for the second eigenvector, which contains the complexity values, although people must have taken care of that already if it was ever happening. Also note that complexity values can be multiplied by -1 and still be "correct". Other than that, check for small differences in the source data, such as some small countries being removed, or some other data cleaning step. Best, Matias
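The checks suggested above (z-scores on both sides, a possible sign flip, then a direct comparison) can be sketched in a few lines of Python. This is a minimal sketch, not code from ecomplexity: `compare_eci` and the sample values are hypothetical, and each input is assumed to be a mapping from location code to ECI for a single year.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardize a list of numbers to mean 0 and (population) sd 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def pearson(x, y):
    """Pearson correlation, computed from the two z-scored series."""
    return mean(a * b for a, b in zip(zscore(x), zscore(y)))


def compare_eci(computed, published):
    """Compare two {location_code: eci} mappings for one year.

    Both series are z-scored, and the computed one is multiplied by -1
    when the two are negatively correlated, since the sign of an
    eigenvector is arbitrary.
    """
    common = sorted(set(computed) & set(published))
    x = [computed[c] for c in common]
    y = [published[c] for c in common]
    r = pearson(x, y)
    sign = -1.0 if r < 0 else 1.0
    zx = [sign * v for v in zscore(x)]
    zy = zscore(y)
    return {
        "n": len(common),
        "pearson_r": r,
        "sign_flipped": sign < 0,
        "max_abs_gap": max(abs(a - b) for a, b in zip(zx, zy)),
    }


# Hypothetical values: `published` is a flipped, rescaled copy of `computed`.
computed = {"AAA": 1.0, "BBB": 0.0, "CCC": -1.0}
published = {"AAA": -2.1, "BBB": -0.1, "CCC": 1.9}
report = compare_eci(computed, published)
```

If `max_abs_gap` stays well above zero after standardizing and sign-aligning, the gap is not explained by scaling or sign alone. The aligned pairs are also what one would put on the scatterplot.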
On Tue, Jul 13, 2021 at 3:54 AM Hamish Gamble @.***> wrote:
Apologies in advance for the not-so-reproducible example. I couldn't find a way around the name/email requirements of the dataverse. I am using reticulate in R to run ecomplexity.
Data published on the Harvard Economic Complexity Dataverse has pre-calculated complexity indicators. The country_hsproduct4digit_year data from https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/T4CHWJ/4RG21Y&version=3.0 has columns: location_id, product_id, year, export_value, import_value, export_rca, product_status, cog, distance, normalized_distance, normalized_cog, normalized_pci, export_rpop, is_new, hs_eci, hs_coi, pci, location_code, hs_product_code.
Using only the location_code, hs_product_code, export_value and year columns from that data as input to ecomplexity yields different values for all of the calculated indicators. As an example, the atlas data has an hs_eci for ABW in 1995 as -0.468138129. When calculating complexity indicators from the atlas data the eci for ABW in 1995 is calculated as -0.1471911.
Is the data published on the Harvard Dataverse created using a different method?
Thanks Matias for your reply.
The ECI values calculated by this package are normalized: I can confirm that the mean of ECI is 0 and the standard deviation of ECI is 1.
The pre-calculated ECI from the dataverse data is not normalized but is close:
year | mean ECI | sd ECI |
---|---|---|
1995 | -0.0138702 | 0.9869372 |
1996 | -0.0172142 | 0.9702815 |
1997 | -0.0013751 | 1.0293460 |
1998 | -0.0184518 | 1.0017776 |
1999 | -0.0009565 | 0.9855839 |
2000 | 0.0250194 | 0.9951708 |
2001 | 0.0522257 | 0.9701210 |
2002 | 0.0458795 | 0.9585416 |
2003 | 0.0377355 | 0.9721496 |
2004 | 0.0383981 | 0.9639836 |
2005 | 0.0233821 | 0.9737128 |
2006 | 0.0611096 | 0.9824983 |
2007 | 0.0534595 | 0.9712855 |
2008 | 0.0498620 | 0.9636734 |
2009 | 0.0881478 | 0.9712107 |
2010 | 0.0712081 | 0.9720490 |
2011 | 0.0646669 | 0.9753816 |
2012 | 0.0383435 | 1.0044073 |
2013 | 0.0426753 | 1.0085014 |
2014 | 0.0353947 | 0.9829397 |
2015 | 0.0392669 | 0.9958343 |
2016 | 0.0422366 | 0.9964626 |
2017 | 0.0275077 | 0.9895860 |
2018 | 0.0539169 | 0.9958367 |
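A per-year summary like the table above, and the per-year normalization it motivates, can be sketched as follows. This is an illustrative sketch only: the function names and sample rows are hypothetical, rows are assumed to be `(year, location_code, eci)` tuples, and the sample standard deviation (n - 1 denominator, as R's `sd` uses) is an assumption, since the thread does not say which convention either source applies.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize_by_year(rows):
    """Return {year: (mean_eci, sd_eci)} for a list of (year, loc, eci)."""
    by_year = defaultdict(list)
    for year, _loc, eci in rows:
        by_year[year].append(eci)
    # stdev() is the sample standard deviation (n - 1 denominator).
    return {y: (mean(v), stdev(v)) for y, v in sorted(by_year.items())}


def normalize_by_year(rows):
    """Z-score ECI within each year; `rows` must be a list (read twice)."""
    stats = summarize_by_year(rows)
    return [
        (year, loc, (eci - stats[year][0]) / stats[year][1])
        for year, loc, eci in rows
    ]


# Hypothetical rows, not values from the dataverse.
rows = [
    (1995, "AAA", 0.5), (1995, "BBB", -0.3), (1995, "CCC", 1.0),
    (1996, "AAA", 0.2), (1996, "BBB", -0.6), (1996, "CCC", 0.1),
]
before = summarize_by_year(rows)
after = summarize_by_year(normalize_by_year(rows))
```

After `normalize_by_year`, every year has mean 0 and standard deviation 1 by construction, so any remaining gap between two sources is not a scaling effect.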
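After normalizing, it's close, but not identical:

![image](https://user-images.githubusercontent.com/30914420/125543074-4e73d377-7059-40bd-8b0b-368ff2d559e4.png)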
As for any other data cleaning going on, I'm simply using the trade data that comes as part of the pre-calculated data from the dataverse, so I don't see how there can be any differences between the two. The number of locations and products by year in the indicators calculated by this package is identical to the number of locations and products by year in the dataverse data.
year | locs.calculated | prods.calculated | locs.source | prods.source |
---|---|---|---|---|
1995 | 231 | 1247 | 231 | 1247 |
1996 | 227 | 1247 | 227 | 1247 |
1997 | 227 | 1247 | 227 | 1247 |
1998 | 226 | 1247 | 226 | 1247 |
1999 | 226 | 1247 | 226 | 1247 |
2000 | 231 | 1248 | 231 | 1248 |
2001 | 233 | 1248 | 233 | 1248 |
2002 | 234 | 1248 | 234 | 1248 |
2003 | 233 | 1248 | 233 | 1248 |
2004 | 234 | 1248 | 234 | 1248 |
2005 | 233 | 1247 | 233 | 1247 |
2006 | 232 | 1248 | 232 | 1248 |
2007 | 233 | 1247 | 233 | 1247 |
2008 | 233 | 1247 | 233 | 1247 |
2009 | 233 | 1247 | 233 | 1247 |
2010 | 233 | 1245 | 233 | 1245 |
2011 | 235 | 1245 | 235 | 1245 |
2012 | 235 | 1246 | 235 | 1246 |
2013 | 237 | 1243 | 237 | 1243 |
2014 | 236 | 1242 | 236 | 1242 |
2015 | 235 | 1241 | 235 | 1241 |
2016 | 234 | 1240 | 234 | 1240 |
2017 | 236 | 1227 | 236 | 1227 |
2018 | 236 | 1225 | 236 | 1225 |
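The counts in the table above, together with a stricter value-level check, can be sketched as below. Matching counts do not by themselves show that the underlying export values match, so the second helper compares values key by key. Everything here is hypothetical: the function names, the tolerance, and the sample rows, which are assumed to be `(year, location_code, hs_product_code, export_value)` tuples.

```python
from collections import defaultdict


def counts_by_year(rows):
    """Return {year: (n_locations, n_products)} for trade rows."""
    locs, prods = defaultdict(set), defaultdict(set)
    for year, loc, prod, _value in rows:
        locs[year].add(loc)
        prods[year].add(prod)
    return {y: (len(locs[y]), len(prods[y])) for y in sorted(locs)}


def value_mismatches(a_rows, b_rows, tol=1e-9):
    """Return (keys present in only one table, keys whose values differ)."""
    a = {(y, l, p): v for y, l, p, v in a_rows}
    b = {(y, l, p): v for y, l, p, v in b_rows}
    only_one_side = set(a) ^ set(b)
    differing = {k for k in set(a) & set(b) if abs(a[k] - b[k]) > tol}
    return only_one_side, differing


# Hypothetical rows: same shape and counts, one export value changed.
source = [
    (1995, "AAA", "0101", 10.0),
    (1995, "AAA", "0102", 5.0),
    (1995, "BBB", "0101", 7.0),
]
used = [
    (1995, "AAA", "0101", 10.0),
    (1995, "AAA", "0102", 5.5),
    (1995, "BBB", "0101", 7.0),
]
same_counts = counts_by_year(source) == counts_by_year(used)
only_one_side, differing = value_mismatches(source, used)
```

In this toy case the counts agree while one value differs, which is the situation a count table alone cannot rule out.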
Hi Hamish, Good job. I'm not sure what could be causing the differences in ECI. Still, I would make sure the input dataset is really the same, as it may have the same number of rows/columns but differ in some of its values for some reason. Finally, the only time I saw something like this was when left (as opposed to right) eigenvectors were mistakenly used in computing ECI. I would check that specific step with extra care. I know that we checked we were correct on this point with the ecomplexity module. Shreyas and Andres Gomes know this step very well and may be able to point you to some useful material. Best, Matias
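To see why the left/right distinction matters, here is a toy illustration using NumPy. It is not the ecomplexity implementation: the 4x4 binary country-by-product matrix is made up, and the country-country matrix is built with the textbook formula (each shared product weighted by 1 / (diversity x ubiquity)). For a matrix of this form, the left eigenvector is the right eigenvector weighted by diversity, so the two standardize to different values.

```python
import numpy as np


def second_eigvec(mat):
    """Right eigenvector of `mat` for its second-largest eigenvalue."""
    vals, vecs = np.linalg.eig(mat)
    order = np.argsort(vals.real)[::-1]
    return vecs[:, order[1]].real


def standardize(vec, diversity):
    """Z-score a vector and fix its sign to correlate positively with diversity."""
    z = (vec - vec.mean()) / vec.std()
    if np.corrcoef(z, diversity)[0, 1] < 0:
        z = -z
    return z


# Hypothetical binary matrix: rows are countries, columns are products.
M = np.array(
    [
        [1, 1, 1, 1],
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 0, 0],
    ],
    dtype=float,
)
diversity = M.sum(axis=1)  # products per country
ubiquity = M.sum(axis=0)   # countries per product

# M_tilde[c, c'] = sum_p M[c, p] * M[c', p] / (diversity[c] * ubiquity[p])
M_tilde = (M / diversity[:, None]) @ (M / ubiquity[None, :]).T

right_eci = standardize(second_eigvec(M_tilde), diversity)
# Left eigenvectors of M_tilde are right eigenvectors of its transpose.
left_eci = standardize(second_eigvec(M_tilde.T), diversity)

print("right:", np.round(right_eci, 4))
print("left: ", np.round(left_eci, 4))
```

The two vectors rank the toy countries similarly but are not equal even after standardizing, so mixing them up produces values that are close to, but not identical with, the reference.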
I'll just add that the ECI calculated using the R package referenced in #11 does agree with the ECI calculated using this Python package. So perhaps something different is being done to the data on the dataverse?
Sorry for the super-late response @hamgamb, but if anyone else is looking for answers here, the short but possibly unsatisfying answer is that more data pre-processing goes into the dataverse data. The algorithms ultimately used to generate the PCI / ECI values are the same, and the differences you rightly call out are a result of that pre-processing. If you reach out to the team that manages the data uploaded to the dataverse (atlas.cid.harvard.edu), they might be able to give you exact details of the pre-processing.