Duplicated values in public/gscd_data.csv

WestonAnderson commented 1 year ago

There are duplicate values for some years in our final dataframe for four countries when we use the following command: df[df[df.columns[1:-2].values].duplicated(keep=False)]
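For readers unfamiliar with this check, here is a minimal sketch of what the command does on a toy dataframe. The column names and values are hypothetical and are not the actual gscd_data.csv schema; the only assumption carried over is that the first column and the last two columns are excluded from the comparison.

```python
import pandas as pd

# Hypothetical toy table; column names are assumptions, not the real schema.
df = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "country": ["Zambia", "Zambia", "Niger", "Niger"],
    "product": ["Maize", "Maize", "Millet", "Millet"],
    "year": [2001, 2001, 2001, 2002],
    "value": [10.0, 10.0, 5.0, 6.0],
    "source": ["a", "b", "c", "d"],
    "note": ["", "", "", ""],
})

# columns[1:-2] drops the first column and the last two columns.
key_cols = df.columns[1:-2]

# keep=False marks every member of a duplicated group, not just the repeats.
dupes = df[df[key_cols].duplicated(keep=False)]
print(dupes)
```

With keep=False both copies of a repeated record are returned, which makes it easier to compare them side by side.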

The countries are Afghanistan, Niger, Somalia, and Zambia. I have found the issue for Zambia, which is that there are multiple names for Maize in just one or two years. I will follow up on Niger as well.
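One possible way to surface this kind of naming problem is to count how many distinct product names share the same country, year, and value. This is only a sketch on made-up data: the column names and the alternative label "Maize (Corn)" are hypothetical, not taken from the actual file.

```python
import pandas as pd

# Hypothetical toy data; "Maize (Corn)" is an invented alternative label.
df = pd.DataFrame({
    "country": ["Zambia"] * 4,
    "year": [2000, 2001, 2001, 2002],
    "product": ["Maize", "Maize", "Maize (Corn)", "Maize"],
    "value": [100.0, 120.0, 120.0, 90.0],
})

# Count distinct product names for each (country, year, value) record.
names_per_record = df.groupby(["country", "year", "value"])["product"].nunique()

# Records carried under more than one name are candidates for cleanup.
suspect = names_per_record[names_per_record > 1]
print(suspect)
```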

Can you find the issue with the Afghanistan and Somalia data? Presumably there should be no duplicate values in this final public dataframe.

Best, Weston

gnodnooh commented 8 months ago

Hi @WestonAnderson, sorry for the late follow-up on this issue. The current final combined file is /public/hvstat_data.csv, and I have found no duplicated rows. Check these lines:

import pandas as pd
df = pd.read_csv('../public/hvstat_data.csv', index_col=0)
# No two rows may share the same values in every column except the last one (value).
assert df[df.columns[:-1]].duplicated(keep=False).sum() == 0

df.columns[:-1] selects every column except the last one, value.
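To illustrate why value is left out of the comparison, here is a small sketch on a hypothetical table (the column names are assumptions, with value as the last column). Two rows that describe the same record but report different values are not duplicates as whole rows, yet they are caught once value is excluded.

```python
import pandas as pd

# Hypothetical toy table with value as the last column.
df = pd.DataFrame({
    "country": ["Somalia", "Somalia", "Afghanistan"],
    "year": [2010, 2010, 2010],
    "value": [1.0, 2.0, 3.0],
})

key_cols = df.columns[:-1]  # every column except value

# Comparing whole rows misses the conflict because the values differ.
full_row_dupes = int(df.duplicated(keep=False).sum())

# Comparing only the key columns flags both conflicting rows.
key_dupes = int(df[key_cols].duplicated(keep=False).sum())

print(full_row_dupes, key_dupes)
```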

WestonAnderson commented 5 months ago

I believe this has been resolved, so I'm closing this out.

HarvestStat / HarvestStat

Duplicated values in public/gscd_data.csv #10