NRGI / resourcedata.org

EITI company payments CSV errors #30

Closed t-morrison closed 7 years ago

t-morrison commented 7 years ago

There are a significant number of errors in the EITI company payment CSV (here) where lines are broken by stray carriage returns. I estimate there are about 100 of these errors.

The unquoted / in some of the Mongolian payments appears to be one cause:

image

Something in this DRC payment is another:

image

Many sets of Kazakhstan payments have the leading 6.5 variables cut off for a particular company. See the values in row 11002, which are missing from rows 11003 to 11016. This happens many times for KAZ companies:

image

t-morrison commented 7 years ago

The KAZ issue also happens for Burkina Faso.

The issue is also present in the individual-level slices.

I don't see others outside of those four countries.

anderspeders commented 7 years ago

Can you review, Matt?

mattfullerton commented 7 years ago

@moman822 @anderspeders This always seems to happen in the "name_of_revenue_stream" field, where the text is sometimes split across multiple lines. This is valid CSV as long as the field is enclosed in double quotes (https://tools.ietf.org/html/rfc4180#page-3, Section 2, item 6).
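To illustrate the point about quoted line breaks, here is a minimal sketch using Python's standard `csv` module. The sample data and the helper names `parse_records` and `naive_line_count` are made up for illustration and are not taken from the actual EITI file or import script. It shows that a CSV-aware parser reads the split text as one record, while a tool that treats every physical line as a row does not.

```python
import csv
import io

# Hypothetical sample: the revenue stream text contains a line break
# inside a double-quoted field.
sample = 'country,name_of_revenue_stream,value\nMNG,"Royalty\n(mineral)",100\n'


def parse_records(text):
    """Parse CSV text into rows, honouring line breaks inside quoted fields."""
    return list(csv.reader(io.StringIO(text)))


def naive_line_count(text):
    """Count physical lines, which is what a line-based tool would see."""
    return len(text.splitlines())


rows = parse_records(sample)
print(len(rows))                 # logical records, header included
print(naive_line_count(sample))  # physical lines in the file
```

The two counts differ whenever a quoted field spans several lines, which is one way to spot the affected records.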

If it's causing problems somewhere, we can tell the script to strip newlines out of text fields, replacing them with a space or a dash.
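A rough sketch of what such a clean-up step could look like, assuming the data passes through Python's `csv` module. The function names `flatten_field` and `flatten_csv` are hypothetical and do not come from the real import script.

```python
import csv
import io
import re

# One or more line breaks, together with any whitespace around them.
_LINE_BREAKS = re.compile(r"\s*(?:\r\n|\r|\n)\s*")


def flatten_field(value, replacement=" "):
    """Replace each run of line breaks inside a field with `replacement`."""
    return _LINE_BREAKS.sub(replacement, value.strip())


def flatten_csv(text, replacement=" "):
    """Return CSV text in which no field contains a line break."""
    reader = csv.reader(io.StringIO(text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow([flatten_field(field, replacement) for field in row])
    return out.getvalue()


cleaned = flatten_csv('country,stream\nMNG,"Royalty\n(mineral)"\n')
```

The `replacement` argument can be any string, so switching from a space to a dash or another separator is a one-argument change.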

anderspeders commented 7 years ago

That would be great. Please proceed.

t-morrison commented 7 years ago

This needs to be changed as it prevents proper use of the data. See this screenshot of the data in Excel: image

mattfullerton commented 7 years ago

OK, will do line-end replacement. I think the ";" character might be best.

mattfullerton commented 7 years ago

It wasn't just the revenue stream field. I fixed it for the fields where it's likely to come up. Please check the combined file after the next import runs.

t-morrison commented 7 years ago

Still seeing some issues here. It's not the same line-break issue, but there are still some data inconsistencies, with data missing from some columns and all other columns shifted to the left.

See: Kazakhstan @ rows 10189 and 11759

I did not see any besides those two chunks.
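One possible way to locate rows like these automatically is to compare each record's field count with the header's. This is only a sketch: `find_misaligned_rows` is a hypothetical helper, and it assumes that shifted rows end up with fewer (or more) fields than the header, which may not hold if the file pads short rows with empty trailing fields.

```python
import csv
import io


def find_misaligned_rows(text):
    """Return (expected_field_count, [(record_number, field_count), ...]).

    Record numbers are 1-based and count the header as record 1. They match
    spreadsheet row numbers only if no field contains a line break.
    """
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader, None)
    if header is None:
        # Empty input: nothing to compare against.
        return 0, []
    expected = len(header)
    problems = []
    for record_number, row in enumerate(reader, start=2):
        if len(row) != expected:
            problems.append((record_number, len(row)))
    return expected, problems


# Hypothetical sample: the third record is one field short.
report = find_misaligned_rows("a,b,c\n1,2,3\n4,5\n6,7,8\n")
```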