NRGI / resourcedata.org

CKAN
Explore generation of a single EITI summary data file per country #28

Open anderspeders opened 7 years ago

anderspeders commented 7 years ago

Why

In addition to the split files currently available, it would be great to explore the following route for creating a single file when the data match.

Please see the instructions from Development Gateway below.

There's a chance to eliminate the double counting in the flattened files under the following conditions:

1. The Excel sheet should have disaggregated company information.
2. The government-reported revenue and the company-reported revenue should match.
3. Then you group by GFS code + name of revenue stream.
4. The rows of type "company" in column G, aggregated by value_reported, should be equal to the row of type "government" in the same column.

If all those conditions are met, then you can copy the value in column name_of_recieving_agency from the "government" type row to the "company" type rows that have it empty, and delete the former, effectively eliminating the double counting.
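The conditions above can be sketched roughly as follows. This is only an illustration of one reading of the instructions, not the Development Gateway implementation: the keys `gfs_code`, `name_of_revenue_stream` and `type` (standing in for column G) are assumed names, the sample values are made up, and the check is applied per revenue-stream group rather than per whole report.

```python
from collections import defaultdict
from math import isclose


def collapse_double_counting(rows, rel_tol=1e-9):
    """Drop the "government" row of every revenue stream whose company rows sum to it.

    Column names other than value_reported and name_of_recieving_agency
    are assumptions made for this sketch.
    """
    # Condition 3: group by GFS code + name of revenue stream.
    groups = defaultdict(list)
    for row in rows:
        groups[(row["gfs_code"], row["name_of_revenue_stream"])].append(row)

    result = []
    for group in groups.values():
        company = [r for r in group if r["type"] == "company"]
        government = [r for r in group if r["type"] == "government"]
        # Conditions 1, 2 and 4: company rows exist and their total
        # equals the single government row.
        matches = (
            len(company) > 0
            and len(government) == 1
            and isclose(
                sum(r["value_reported"] for r in company),
                government[0]["value_reported"],
                rel_tol=rel_tol,
            )
        )
        if not matches:
            result.extend(group)  # leave unmatched groups untouched
            continue
        agency = government[0]["name_of_recieving_agency"]
        for r in company:
            filled = dict(r)  # copy so the input rows are not mutated
            if not filled.get("name_of_recieving_agency"):
                filled["name_of_recieving_agency"] = agency
            result.append(filled)
    return result


# Made-up sample rows: the first stream matches, the second does not.
rows = [
    {"gfs_code": "A", "name_of_revenue_stream": "Income tax", "type": "company",
     "value_reported": 60.0, "name_of_recieving_agency": ""},
    {"gfs_code": "A", "name_of_revenue_stream": "Income tax", "type": "company",
     "value_reported": 40.0, "name_of_recieving_agency": ""},
    {"gfs_code": "A", "name_of_revenue_stream": "Income tax", "type": "government",
     "value_reported": 100.0, "name_of_recieving_agency": "Revenue Authority"},
    {"gfs_code": "B", "name_of_revenue_stream": "Royalties", "type": "company",
     "value_reported": 30.0, "name_of_recieving_agency": ""},
    {"gfs_code": "B", "name_of_revenue_stream": "Royalties", "type": "government",
     "value_reported": 50.0, "name_of_recieving_agency": "Ministry of Mines"},
]
flat = collapse_double_counting(rows)
```

In this sample the "Income tax" government row is dropped and its agency is copied onto the two company rows, while the "Royalties" rows are kept as they are because 30 does not equal 50.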

What

@mattfullerton Would you be able to take a look at this and see what share of EITI files would pass this test?

We should then discuss whether to publish this as a supplementary flat file or simply replace the split files whenever the single file can be generated.

Notes

mattfullerton commented 7 years ago

First look at this: 1921 reports have disaggregated company information and 624 do not. The last 3 rows will need a good deal more time to figure out.

anderspeders commented 7 years ago

Ok, please prioritise the RGI source tool. We can keep this pending given that it seems a bit tricky to solve.

anderspeders commented 7 years ago

I believe that we can close this now as this has been done, right?

mattfullerton commented 7 years ago

It hasn't, no

mattfullerton commented 7 years ago

@anderspeders Could you make a comment on how urgent/important this is? And (or @moman822) could you send a small example for the separated files that illustrate points 1-4 with the data?

anderspeders commented 7 years ago

Following the call today, please provide an approximate cost estimate for this, and we can then move this item forward.

anderspeders commented 7 years ago

Trying to recap where we are on this ticket. I have not been able to follow the conversation in Slack and do not see anything captured on GitHub.

My understanding, however, is that most files cannot be generated as a single file at the moment because the data do not fully match. If that is the case for more than half of the countries, I suggest that we keep the current setup as is and wait to implement this until the data from the EITI API reaches a high quality.

Thoughts are welcome. Until then, I am removing the priority label.

mattfullerton commented 7 years ago

Preliminary results are such:

| All matched % | None matched % | Partially matched % |
| --- | --- | --- |
| 6.896551724137931 | 43.8871473354232 | 49.21630094043887 |

I have already posted the results per transaction in Slack and will post an update on that and a summary per report. We could dig a bit deeper to see if there's still something we're (=I'm) missing, but the fact that the method works very well for some reports and partially for others makes me think we're doing it right.
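For reference, a per-report breakdown like the one above could be derived from per-transaction results along these lines. This is a minimal sketch under assumptions: each report is represented as a list of booleans (one per revenue stream, `True` when the company rows matched the government row), and the report names and flags below are invented for illustration, not the actual data.

```python
from collections import Counter


def classify_report(flags):
    """Label a report by how many of its revenue streams matched."""
    if flags and all(flags):
        return "all"
    if not any(flags):
        return "none"  # also covers a report with no streams at all
    return "partial"


def match_shares(reports):
    """Return the percentage of reports in each category."""
    if not reports:
        return {"all": 0.0, "none": 0.0, "partial": 0.0}
    counts = Counter(classify_report(flags) for flags in reports.values())
    total = len(reports)
    return {key: 100 * counts[key] / total for key in ("all", "none", "partial")}


# Invented example: four reports with per-stream match flags.
reports = {
    "report-1": [True, True],
    "report-2": [False, False],
    "report-3": [True, False],
    "report-4": [False],
}
shares = match_shares(reports)
```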