OpenBudget / BudgetKey

Opening the Israeli Budget!
https://next.obudget.org

Combine lamas muni data from multiple years into one dataset #465

Open akariv opened 4 years ago

akariv commented 4 years ago

We're currently using a datafile from 2015.

There are newer files and ideally we'd like to have one dataset that spans multiple years.

https://www.cbs.gov.il/he/publications/Pages/2019/%D7%94%D7%A8%D7%A9%D7%95%D7%99%D7%95%D7%AA-%D7%94%D7%9E%D7%A7%D7%95%D7%9E%D7%99%D7%95%D7%AA-%D7%91%D7%99%D7%A9%D7%A8%D7%90%D7%9C-%D7%A7%D7%95%D7%91%D7%A6%D7%99-%D7%A0%D7%AA%D7%95%D7%A0%D7%99%D7%9D-%D7%9C%D7%A2%D7%99%D7%91%D7%95%D7%93-1999-2017.aspx

akariv commented 4 years ago

Thank you!

On Mon, Mar 30, 2020, 20:36 shirawerman notifications@github.com wrote:

I'm starting to work on this project -Shira


gpipman commented 4 years ago

Hi Adam, my name is Gustavo Pipman and I'm interested in volunteering at Hasadna. The budget issue, and the municipal data specifically, interested me. I worked through the DataFlows tutorial (DataFlows is a simple and intuitive way of building data processing flows). Then I took '2018.xlsx', one of the LAMAS files from the link you supplied, and generated a Datapackage with 5 resources, one for each worksheet in the workbook. I would appreciate more explicit guidelines on what you expect to receive. In the resources, the column titles are column_n, where n is the column number. In the spreadsheet, the titles are spread over several rows, and the upper rows contain merged cells. Attachment: data.zip (https://github.com/OpenBudget/BudgetKey/files/4499097/data.zip).

akariv commented 4 years ago

Hola Gustavo!

Great to have you on board :) The final output of this flow should be a single file with the following columns: 'city name', 'city symbol', 'year', <columns for all indicators>

The flow should use all source files and combine them into one big happy file :)

The indicator columns should not have generic column_X names but rather descriptive names based on the field name. We have a field-name translation routine which you can use / extract here: https://github.com/OpenBudget/budgetkey-data-pipelines/blob/master/datapackage_pipelines_budgetkey/pipelines/lamas/translate_headers.py
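The target layout described above ('city name', 'city symbol', 'year', then one column per indicator) can be sketched in plain Python. This is only an illustration of the output shape under stated assumptions, not the project's pipeline: `combine_years`, `to_csv`, and the sample indicator names are hypothetical, and it holds all rows in memory, which a real flow over the CBS files would want to avoid.

```python
import csv
import io

KEY_COLUMNS = ['city name', 'city symbol', 'year']


def combine_years(rows_by_year):
    """Merge {year: [row dicts]} into (header, rows) for one table.

    Indicator columns are collected in first-seen order, so a year that
    lacks an indicator simply leaves that cell empty.
    """
    indicators = []
    combined = []
    for year in sorted(rows_by_year):
        for row in rows_by_year[year]:
            tagged = dict(row, year=year)  # add the 'year' column
            for column in tagged:
                if column not in KEY_COLUMNS and column not in indicators:
                    indicators.append(column)
            combined.append(tagged)
    return KEY_COLUMNS + indicators, combined


def to_csv(header, rows):
    """Serialize the combined rows; missing indicators become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=header, restval='')
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

For example, `combine_years({2017: [...], 2018: [...]})` returns a header that starts with the three key columns, followed by every indicator seen in any year.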


gpipman commented 4 years ago

Attachment: municipal_titles.xlsx

Hi Adam, I worked on the file 2018.xlsx. In the attached file I extracted the title of each column for the different sheets in the workbook. The titles are in Hebrew, as in the original. Before translating the titles and appending them to the CSV files, I would appreciate it if you could take a look and send me your comments. Regards, Gustavo
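One possible way to derive a single title per column from header rows like these, where upper rows contain merged cells, is sketched below. It assumes merged cells arrive as a value in the leftmost cell of the span followed by empty cells, which is how many spreadsheet readers expose them; `flatten_headers` and the ' / ' separator are hypothetical choices, and the sample titles in the usage note are made up.

```python
def flatten_headers(header_rows):
    """Combine stacked header rows into one title per column.

    Every row except the last is forward-filled left to right, so a merged
    group title is inherited by the columns it spans. The last row holds
    the per-column titles and is left as-is.
    """
    width = max(len(row) for row in header_rows)
    cleaned = []
    for index, row in enumerate(header_rows):
        is_leaf_row = index == len(header_rows) - 1
        cells, last = [], ''
        for cell in list(row) + [None] * (width - len(row)):
            text = '' if cell is None else str(cell).strip()
            if text:
                last = text
            elif not is_leaf_row:
                text = last  # inherit the merged title from the left
            cells.append(text)
        cleaned.append(cells)

    titles = []
    for column in zip(*cleaned):
        parts = []
        for part in column:
            if part and part not in parts:  # skip blanks and repeats
                parts.append(part)
        titles.append(' / '.join(parts))
    return titles
```

With `[['Demography', None, 'Finance'], ['Population', 'Area', 'Income']]` as input, the group title 'Demography' is applied to both of the columns it spans.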

fluhus commented 4 years ago

Hi Adam et al., I'd like to help the obudget project. What's the status of this issue? How can I assist?

Amit

akariv commented 4 years ago

I think it's stalled - have a go at it :)


fluhus commented 4 years ago

What exactly do you need here?

All the data from all the years combined? Which spreadsheets from each file? Where is it used (or going to be used)?

akariv commented 4 years ago

Yes -

One spreadsheet with the data from all years and all files. Column names need to be in English - you can use the translation table here to help you with that:

https://github.com/OpenBudget/budgetkey-data-pipelines/blob/master/datapackage_pipelines_budgetkey/pipelines/lamas/translate_headers.py
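A minimal sketch of applying such a translation table to one row is shown below. The two mapping entries are hypothetical placeholders for illustration, not copied from translate_headers.py, and `translate_row` is an assumed helper name. Headers without a known translation are kept unchanged so they can be spotted and added to the table later.

```python
# Hypothetical entries for illustration only; the real table lives in
# translate_headers.py in the budgetkey-data-pipelines repo.
HEADER_TRANSLATIONS = {
    'שם הרשות': 'city name',
    'סמל הרשות': 'city symbol',
}


def translate_row(row, translations=HEADER_TRANSLATIONS):
    """Return a copy of the row with Hebrew headers replaced by English ones."""
    translated = {}
    for header, value in row.items():
        # Collapse runs of whitespace and line breaks inside header cells
        # before looking the header up.
        key = ' '.join(str(header).split())
        translated[translations.get(key, key)] = value
    return translated
```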


fluhus commented 4 years ago

I am looking into this. I ran into some difficulties opening the xls files. I am trying to open them with pandas. Is that what you'd recommend?

akariv commented 4 years ago

Our preferred method is to use the dataflows library.

You can find more information in the README file of the budgetkey-data-pipelines repo.


fluhus commented 4 years ago

Cool, I'll take a look.

fluhus commented 4 years ago

@akariv I am looking at the dataflows tutorial (https://github.com/datahq/dataflows/blob/master/TUTORIAL.md) and am not quite sure how to load an xls and get a simple in-memory object to work with. Do you have a concrete example for that?

akariv commented 4 years ago

dataflows is designed to process data files row by row, so that the entire file is never loaded into memory (memory is a limited resource on the machines we use, especially when other scrapers are running in parallel). Generally, you can give the 'load' processor an xls filename or URL and it will open and stream it for you.
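As a rough sketch of what that can look like (not taken from the budgetkey-data-pipelines repo): the flow below streams one sheet of a workbook through `load`, adds a `year` field with `add_field`, cleans each row as it passes, and writes the result with `dump_to_path`. The sheet number, resource name, output directory, and the `strip_strings` and `build_flow` helpers are assumptions for illustration, and the multi-row headers in the LAMAS sheets would need extra handling that is not shown here.

```python
def strip_strings(row):
    """Row processor: dataflows calls a function whose single parameter
    is named `row` once per streamed row, so only one row is held at a time."""
    for key, value in row.items():
        if isinstance(value, str):
            row[key] = value.strip()


def build_flow(source, year, sheet=1, out_dir='out'):
    """Build (but do not run) a flow for one sheet of one yearly workbook."""
    # Imported here so strip_strings can be reused without dataflows installed.
    from dataflows import Flow, load, add_field, dump_to_path

    return Flow(
        load(source, name=f'lamas_{year}', sheet=sheet),  # file path or URL
        add_field('year', 'integer', default=year),       # same year on every row
        strip_strings,
        dump_to_path(f'{out_dir}/lamas_{year}'),          # datapackage + CSV
    )
```

Calling `build_flow('2018.xlsx', 2018).process()` would run the flow in streaming fashion, assuming a local copy of that workbook. `Flow.results()` is the call that collects rows into in-memory lists, which is what the row-by-row design is meant to avoid for large files.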
