datasets / publicbodies

A database of public bodies such as government departments, ministries etc.
http://publicbodies.org
MIT License

Nepal datasets updated #139

Closed: nikeshbalami closed this pull request 2 years ago

nikeshbalami commented 2 years ago

List of changes made

augusto-herrmann commented 2 years ago

Thank you for the awesome work, @nikeshbalami! I intend to look into this as soon as possible.

Did you run any validation against the Frictionless Data Package table schema? In any case, if you don't mind, I'd like to handle issue #135 before merging this, as it will automatically verify schema validity for all data in all incoming pull requests.
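In the meantime, a quick local check is possible with the frictionless Python library. This is only a minimal sketch; the descriptor path below is an assumption, so point it at whatever datapackage.json the repository actually ships:

# Minimal local validation sketch (pip install frictionless).
# "datapackage.json" is an assumed path; use the repo's real descriptor.
from frictionless import validate

report = validate("datapackage.json")
print(report.valid)  # True when every resource matches its table schema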

augusto-herrmann commented 2 years ago

Looking at the id column, and considering that these ids are used to form URLs, I think they should probably be transliterated into ASCII characters. This is not documented in the schema, but we already do that for the Brazilian data (e.g. "Advocacia-Geral da União" becomes "br/advocacia-geral-da-uniao", without the accented characters).

We can do that easily with the python-slugify library.

For example:

In [1]: from slugify import slugify

In [2]: slugify('ação')
Out[2]: 'acao'

In [3]: slugify('भिमसेनथापा-गाउँपालिका')
Out[3]: 'bhimsenthaapaa-gaaunpaalikaa'

So "भिमसेनथापा-गाउँपालिका'" would get the id "np/bhimsenthaapaa-gaaunpaalikaa".

What do you think?

nikeshbalami commented 2 years ago

Thank you, @augusto-herrmann. I haven't validated the data yet. Yes, let's validate it with the Frictionless Data Package table schema.

Also, the idea of transforming the id column into ASCII characters is great. I had thought about it, but kept the ids in Nepali because some of the old datasets have Nepali slugs like this one: http://publicbodies.org/np/%E0%A4%B2%E0%A5%8B%E0%A4%95-%E0%A4%B8%E0%A5%87%E0%A4%B5%E0%A4%BE-%E0%A4%86%E0%A4%AF%E0%A5%8B%E0%A4%97

Can you help transform it?

augusto-herrmann commented 2 years ago

Of course I can. The code in the example above already takes care of this transformation. You have to pip install python-slugify beforehand, though.
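For instance, a rough sketch of applying it across the whole id column could look like this (the CSV path and the "np/" id layout are assumptions; adjust them to the actual Nepal file in this PR):

import csv
from pathlib import Path
from slugify import slugify  # from the python-slugify package

# Hypothetical path to the Nepal CSV; use the actual file added in this PR.
src = Path("data/np.csv")
rows = list(csv.DictReader(src.open(encoding="utf-8")))
for row in rows:
    # Keep the "np/" prefix and slugify only the name part of the id.
    name = row["id"].split("/", 1)[-1]
    row["id"] = "np/" + slugify(name)
with src.open("w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)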

If you have difficulty, I can update this PR later when I find some time to do it.

By the way, do you think that changing the unicode ids already present in the database would be a problem?

nikeshbalami commented 2 years ago

Sure @augusto-herrmann, please update the PR wherever possible.

Regarding the second question, I am not sure, but it probably will not be a problem.

augusto-herrmann commented 2 years ago

PR #142 changes the data validation system to Frictionless Repository. As soon as we merge that I intend to look into this one.

augusto-herrmann commented 2 years ago

@nikeshbalami, I don't think I can update this PR, so I'll just accept it and apply the changes discussed above right afterwards.

After that, we still need to add the import script and schedule its execution. How often do you think would be adequate for this data? For comparison, we have been updating the Brazil data once a week.