RDBinns / datactrl

Making the UK register of data controllers more useable
Apache License 2.0

Add cleaned up, reformatted example xml #1

Closed: mrchrisadams closed this 10 years ago

mrchrisadams commented 10 years ago

Hi Reuben,

I've added a slightly cleaned-up version of the sample XML output, so it's a little clearer what information is in the Nature_of_Work_description node.

Because this appears to be basic HTML, I'm thinking it might be possible to parse the contents with something like Beautiful Soup (http://www.crummy.com/software/BeautifulSoup/), add the structure back in, and store it somewhere, either in a database or in a set of flat files.
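A minimal sketch of that idea, under loudly stated assumptions: the record layout, the `Organisation_name` element, and the bold-heading-then-paragraph HTML below are all made up for illustration, and the real register output may look different. To keep it dependency-free it uses Python's standard-library `html.parser` as a stand-in for Beautiful Soup; the same approach should carry over.

```python
import json
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical record: element names other than Nature_of_Work_description
# and the HTML inside it are assumptions, not the register's actual format.
SAMPLE_XML = """<record>
  <Organisation_name>Example Ltd</Organisation_name>
  <Nature_of_Work_description>&lt;b&gt;Reasons for processing&lt;/b&gt;&lt;p&gt;We process personal information to provide services.&lt;/p&gt;&lt;b&gt;Who the information is shared with&lt;/b&gt;&lt;p&gt;Suppliers and regulators.&lt;/p&gt;</Nature_of_Work_description>
</record>"""


class SectionParser(HTMLParser):
    """Treat each <b>...</b> as a heading and group the text that follows it."""

    def __init__(self):
        super().__init__()
        self.sections = {}      # heading -> list of text fragments
        self._heading = None    # heading currently being filled
        self._in_bold = False
        self._bold_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self._in_bold = True
            self._bold_text = []

    def handle_endtag(self, tag):
        if tag == "b":
            self._in_bold = False
            self._heading = "".join(self._bold_text).strip()
            self.sections.setdefault(self._heading, [])

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_bold:
            self._bold_text.append(data)
        elif self._heading is not None:
            self.sections[self._heading].append(text)


def parse_record(xml_text):
    """Pull the embedded HTML out of one record and add the structure back in."""
    root = ET.fromstring(xml_text)
    # ElementTree unescapes &lt;b&gt; to <b>, so this is plain HTML text.
    html = root.findtext("Nature_of_Work_description", default="")
    parser = SectionParser()
    parser.feed(html)
    parser.close()
    return {"name": root.findtext("Organisation_name"), "sections": parser.sections}


record = parse_record(SAMPLE_XML)
# One JSON document per record would work as a flat file or a database row.
print(json.dumps(record, indent=2))
```

Writing each record out as JSON keeps both storage options open: the same dictionaries could be dumped to flat files or inserted into a database later.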

RDBinns commented 10 years ago

If that's possible that would be amazing, because then we could have up-to-date data (although we would still lose the differentiation of entries by purpose, which was present in the earlier format). I have a little bit of experience with Beautiful Soup and ScraperWiki but haven't attempted anything as big as this before.

On Wed, Jan 22, 2014 at 12:15 PM, Chris Adams notifications@github.com wrote:

You can merge this Pull Request by running

git pull https://github.com/mrchrisadams/datactrl master

Or view, comment on, or merge it at:

https://github.com/RDBinns/datactrl/pull/1

Commit Summary

  • Add cleaned up, reformatted example xml

File Changes

  • A cleaned_new_format.example.xml (114): https://github.com/RDBinns/datactrl/pull/1/files#diff-0

Patch Links:
