DFID / aid-platform-beta

Previous version of the DFID Development Tracker that uses MongoDB, Neo4J and Elasticsearch to host IATI data
MIT License

Complicated app? #185

Open ckreutz opened 10 years ago

ckreutz commented 10 years ago

Hello! I am considering using your app, but I somehow have difficulty understanding the infrastructure and concept. Frankly, I do not understand why it is so complicated. Why do you need Java, Ruby and Node.js? Why don't you work with an existing standard framework?

For example, I have developed this project: https://github.com/ckreutz/offene-entwicklungshilfe. It is a simple Flask application that generates plain HTML from a SQLite database for upload to a server.

Why don't you work with something simple such as Jekyll? Why isn't the data parsing a completely separate task, so that I have the freedom to choose my own frontend, or vice versa? I guess I do not understand your concept in detail, but if you expect some forks, it should be way easier.

johnadamsDFID commented 10 years ago

Hi Christian

Thanks for raising this issue.

I agree that the app is complex, but there are firm reasons for the technology choices. The app consists essentially of a Loader that pulls IATI XML files from the Registry into a Neo4J graph database. The Aggregator then pulls data into a MongoDB database, aggregating information where necessary. The Loader/Aggregator layer is written in Scala.
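
To give a flavour of what the Loader does, here is a minimal, hypothetical sketch in Scala, assuming the Neo4j 2.x-era embedded Java API and scala-xml - the file path, the "Activity" label and the property names are illustrative rather than our actual schema:

```scala
// Hypothetical Loader sketch: parse one Registry XML file with scala-xml and
// write one Activity node per <iati-activity> into an embedded Neo4J store.
// The path, label and property keys below are illustrative only.
import org.neo4j.graphdb.DynamicLabel
import org.neo4j.graphdb.factory.GraphDatabaseFactory
import scala.xml.XML

object Loader extends App {
  val db  = new GraphDatabaseFactory().newEmbeddedDatabase("data/graph.db")
  val doc = XML.loadFile("xml/dfid-activities.xml") // one file from the Registry

  val tx = db.beginTx()
  try {
    for (activity <- doc \\ "iati-activity") {
      val node = db.createNode(DynamicLabel.label("Activity"))
      node.setProperty("iati-identifier", (activity \ "iati-identifier").text.trim)
      node.setProperty("title", (activity \ "title").text.trim)
    }
    tx.success() // mark the transaction as committable
  } finally {
    tx.close()
  }
  db.shutdown()
}
```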

The front-end static HTML pages are deployed using a Middleman build, which requires Ruby.

Finally, Node.js is required to run the TileStream maps.

We recognise that the current app is complex, and we have also published early code for a replacement architecture that is currently in development - you can see that code at https://github.com/DFID/devtracker-new. This new architecture has completely separate Synchroniser (data), API and Site layers.

A further possibility for the future is to develop a Site layer on top of the API provided by the IATI DataStore (https://github.com/IATI/iati-datastore), and we will be working with the IATI team to test this.

I hope this helps.

johnadamsDFID commented 10 years ago

Oh, I'm also happy to receive any further suggestions for streamlining the app. We don't think it is finished yet!

ckreutz commented 10 years ago

Hi John,

thanks for your quick reply. I know the XML parsing alone is a tricky task. Where would this movement be if we had plain CSV dumps? Instead, each app has to find a way to extract this XML data. That was a huge strategic mistake, but that has already been discussed elsewhere.

But I do not understand why, for example, you need Neo4J for such simple diagrams. That can be done with JSON and many great JS libraries. And then on top of it you have MongoDB instead of a simple SQL DB, which could be easily exported to SQLite or others, so people have more freedom to change the infrastructure of the app. Why do I need a whole server/DB setup for a visualization of a few thousand rows of data? Sorry, but to me this looks like over-engineering.

Anyway, thanks for your explanation. The Ruby parser sounds interesting and I will take a closer look at it. But I am afraid I am not willing to invest the time to set up the whole infrastructure for a data parsing and visualization project. We need far easier apps to get this movement going.

Best, Christian

ckreutz commented 10 years ago

Oh, I forgot to say: the new version looks simpler and better. I will follow that.

johnadamsDFID commented 10 years ago

Just a warning that the new version is not fully developed - especially the API and Site pages. There's quite a bit of work to do, particularly on the XQuery into the XML.

Have you looked at the IATI Datastore? That converts the XML into Postgres and puts an API on top. Mark Brough has built an app that puts a Devtracker front end on top of it.

ckreutz commented 10 years ago

Thanks for the hint about Postgres. I will check that. Yes, I know Mark is doing some great work here. ;-)

kouphax commented 10 years ago

Hi Christian,

I'm the original developer of devtracker - hopefully I can answer some of your questions. Like John, I admit the project is more complex than it needs to be. It evolved fast under constantly changing needs, and we are addressing the complexity with the new architecture. What was once fit for purpose is less so now. I've included my comments inline below.

> thanks for your quick reply. I know the XML parsing alone is a tricky task. Where would this movement be if we had plain CSV dumps? Instead, each app has to find a way to extract this XML data. That was a huge strategic mistake, but that has already been discussed elsewhere.

I'm not aware of any IATI discussions, as I'm not part of IATI or DFID, but CSV is an incredibly flat structure, whereas IATI data, in its current form, has the potential to be considerably more structured. In my previous research into CSV representations, the result for most queries was either an incredibly large amount of duplication or an increased number of requests - for example, an activity with three sectors and four transactions flattens to twelve rows, each repeating the activity metadata. Admittedly, this was only a cursory exploration of IATI and CSV.

> But I do not understand why, for example, you need Neo4J for such simple diagrams.

We needed a way to take the entire IATI dataset (thousands of individual XML files), or just a subset, and query it as a cohesive whole. The site output is a distillation of the data contained within devtracker.

> That can be done with JSON and many great JS libraries.

I think you are confusing the chart output on the static site with the actual persisted backend data. Devtracker is capable of holding the current IATI dataset (~2-3GB of data). I'd be keen to know which JSON libraries support querying such a large dataset. Secondly, how would you get the data from thousands of IATI-formatted XML files into JSON? At a high level this is exactly what we are doing, albeit in a more roundabout way than we would like right now.
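
To make that concrete: the naive version of "XML into JSON" is easy for a handful of files - something like the hand-rolled Scala sketch below, where the folder name and the fields extracted are made up. Doing this across thousands of Registry files, and then querying the result, is where a real store earns its keep.

```scala
// Naive XML-to-JSON sketch: walk a folder of IATI XML files and emit one JSON
// object per activity. Fine for a few files; not a query layer for ~2-3GB.
import java.io.File
import scala.xml.XML

object XmlToJson extends App {
  // Minimal JSON string escaping for the two fields we extract.
  def esc(s: String) = s.replace("\\", "\\\\").replace("\"", "\\\"")

  val files = Option(new File("xml").listFiles()).getOrElse(Array.empty)
  val activities = for {
    file     <- files.toSeq if file.getName.endsWith(".xml")
    activity <- XML.loadFile(file) \\ "iati-activity"
  } yield {
    val id    = (activity \ "iati-identifier").text.trim
    val title = (activity \ "title").text.trim
    s"""{"id": "${esc(id)}", "title": "${esc(title)}"}"""
  }
  println(activities.mkString("[", ",\n", "]"))
}
```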

> And then on top of it you have MongoDB instead of a simple SQL DB, which could be easily exported to SQLite or others, so people have more freedom to change the infrastructure of the app.

Neo4J houses our raw dataset, collected from all the IATI XML files, and allows us to query that data in a centralised way. This data, however, isn't in a format that is useful for consumption by the site. There is a certain level of aggregation that needs to happen (country budgets, project budgets and spends, project types, etc.), so we take the data from Neo4J, perform those aggregations, and store the results in MongoDB - see the sketch below. MongoDB was chosen because, for this sort of read-only, variably structured data, it's easier to put in and easier to query. SQL databases would require schema modelling, an overhead we don't need. As for the export - Mongo supports exporting datasets, but the idea behind DevTracker is that you can actively decide which datasets from IATI you want to load. Agreed, it's a bit more coupled than it needs to be.
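
As a rough illustration of that Neo4J-to-Mongo hop, here's a hypothetical Scala sketch assuming the Neo4j 2.2-era GraphDatabaseService.execute API and the classic Mongo Java driver; the Cypher query, the BUDGET relationship, and the database/collection names are invented for the example:

```scala
// Hypothetical Aggregator sketch: run one Cypher aggregation over the graph
// and store each result row as a document in MongoDB. All names illustrative.
import com.mongodb.{BasicDBObject, MongoClient}
import org.neo4j.graphdb.factory.GraphDatabaseFactory
import scala.collection.JavaConverters._

object Aggregator extends App {
  val graph   = new GraphDatabaseFactory().newEmbeddedDatabase("data/graph.db")
  val budgets = new MongoClient().getDB("devtracker").getCollection("country-budgets")

  // Sum budgets per recipient country (invented schema for illustration).
  val result = graph.execute(
    """MATCH (a:Activity)-[:BUDGET]->(b:Budget)
      |RETURN a.`recipient-country` AS country, sum(b.value) AS total""".stripMargin)

  for (row <- result.asScala) {
    budgets.insert(
      new BasicDBObject("country", row.get("country"))
        .append("total", row.get("total")))
  }
  graph.shutdown()
}
```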

> Why do I need a whole server/DB setup for a visualization of a few thousand rows of data?

It's not aimed at a few thousand rows of data. While the output aggregation has a few collections with a few thousand documents in them, there is a lot more going on under the hood. That said, Neo4J runs embedded so you don't need to install anything, and Mongo is just a sudo apt-get install mongodb away.

> Sorry, but to me this looks like over-engineering.

If visualisation is all you need then yes, it may be. But DevTracker has a bigger remit than the generated DevTracker site: a unified, queryable API across all IATI data that supports the loose nature of the schema and is tolerant of inconsistencies in published data.

> Anyway, thanks for your explanation. The Ruby parser sounds interesting and I will take a closer look at it. But I am afraid I am not willing to invest the time to set up the whole infrastructure for a data parsing and visualization project. We need far easier apps to get this movement going.

DevTracker is a one-stop shop for many things, and this is where the complexity lies. There are two things that could be done here:

  1. Turn DevTracker into a SaaS offering where people could register, specify the datasets they want, and have a site generated for them without the background configuration problems. It's possible, but the low quality of data in the published registry and the loose IATI schema make this difficult (there is a lot of code in DevTracker specifically to work around differing assumptions in published data).
  2. Extract bits of DevTracker and make them completely isolated. Again, anything that makes inferences on the data will never be perfect for others' needs until such time as IATI becomes stricter about how data is published.

I was thinking the other day of a DevTracker lite. You'd clone a repo, drop XML files into the xml folder of the repo, run the contained app (Rails or Sinatra, for example), and you'd get a site driven from that data - a bit like your approach, except more generic and using IATI data instead of a transformed dataset. A toy sketch of the idea follows.
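
Not Ruby, but to make the shape of that idea concrete, here's a toy sketch in Scala using only the JDK's built-in HTTP server - every name in it is made up:

```scala
// Toy "DevTracker lite": drop IATI XML files into ./xml, run this, and get a
// generated activity list at http://localhost:8080/. Purely illustrative.
import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import java.io.File
import java.net.InetSocketAddress
import scala.xml.XML

object DevTrackerLite extends App {
  // Re-read the xml folder on every request so newly dropped files show up.
  def page(): String = {
    val files = Option(new File("xml").listFiles()).getOrElse(Array.empty)
    val rows = for {
      file     <- files.toSeq if file.getName.endsWith(".xml")
      activity <- XML.loadFile(file) \\ "iati-activity"
    } yield s"<li>${(activity \ "title").text.trim}</li>"
    s"<html><body><h1>Activities</h1><ul>${rows.mkString}</ul></body></html>"
  }

  val server = HttpServer.create(new InetSocketAddress(8080), 0)
  server.createContext("/", new HttpHandler {
    def handle(exchange: HttpExchange): Unit = {
      val body = page().getBytes("UTF-8")
      exchange.sendResponseHeaders(200, body.length)
      exchange.getResponseBody.write(body)
      exchange.close()
    }
  })
  server.start()
  println("DevTracker lite running on http://localhost:8080/")
}
```

Run it, drop files into xml/, refresh the page - no database, no build step, at the cost of re-parsing everything per request.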

Hope that helps clear up some of the thinking.

bill-anderson commented 10 years ago

Christian, I see you are using OECD CRS data. This dataset contains historical, annually aggregated statistics. The IATI standard provides current, activity-based data with more granular financial transactions, and a range of other data - including forward-looking budgets, sub-national geography, conditions, results, multiple sector breakdowns, etc.

CRS data fits naturally into a flat file. IATI data doesn't - hence the choice to use XML.

Given that Germany has now started publishing data to IATI, and plans to improve the granularity and quality of this data (see http://www.bmz.de/en/zentrales_downloadarchiv/wege_und_akteure/Transparency/Germany_Common_Standard_Implementation_Schedule.xls), I would have thought it would be in your interest to consider using the IATI data in addition to that from the CRS.

markbrough commented 10 years ago

Hey @ckreutz

There was some discussion about why XML was chosen over CSV on the iati-technical list a while ago: https://groups.google.com/d/msg/iati-technical/ga_4UjObP2Y/jDr-BRbV66YJ

I basically agree with this. I think it's good that the complex nature of projects can be represented in a single file rather than in a series of relational CSVs (which I don't think would actually be much easier to use, and which possibly have some significant disadvantages in terms of consumption). This allows many results, many sectors, many documents, etc.

But yeah, there's definitely a need to lower the technical barrier to consuming the data for non-geeks - I think the Datastore and devtracker are both good examples of that, in different ways. And there's also a need to improve the consistency and quality of the data, which we're working on (as is the secretariat)!

ckreutz commented 10 years ago

Hello all!

Thanks so much for your comments. :-) I am sure there are many good technical reasons why CSV was not feasible for this data standard, and I agree with many of the points you raise. But I am also an aid worker. I worked, for example, two years in Cairo, Egypt, where I got to know the problem at first hand. Nobody knew what the other organizations were doing. There was no overview, and a clear waste of resources. What I needed was just a simple list (e.g. CSV) of the projects and of who I could potentially cooperate with. I needed something simple, not a complex XML format.

I will write a blog post about this in the coming days: a major problem, in my opinion, is that technology often comes first and user needs second. Data portals are made for the 1% of geeks and not the 99% of normal people. Put a non-tech-savvy aid worker (probably 95% of them) in front of a data portal and they will not understand much of it. I guess they will have to wait for good apps to be developed.

@bill-anderson of course I am aware of the difference between IATI and CRS. ;-) But again, as an aid worker I need to know what happened in past years. Staff turnover in organizations is so high that after three years you often have no personnel available to tell you what happened before; consultants write reports on the same issues that were covered just a few years ago. So I believe CRS is very valuable until IATI data reaches some years back. I am working these days on a platform integrating the two datasets. By the way, the German data is still disappointing, as BMZ left out NGO funding etc. I was in meetings with them before, and at least they have now got started. I will integrate it into our project soon.

@kouphax thanks so much for the explanation. Sorry if I partly misunderstood the devtracker concept; maybe that should be highlighted more clearly in the documentation. If this application can handle the whole dataset, then it makes much more sense to me: for the ability to work with the whole IATI dataset, the setup is well worth the effort. The light version definitely sounds good, given that I see people using OpenSpending to upload datasets. Although I wonder, here again, what the actual needs are: how can this data help aid workers in their daily work, or empower beneficiaries to hold them more accountable? Lots to learn.

Thanks again for all your replies and clarifications. Looking forward to the new version of the devtracker.