OpenAgFunding / development

For developing responses to current gaps in the availability and usability of open data on funding for agriculture and food security.

ACTION: build up a list of IATI activities that include the document-link element #17

Open stevieflow opened 8 years ago

stevieflow commented 8 years ago

Several stories point towards analysis of existing (agriculture-related) IATI activities that include the `document-link` element, where at least some of the linked documents are accessible.

The task here is to:

stevieflow commented 8 years ago

@rolfkleef

rolfkleef commented 8 years ago

I created a gist at https://gist.github.com/rolfkleef/dff381cb738adf2a42bf1604046e76b4 with a list of document URLs and a subset of IATI data with the minimum info:

iati-activities - iati-activity - {iati-identifier, reporting-org, sector, document-link}
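As a rough illustration of that structure (a minimal sketch, not the script actually used to produce the gist), the reduction could look like the following. It assumes each `document-link` carries its address in a `url` attribute, as in the IATI standard; the sample data and the function names `minimal_subset` and `document_urls` are made up for this example.

```python
import copy
import xml.etree.ElementTree as ET

# Child elements to keep per activity, as listed in the comment above.
KEEP = ("iati-identifier", "reporting-org", "sector", "document-link")

# Made-up sample data, for illustration only.
SAMPLE = """
<iati-activities>
  <iati-activity>
    <iati-identifier>XX-EXAMPLE-001</iati-identifier>
    <reporting-org ref="XX-EXAMPLE"/>
    <title><narrative>Example activity with a document</narrative></title>
    <sector code="31120"/>
    <document-link url="https://example.org/docs/report.pdf" format="application/pdf"/>
  </iati-activity>
  <iati-activity>
    <iati-identifier>XX-EXAMPLE-002</iati-identifier>
    <reporting-org ref="XX-EXAMPLE"/>
    <sector code="31120"/>
  </iati-activity>
</iati-activities>
"""


def minimal_subset(xml_text):
    """Keep only activities that have a document-link, reduced to KEEP."""
    source = ET.fromstring(xml_text)
    result = ET.Element("iati-activities")
    for activity in source.iter("iati-activity"):
        # Skip activities without an activity-level document-link.
        if activity.find("document-link") is None:
            continue
        slim = ET.SubElement(result, "iati-activity")
        for child in activity:
            if child.tag in KEEP:
                slim.append(copy.deepcopy(child))
    return result


def document_urls(tree):
    """List the url attribute of every document-link in the tree."""
    return [
        link.get("url")
        for link in tree.iter("document-link")
        if link.get("url")
    ]


if __name__ == "__main__":
    subset = minimal_subset(SAMPLE)
    print(ET.tostring(subset, encoding="unicode"))
    print(document_urls(subset))
```

Only the first sample activity survives, because the second has no `document-link`.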

rolfkleef commented 8 years ago

I've got some 14 GB in 11,254 documents, selected as potential "real documents" based on the declared format:

format=("application/pdf","application/vnd.ms-excel","application/vnd.openxmlformats-officedocument.spreadsheetml.sheet","application/msword","application/vnd.ms-word.document.macroEnabled.12","application/vnd.ms-word.document.macroEnabled.13","application/vnd.openxmlformats-officedocument.wordprocessingml.template")
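For anyone wanting to apply the same selection elsewhere, here is a minimal sketch of the filter in Python. It assumes the declared format is the value of the `format` attribute on `document-link`; the name `is_real_document` and the case-insensitive comparison are choices made for this example, not necessarily what the original script did.

```python
# Formats copied verbatim from the list above, lower-cased for comparison.
REAL_DOCUMENT_FORMATS = frozenset(
    f.lower()
    for f in (
        "application/pdf",
        "application/vnd.ms-excel",
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        "application/msword",
        "application/vnd.ms-word.document.macroEnabled.12",
        "application/vnd.ms-word.document.macroEnabled.13",
        "application/vnd.openxmlformats-officedocument.wordprocessingml.template",
    )
)


def is_real_document(declared_format):
    """True if the declared format is one of the selected document types."""
    if not declared_format:
        return False
    # Assumption: compare case-insensitively and ignore stray whitespace.
    return declared_format.strip().lower() in REAL_DOCUMENT_FORMATS


if __name__ == "__main__":
    for fmt in ("application/pdf", "text/html", None):
        print(fmt, is_real_document(fmt))
```

Note that this only trusts what publishers declare; a link declared as `application/pdf` can still return an HTML error page.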

Available for inspection or download at https://www.we-collaborate.net/owncloud/s/FwUwIgr88wC8wFM

There's also an extensive log of the attempts to download them all: quite a few 404s, etc. Quite a few of the PDFs are scanned documents (contracts, etc.).

The pseudo-IATI file (with sectors, reporting orgs, etc.) contains the links; each linked document shows up as a file in folders named after its domain and path, e.g. www.domain.org/folder/item.pdf
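To look up the downloaded file for a given link, the folder layout can be approximated as below. This is a sketch under assumptions: the exact naming rules of the download tool are not stated in this thread, so the handling of query strings, percent-encoding and empty paths (the hypothetical `index.html` fallback) is a guess, and `local_path_for` is a name invented here.

```python
from urllib.parse import unquote, urlsplit


def local_path_for(url):
    """Map a document URL to a relative path of the form host/folder/file."""
    parts = urlsplit(url)
    host = parts.hostname  # already lower-cased by urlsplit
    if not host:
        raise ValueError(f"no host in URL: {url!r}")
    # Drop empty, '.' and '..' segments so the path stays inside the corpus.
    segments = [
        s for s in unquote(parts.path).split("/") if s not in ("", ".", "..")
    ]
    if not segments:
        # Assumption: a bare domain is stored under a default file name.
        segments = ["index.html"]
    return "/".join([host, *segments])


if __name__ == "__main__":
    print(local_path_for("http://www.domain.org/folder/item.pdf"))
```

The query string is ignored here, so two links that differ only in their query would map to the same path.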

@stevieflow: I can add the scripts somewhere in this (or another repository); and is there a place to store the corpus?

stevieflow commented 8 years ago

Hi @rolfkleef great, thanks

Yes, please to:

> I can add the scripts somewhere in this (or another repository); and is there a place to store the corpus?

A new repo in this org?

rolfkleef commented 8 years ago

OK, except GitHub recommends keeping a repo under 1 GB (https://help.github.com/articles/what-is-my-disk-quota/), so the documents need to live somewhere else. They can stay on my ownCloud for now.

> If your repository exceeds 1GB, you might receive a polite email from GitHub Support requesting that you reduce the size of the repository to bring it back down.

I'll put the files from that gist into a repository!

mikecastro commented 8 years ago

@stevieflow @rolfkleef In my experience of working with budget documents, it may be worth keeping this document repository. Governments delete or replace documents, and we lose their contents. Would this also be part of the documentation site that we discussed in the June 9th meeting?

timgdavies commented 7 years ago

We've postponed the documentation site for the moment, but we're unlikely to host IATI documents as part of that.

We did use the documents to test some automated coding of documents against AGROVOC, which brought mixed results.