Open stevieflow opened 8 years ago
@rolfkleef
I created a gist at https://gist.github.com/rolfkleef/dff381cb738adf2a42bf1604046e76b4 with a list of document URLs and a subset of IATI data with the minimum info:
iati-activities - iati-activity - {iati-identifier, reporting-org, sector, document-link}
I've got some 14GB in 11254 documents, selecting potential "real documents" based on the declared format:
format=("application/pdf","application/vnd.ms-excel","application/vnd.openxmlformats-officedocument.spreadsheetml.sheet","application/msword","application/vnd.ms-word.document.macroEnabled.12","application/vnd.ms-word.document.macroEnabled.13","application/vnd.openxmlformats-officedocument.wordprocessingml.template")
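To make the selection step concrete, here is a minimal Python sketch of filtering `document-link` elements by their declared `format` attribute. It is not the script actually used for this corpus; the function name `select_document_links` and the sample activity are made up for illustration, and the format set is copied from the list above.

```python
import xml.etree.ElementTree as ET

# Declared formats treated as potential "real documents" (copied from the list above).
DOCUMENT_FORMATS = {
    "application/pdf",
    "application/vnd.ms-excel",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/msword",
    "application/vnd.ms-word.document.macroEnabled.12",
    "application/vnd.ms-word.document.macroEnabled.13",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.template",
}


def select_document_links(xml_text, formats=DOCUMENT_FORMATS):
    """Return (iati-identifier, url) pairs for document links with a wanted format.

    Hypothetical helper: only the declared format attribute is checked,
    the files themselves are not inspected.
    """
    root = ET.fromstring(xml_text)
    selected = []
    for activity in root.iter("iati-activity"):
        identifier = (activity.findtext("iati-identifier") or "").strip()
        for link in activity.iter("document-link"):
            url = link.get("url")
            if url and link.get("format") in formats:
                selected.append((identifier, url))
    return selected


# Made-up sample data, shaped like the minimal subset described above.
SAMPLE = """
<iati-activities>
  <iati-activity>
    <iati-identifier>XM-EXAMPLE-1</iati-identifier>
    <document-link url="https://www.example.org/folder/item.pdf" format="application/pdf"/>
    <document-link url="https://www.example.org/page.html" format="text/html"/>
  </iati-activity>
</iati-activities>
"""

if __name__ == "__main__":
    for identifier, url in select_document_links(SAMPLE):
        print(identifier, url)
```

Because the filter trusts the declared format, a mislabelled link (for example an HTML page declared as `application/pdf`) would still be selected.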
Available for inspection or download at https://www.we-collaborate.net/owncloud/s/FwUwIgr88wC8wFM
There's also the extensive log of attempting to download them all: quite a few 404s, etc. Also, there are quite a few scanned documents as PDFs (contracts, etc).
The pseudo-IATI file with sectors, reporting orgs, etc. contains the document links; each link corresponds to a downloaded file stored in a folder named after its domain, e.g. www.domain.org/folder/item.pdf
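As a rough illustration of that folder layout, the sketch below maps a document URL to a relative path of the form domain/folder/item.pdf. The helper name `local_path_for` and the `index` fallback for URLs without a file name are assumptions, not taken from the actual download scripts.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit


def local_path_for(url):
    """Map a document URL to a relative path: <domain>/<path on the server>.

    Hypothetical helper; query strings and fragments are ignored.
    """
    parts = urlsplit(url)
    path = unquote(parts.path).lstrip("/")
    if not path or path.endswith("/"):
        # Assumed fallback name for URLs that do not end in a file name.
        path += "index"
    return str(PurePosixPath(parts.hostname or "") / path)


if __name__ == "__main__":
    print(local_path_for("https://www.domain.org/folder/item.pdf"))
```

A real mirror would also need to handle path segments such as `..` and name collisions between URLs that differ only in their query string.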
@stevieflow: I can add the scripts somewhere in this (or another repository); and is there a place to store the corpus?
Hi @rolfkleef great, thanks
Yes please, to:
"I can add the scripts somewhere in this (or another repository); and is there a place to store the corpus?"
A new repo in this org?
Ok, except GitHub recommends keeping a repo under 1GB (https://help.github.com/articles/what-is-my-disk-quota/), so the documents need to live somewhere else. They can stay on my ownCloud for now. From that page:
"If your repository exceeds 1GB, you might receive a polite email from GitHub Support requesting that you reduce the size of the repository to bring it back down."
I'll put the files from that gist into a repository!
@stevieflow @rolfkleef from my experience working with budget documents, it may be worth keeping this document repository: governments delete or replace documents, and the original contents are then lost. Would this also be part of the documentation site that we discussed in the June 9th meeting?
We've postponed the documentation site for the moment, and it is unlikely that we would host IATI documents as part of it.
We did use the documents to test some automated coding of documents against AGROVOC, which brought mixed results.
Several stories point towards some analysis use of existing (Ag-related) IATI activities that include the `document-link` element, with some documents accessible / available.
The task here is to: