Open timgdavies opened 8 years ago
Tim, I was also thinking that this could be useful for the data aggregation piece that we discussed during the meeting. Perhaps we could also test the AgroTagger on some of the individual country aid transparency portals like ForeignAssistance.gov, OpenAid.se, and the like to see if there is more publicly available information than what is included in their respective IATI files.
ACTION: build up a list of IATI activities that include the document-link element.
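A minimal sketch of how that list might be built, assuming the activities are available as IATI XML. The sample data, identifiers, and the function name `activities_with_documents` are made up for illustration; only the `iati-activity`, `iati-identifier`, and `document-link` element names and the `url` attribute come from the IATI standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; real input would be a published IATI activity file.
SAMPLE = """<iati-activities>
  <iati-activity>
    <iati-identifier>XX-1-EXAMPLE-001</iati-identifier>
    <document-link url="http://example.org/report.pdf" format="application/pdf">
      <title>Project report</title>
    </document-link>
  </iati-activity>
  <iati-activity>
    <iati-identifier>XX-1-EXAMPLE-002</iati-identifier>
  </iati-activity>
</iati-activities>"""


def activities_with_documents(xml_text):
    """Return (identifier, [document URLs]) for each activity that links a document."""
    root = ET.fromstring(xml_text)
    results = []
    for activity in root.iter("iati-activity"):
        # iter() also picks up document-link elements nested deeper in the activity.
        urls = [link.get("url") for link in activity.iter("document-link") if link.get("url")]
        if urls:
            identifier = activity.findtext("iati-identifier", default="").strip()
            results.append((identifier, urls))
    return results


for identifier, urls in activities_with_documents(SAMPLE):
    print(identifier, urls)
```

The resulting URLs could then be fetched and stored in one place for tagging tests.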
@timgdavies if we did this, and then pushed the resultant documents into a central place, would this be good for testing?
I saw the work in #17 which is great. I couldn't get AgroTagger to work from a local directory though, so I'm currently running against Rolf's list to scrape and convert all the docs I can, and will see if that runs successfully...
I've run all the documents in Rolf's list against the AgroVoc autotagger.
9,000 of the results are loaded into an Ontowiki install (2) at http://84.45.8.202/ accessed from the AgroTagger collection. You can browse through Crawled Documents to see the docs and the tags applied.
At a very quick test, a few learnings:
Will investigate the tagged corpus further to see what more we might learn about potential for auto-tagging.
Perhaps we could manually classify some documents and then compare against the AgroTagger classifications to assess validity/accuracy? This could also potentially help for machine learning: building a database of correct, verified classifications could maybe make future predictions more accurate?
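As a rough sketch of what such a comparison could look like for a single document, precision and recall of the automatic tags against the manual ones can be computed from two tag sets. The tag values and the function name `precision_recall` are illustrative assumptions, not actual AgroTagger output.

```python
def precision_recall(manual_tags, auto_tags):
    """Compare auto-applied tags against manually verified tags for one document."""
    # Normalise case and whitespace so "Irrigation" and "irrigation" match.
    manual = {t.strip().lower() for t in manual_tags}
    auto = {t.strip().lower() for t in auto_tags}
    agreed = manual & auto
    # Precision: share of auto tags a human also chose.
    precision = len(agreed) / len(auto) if auto else 0.0
    # Recall: share of human tags the tagger found.
    recall = len(agreed) / len(manual) if manual else 0.0
    return precision, recall


# Hypothetical tags for one document.
manual = ["irrigation", "maize", "food security"]
auto = ["Irrigation", "maize", "livestock", "rain"]
p, r = precision_recall(manual, auto)
print(f"precision={p:.2f} recall={r:.2f}")
```

Averaging these figures over a manually classified sample would give a first indication of tagging quality.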
Having a training set of documents would be a great idea.
This also raises a possible feature request: if it was possible to distinguish in published data between auto-applied tags, and human-applied tags, it may be possible to improve the training of machine-learning tools over time.
@timgdavies I don't seem to see Crawled documents at that Ontowiki install you mention?
While working on #17 I had a quick look at the Document Category codes to see if it made sense to look only at particular types of documents, but that didn't look too promising at first glance.
About a quarter of the document links have a language declared. The values used are: 2, af, ar, "da", da, de, "en", en, EN, English, "es", es, "fr", fr, French, ja, nl, Portuguese, pt, ru, Spanish, sv
Around 85% of the links are declared as English.
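Since the same language appears under several spellings in the values above, a small normalisation step may help before counting. This is only a sketch: the `NAME_TO_CODE` mapping and the function name `normalise_language` are assumptions, and the mapping covers just the language names seen in the list.

```python
from collections import Counter

# Assumed mapping from full language names (as seen in the data) to two-letter codes.
NAME_TO_CODE = {"english": "en", "french": "fr", "spanish": "es", "portuguese": "pt"}


def normalise_language(raw):
    """Map a declared language value to a lowercase two-letter code, or None if unusable."""
    value = raw.strip().strip('"').strip().lower()
    if value in NAME_TO_CODE:
        return NAME_TO_CODE[value]
    if len(value) == 2 and value.isascii() and value.isalpha():
        return value
    return None  # e.g. the stray value "2"


# A few of the declared values from the list above.
declared = ['"en"', "en", "EN", "English", "fr", "French", "2", "Portuguese", "pt"]
counts = Counter(normalise_language(v) for v in declared)
print(counts)
```

Values that come back as `None` would need a manual look rather than being silently dropped.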
Sorry - try with this link: http://84.45.8.202/index.php/model/info/?m=http%3A%2F%2F84.45.8.202%2Findex.php%2FAgroTagger%2F
I've been exploring the use of AgroTagger, which automatically classifies text documents against AgroVoc.
In a test, I've got this working to apply tags to the Landscape Analysis report which it tagged with the terms:
Building on this we should run a test against some IATI data and documents to see the quality of tagging against project documents.