OpenAgFunding / development

For developing responses to current gaps in the availability and usability of open data on funding for agriculture and food security.

Using auto-tagging of IATI Activities by attached documents and description texts #1

Open timgdavies opened 8 years ago

timgdavies commented 8 years ago

I've been exploring use of AgroTagger which will automatically classify text documents against AgroVoc.

In a test, I've got this working to apply tags to the Landscape Analysis report which it tagged with the terms:

  1. research
  2. landscaping
  3. investment
  4. landscape
  5. publications
  6. fisheries
  7. agriculture
  8. forestry
  9. surveys
  10. scientists

Building on this we should run a test against some IATI data and documents to see the quality of tagging against project documents.

mikecastro commented 8 years ago

Tim, I was also thinking that this could be useful for the data aggregation piece that we discussed during the meeting. Perhaps we could also test the AgroTagger on some of the individual country aid transparency portals, such as ForeignAssistance.gov and OpenAid.se, to see if there is more publicly available information than what is included in their respective IATI files.

stevieflow commented 8 years ago

ACTION: build up a list of IATI activities that include the document-link element.

@timgdavies if we did this, and then pushed the resultant documents into a central place, would this be good for testing?
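A minimal sketch of how such a list could be built, assuming each IATI file is an XML document whose `iati-activity` elements carry an `iati-identifier` child and zero or more `document-link` children with a `url` attribute. The function name and the sample data are hypothetical, for illustration only, and this is not the approach taken in #17.

```python
import xml.etree.ElementTree as ET


def activities_with_documents(xml_text):
    """Return [(iati_identifier, [document_url, ...])] for activities with document links."""
    root = ET.fromstring(xml_text)
    results = []
    for activity in root.iter("iati-activity"):
        # Collect the url attribute of every document-link directly under the activity.
        urls = [
            link.get("url")
            for link in activity.findall("document-link")
            if link.get("url")
        ]
        if urls:
            identifier = (activity.findtext("iati-identifier") or "").strip()
            results.append((identifier, urls))
    return results


# Hypothetical sample data: one activity with a document, one without.
SAMPLE = """<iati-activities>
  <iati-activity>
    <iati-identifier>XX-EXAMPLE-1</iati-identifier>
    <document-link url="http://example.org/report.pdf" format="application/pdf"/>
  </iati-activity>
  <iati-activity>
    <iati-identifier>XX-EXAMPLE-2</iati-identifier>
  </iati-activity>
</iati-activities>"""

if __name__ == "__main__":
    for identifier, urls in activities_with_documents(SAMPLE):
        print(identifier, urls)
```

The resulting (identifier, URLs) pairs could then be used to fetch the documents into one place for tagging.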

timgdavies commented 8 years ago

I saw the work in #17, which is great. I couldn't get AgroTagger to work from a local directory though, so I'm currently running against Rolf's list to scrape and convert all the docs I can, and will see if that runs successfully...

timgdavies commented 8 years ago

I've run all the documents in Rolf's list against the AgroVoc autotagger.

9,000 of the results are loaded into an Ontowiki install (2) at http://84.45.8.202/ accessed from the AgroTagger collection. You can browse through Crawled Documents to see the docs and the tags applied.

At a very quick test, a few learnings:

Will investigate the tagged corpus further to see what more we might learn about potential for auto-tagging.

dwalker101 commented 8 years ago

Perhaps we could manually classify some documents and then compare against the AgroTagger classifications to assess validity/accuracy? This could also potentially help with machine learning -- building a database of correct, verified classifications could maybe make future predictions more accurate?
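One way such a comparison could be scored, as a rough sketch: treat the manual tags for a document as the reference set and compute precision and recall of the automatic tags against it. The function name and the example tag lists are hypothetical, and exact string matching after lowercasing is an assumption; matching on AgroVoc concept identifiers would likely be more robust.

```python
def tag_agreement(manual, auto):
    """Return (precision, recall) of auto-applied tags against manually applied tags."""
    manual_set = {tag.strip().lower() for tag in manual}
    auto_set = {tag.strip().lower() for tag in auto}
    shared = manual_set & auto_set
    # Precision: share of automatic tags that a human also chose.
    precision = len(shared) / len(auto_set) if auto_set else 0.0
    # Recall: share of human tags that the tagger also found.
    recall = len(shared) / len(manual_set) if manual_set else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical tags for a single document.
    manual_tags = ["agriculture", "fisheries", "investment"]
    auto_tags = ["agriculture", "forestry", "fisheries", "surveys"]
    print(tag_agreement(manual_tags, auto_tags))
```

Averaging these two figures over a manually classified sample would give a first indication of tagging quality.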

timgdavies commented 8 years ago

Having a training set of documents would be a great idea.

This also raises a possible feature request: if it were possible to distinguish in published data between auto-applied tags and human-applied tags, it may be possible to improve the training of machine-learning tools over time.

rolfkleef commented 8 years ago

@timgdavies I don't seem to see Crawled documents at that Ontowiki install you mention?

While working on #17 I had a quick look at the Document Category codes to see if it made sense to look only at particular types of documents, but that didn't look too promising at first glance.

About a quarter of the document links have a field which may help with language detection -- or prove that it is not a reliable field. Languages in my data: 2, af, ar, "da", da, de, "en", en, EN, English, "es", es, "fr", fr, French, ja, nl, Portuguese, pt, ru, Spanish, sv. Around 85% of the links are declared as English.
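If the field were to be used, the values listed above would need normalising first. A minimal sketch, assuming the goal is a two-letter lowercase code and that anything unrecognised (such as `2`) is treated as missing; the lookup tables and function name are hypothetical and cover only the values seen in this list.

```python
# Language names seen in the data, mapped to two-letter codes (assumed mapping).
NAME_TO_CODE = {
    "english": "en",
    "french": "fr",
    "portuguese": "pt",
    "spanish": "es",
}

# Two-letter codes seen in the data.
KNOWN_CODES = {"af", "ar", "da", "de", "en", "es", "fr", "ja", "nl", "pt", "ru", "sv"}


def normalise_language(raw):
    """Return a lowercase two-letter code, or None if the value is not recognised."""
    # Drop surrounding whitespace and stray double quotes, then lowercase.
    value = raw.strip().strip('"').strip().lower()
    value = NAME_TO_CODE.get(value, value)
    return value if value in KNOWN_CODES else None


if __name__ == "__main__":
    for raw in ['"en"', "EN", "English", "Portuguese", "2"]:
        print(repr(raw), "->", normalise_language(raw))
```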

timgdavies commented 8 years ago

Sorry - try with this link: http://84.45.8.202/index.php/model/info/?m=http%3A%2F%2F84.45.8.202%2Findex.php%2FAgroTagger%2F