Signsofliteracy / Signoff

Tools for the study of historical literacy
http://signsofliteracy.org/

WORKSHOP SUB-TOPIC: Designing a Kaggle competition to do a POC on preliminary datasets of markes, initials & signatures #20

Open Addaci opened 6 years ago

Addaci commented 6 years ago

@gbohner has suggested that the Signs of Literacy community set up a Kaggle research competition to perform a Proof of Concept on preliminary datasets of markes, initials & signatures. Gergo states that "The Kaggle community is adept at applying a range of potential algorithms quickly to solve similar problems, and it will give you a good idea of 1) how well AI can replace the volunteers on your task(s), and a starting point for 2) what are the kind of algorithms you might want to improve upon in a collaboration with a research lab."

Addaci commented 6 years ago

Kaggle competitions are based on supervised learning, and require both a training set and a test set of data for which the target predicted value already has a ground truth outcome available. Kaggle will use this ground truth as an answer key to score the accuracy of participants’ submissions in real time over the course of the competition. For more information see https://www.kaggle.com/host/business and https://www.kaggle.com/host/research.
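As a rough illustration of the answer-key scoring described above, here is a minimal Python sketch. It assumes plain accuracy as the metric, and the image IDs and label names (`marke`, `initials`, `signature`) are hypothetical; the actual evaluation metric would be chosen with Kaggle during vetting.

```python
def accuracy(ground_truth: dict[str, str], submission: dict[str, str]) -> float:
    """Fraction of answer-key images whose submitted label matches the ground truth.

    Images missing from the submission count as wrong.
    """
    if not ground_truth:
        raise ValueError("ground truth is empty")
    correct = sum(
        1
        for image_id, label in ground_truth.items()
        if submission.get(image_id) == label
    )
    return correct / len(ground_truth)


# Hypothetical answer key held back by the competition host.
answer_key = {
    "img_001": "marke",
    "img_002": "initials",
    "img_003": "signature",
    "img_004": "marke",
}

# Hypothetical participant submission: one label per test image.
submission = {
    "img_001": "marke",
    "img_002": "signature",  # wrong
    "img_003": "signature",
    "img_004": "marke",
}

score = accuracy(answer_key, submission)
print(f"accuracy = {score:.2f}")
```

Three of the four submitted labels match the answer key, so this toy submission scores 0.75.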

Background

Some of the best work done on Kaggle has been on research competitions. They present new, unique machine learning datasets and use-cases to the community, and we want to make them available to Kagglers on a more regular basis.

Historically, the process of making research competitions available has been ad-hoc and highly circumstantial. Going forward, we want to make it easier for motivated research, academic, and non-profit groups to host open source challenges on Kaggle.

Opportunity

Kaggle will partner with organizations to host up to 5 pro-bono research competitions a year. We are asking interested organizations to submit a brief proposal for Kaggle’s consideration.

Requirements

To be eligible to host a research competition, the host organization must:

Selection Criteria

Kaggle will select research hosts on a rolling basis over the entire year. We will opportunistically reach out to prospective hosts based on the content of their application forms. Primarily, we will be looking for:

Fit - Is the project a cleanly structured supervised learning problem? Is there ample data available to support a successful competition?

Feasibility - Will it be logistically possible to make the data available for the competition? Is the hosting organization in a position to facilitate a successful competition?

Impact - What is the scope of the problem this competition addresses? How impactful would the open source solutions be on the industry this competition supports?

Kaggle and the prospective host will go through a vetting process before setting up and launching a competition. This vetting process will include a review of the dataset, some premodeling of the problem, and selection of an evaluation metric. The host organization will also need to sign a data-use agreement and be ready to transfer Kaggle the full value of the prize pool upon launch of the competition. Kaggle will facilitate the payment of prizes to winners at the competition close.

Addaci commented 6 years ago

We have posted a request for help on the Kaggle Twitter account.

Addaci commented 6 years ago

On Monday, April 30th, 2018, we submitted a draft proposal to Kaggle for a Kaggle-hosted, pro bono Kaggle Research Competition. As background, Kaggle has access to a community of ca. 500,000 AI practitioners, machine learners and data scientists. It was set up as a crowdsourced approach to R&D in machine learning, and was acquired in March 2017 by Google. It is being maintained as a separately branded service within Google Cloud.

We have also publicised the proposed Kaggle Research Competition idea on LinkedIn and Twitter, and it is attracting interest:

https://www.linkedin.com/pulse/proposed-signs-literacy-kaggle-research-competition-2018-greenstreet/

We have received a quick and positive response from Maggie Demkin of Kaggle:

"Hello Colin,

Thank you for reaching out to Kaggle, this sounds like a very interesting opportunity. It makes sense to set up time to review your idea in more detail. During this meeting I can tell you more about Kaggle and what we do. I am located in California (Pacific Time), what are some dates/time in the next few weeks that work for you?

Thanks, Maggie"

Addaci commented 6 years ago

@mhailwood @voetnoot @Giovanni1085 @gbohner @MvanErp Positive initial response from Dr Peter Bloomfield, AI/ML Engagement Manager at Digital Catapult in London, to my approach seeking to interest him and machine learners at Digital Catapult in our proposed Kaggle competition:

"Hi Colin,

It was nice to meet you briefly, thanks for getting in touch! I'll have a think about the Kaggle competition and discuss it with some more of my team and get back to you! initially it sounds like good fun, so it would mostly be a case of being able to allocate resources to it

All the best

Peter

Peter Bloomfield PhD, AI/ML Policy + Engagement Manager

Check out the Machine Intelligence Garage! www.migarage.ai

Digital Catapult 101 Euston Road London, NW1 2RA"

Addaci commented 6 years ago

Video call arranged with Maggie Demkin of Kaggle for Tuesday, May 8th, 2018, 5pm – 5:45pm London time, "to provide you with a 20-30 minute overview on Kaggle and to answer any questions you have about hosting a competition". Also invited to join the call, if available and interested: @mhailwood @voetnoot @gbohner

Addaci commented 6 years ago

Successful video call with Maggie Demkin of Kaggle on Tuesday, May 8th, 2018:

"Hello Colin,

Thanks again for meeting with me today and sharing your idea for a Kaggle competition. I really enjoyed our conversation. We would see your competition on signs of literacy could be a good research competition for Kaggle. As we discussed I have attached a contract template that you can use to fill in information about your organization. There are three pieces to this document:

When you have a chance and can send us a data sample and a data dictionary of your proposed dataset, that would be great and would allow us to better assess the competition structure.

Thanks again, I look forward to hearing from you. Maggie Demkin"

Addaci commented 6 years ago

Email from Colin, Tuesday, May 8th, 2018

"Hi Mark and Mark, and members and friends of the Signs of Literacy community,

I am just off a video call with Maggie Demkin of Kaggle. Maggie has confirmed strong interest in Kaggle hosting a Signs of Literacy Research Competition on a pro bono basis, to run from November 2018 to early/mid-January 2019 https://www.kaggle.com/host/research

We have agreed that a research competition built around a Signs of Literacy image data set of markes, initials and signatures is well suited to Kaggle and fits Kaggle's selection criteria for pro bono competitions. We have also agreed that a competition running from November 2018 to early/mid-January 2019 would work well with Kaggle's pipeline of competitions, and would play well to participants using the Christmas break to finalise their competing algorithms.

I propose to start building the training dataset and test dataset in July and aim to have them complete by early/mid-October 2018, working with MarineLives volunteers, and possibly with an assistant project manager from the University of Warwick. Mark Ponte is making available a set of 7,000 images, which have already been partially tagged, from the Amsterdam notarial archives, and in addition we will draw on 6,000 images containing markes, initials and signatures already on the MarineLives wiki, sourced from documents at the TNA. Colin will coordinate with Mark Hailwood and Mark Ponte to determine grading standards by which to grade and tag markes, initials and signatures within their separate classes according to sophistication of execution.

I am pasting below my pitch to Kaggle to remind you of the proposal. The proposal is also summarised in an April 30th, 2018 LinkedIn posting:

https://www.linkedin.com/pulse/proposed-signs-literacy-kaggle-research-competition-2018-greenstreet/

We have agreed three next steps:

(1) Signs of Literacy will form a small steering group for the competition, and will also provide a technical lead to interface with Kaggle, as well as me as the general lead [Colin to discuss with Giovanni and Gergo]

(2) Signs of Literacy will put together a small sample data set, consisting of tagged images (n = ca. 200) for the Kaggle data team to look at and to provide feedback [Colin to coordinate]. The Kaggle data team will advise on the appropriate size of the training and test datasets we will need to create.

(3) Kaggle will initiate their contracting process, on the basis of a pro bono provision of services by Kaggle, and MarineLives/Chronoscopic Education (for Signs of Literacy) going out and raising ca. $25,000 prize money (which will go entirely to participants, and not to Kaggle overhead). Chronoscopic Education (as a to-be-registered UK charity) will be the legal contracting party with Kaggle http://chronoscopic.org

This is now a very real initiative, and I am keen to hear back from individuals and organisations as to whether they wish to be involved, and how they might be involved. Involvement could range from volunteering time, forming a Kaggle competition team of machine learners, contributing data, and/or contributing to the prize money as a sponsor or partner.

Mark Ponte has already confirmed his willingness to participate in the small steering group we are forming, which ideally would combine historical, archival and technical expertise, spanning UK and the Netherlands.

With best wishes

Colin"
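The email above describes tagging images by class (markes, initials, signatures) and by a grade for sophistication of execution, then building training and test datasets. One way this could be done is a stratified split, so that every class/grade combination appears in both sets in similar proportions. The sketch below is only an illustration: the field names (`image_id`, `sign_class`, `grade`), the 1 to 3 grade scale, and the 20% test fraction are all assumptions, not agreed standards.

```python
import random
from collections import defaultdict


def stratified_split(records, test_fraction=0.2, seed=0):
    """Split tagged records into train/test, stratified by (sign_class, grade)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible

    # Group records by their class/grade combination.
    by_stratum = defaultdict(list)
    for rec in records:
        by_stratum[(rec["sign_class"], rec["grade"])].append(rec)

    train, test = [], []
    for stratum in sorted(by_stratum):
        items = by_stratum[stratum]
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


# Hypothetical tagged sample: 3 classes x 3 grades x 10 images each.
records = [
    {"image_id": f"{cls}_{grade}_{i}", "sign_class": cls, "grade": grade}
    for cls in ("marke", "initials", "signature")
    for grade in (1, 2, 3)
    for i in range(10)
]

train, test = stratified_split(records, test_fraction=0.2)
print(len(train), len(test))
```

With ten images per class/grade combination, two from each combination go to the test set. The Kaggle data team's advice on dataset sizes would determine the real fraction.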

Addaci commented 6 years ago

Copy of Kaggle Competition Services Project Order, obtained from Maggie Demkin of Kaggle: https://github.com/Signsofliteracy/Signoff/blob/master/Kaggle%20Competition%20Services%20Project%20Order%20and%20Agreement.docx

Addaci commented 6 years ago

@mhailwood @voetnoot @Giovanni1085 @gbohner @BarbaraMcG @jellevanlottum @MvanErp Half-hour telephone discussion on Thursday, May 17th, 2018, between Colin Greenstreet and Dr Peter Bloomfield, AI/ML Engagement Manager, Digital Catapult, about the potential involvement of Digital Catapult as participant, partner/sponsor, and part funder of the Signs of Literacy Kaggle Research Competition prize.

Addaci commented 6 years ago

@mhailwood @voetnoot @Giovanni1085 @gbohner @BarbaraMcG @jellevanlottum @MvanErp Slides and full text of the presentation by Colin Greenstreet at the IIIF Washington DC conference, morning session, Thursday, May 24th, 2018. The presentation lays out a vision for the application of machine learning to the study of historical literacy as one of the technologies available to the Signs of Literacy community. We will explore this at a more detailed operational level at the Stadsarchief workshop on Tuesday, June 5th, 2018.

https://github.com/Signsofliteracy/Signoff/blob/master/Chronoscopic_Education_IIIF_Speakers_Notes_27052018_FINAL.pdf

https://github.com/Signsofliteracy/Signoff/blob/master/IIIF-Chronoscopic_Education_Presentation_26052018_PUBLISHED_SLIDES_SPEAKERSNOTES_PDF.pptx