Indicia-Team / google-archive

Automatically exported from code.google.com/p/indicia

Natural language generation - narratives on receipt of a record #493

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
As requested by Bill Sutherland during the NBN conference. When a record is 
submitted, the reply is a rather brief thanks for the record. A much nicer 
reply would include a narrative putting the record in the context of other 
records of the same species and species group. I didn't manage to note Bill's 
exact example, but Helen Roy has provided the following examples:

Thank you so much for your XXXX ladybird record.  This is the XXXX record of 
this species that we have received in 20XX.  However, it is the X record for 
the 10km square in which your sighting was reported.  It is the X most 
frequently recorded ladybird - there are 47 species of ladybird in the UK.  
Other species of ladybird that have been recorded in your locality (within 10km 
of your sighting) in the last month include: ...

Some possible extras...

This is a new county record for this species.

This is a new 10km record for this species.
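
The wording above can be filled in by simple template substitution once the counts are known. The following is a minimal sketch only, assuming the counts have already been fetched from the database; the function names `ordinal` and `record_narrative` and all parameter names are hypothetical and not part of Indicia.

```python
def ordinal(n: int) -> str:
    """Return n with its English ordinal suffix (1 -> '1st', 12 -> '12th')."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # covers 11th, 12th, 13th and the other teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def record_narrative(species, group, year, nth_in_year, nth_in_square,
                     rank, group_size, new_county=False, new_square=False):
    """Build the reply text from counts supplied by the caller (all assumed inputs)."""
    parts = [
        f"Thank you so much for your {species} {group} record.",
        f"This is the {ordinal(nth_in_year)} record of this species that we "
        f"have received in {year}.",
        f"However, it is the {ordinal(nth_in_square)} record for the 10km "
        f"square in which your sighting was reported.",
        f"It is the {ordinal(rank)} most frequently recorded {group} - there "
        f"are {group_size} species of {group} in the UK.",
    ]
    # Optional extras, only added when the caller says they apply.
    if new_county:
        parts.append("This is a new county record for this species.")
    if new_square:
        parts.append("This is a new 10km record for this species.")
    return " ".join(parts)
```

For example, `record_narrative("7-spot", "ladybird", 2013, nth_in_year=112, nth_in_square=3, rank=2, group_size=47)` returns the four-sentence reply. The counts in that call are made up for illustration.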

Original issue reported on code.google.com by johnvanb...@gmail.com on 18 Nov 2013 at 9:46

GoogleCodeExporter commented 9 years ago
On investigation, the natural language generation aspect of this does not 
appear too tricky. The main difficulty will be the performance overhead of 
obtaining the appropriate summary data from the database on the fly - any delay 
to submission would negate the positive effect of this modification.

Original comment by johnvanb...@gmail.com on 18 Nov 2013 at 9:48

GoogleCodeExporter commented 9 years ago
David Roy's comment:
I had a thought about the messages generated when records are submitted - could 
they be carried forward to the verification system and pre-populate the email 
message for verifiers when they select 'email recorders'?  Or does this create a 
longer term storage problem?

My thoughts:
I suspect that storing the generated messages might require a fair bit of disk 
space, but they would not go into an area of the db that needs to stay loaded in 
memory, unlike the occurrences cache table. Therefore I don't think that storing 
the messages is a particular problem.

Original comment by johnvanb...@gmail.com on 19 Nov 2013 at 9:15

GoogleCodeExporter commented 9 years ago
The trick is going to be working out how to extract all the summary data from 
the database to support the generation of these statements quickly enough to 
not have a detrimental effect on record submission. I think it might be a good 
idea to give the existing “thanks” message, with an extra link “find out 
more about your record...”. When clicked, this can go to the database and 
generate the language response, so a slight delay will be acceptable. This also 
has the advantage of not stating the obvious to seasoned recorders all the time.
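
A rough sketch of that two-stage idea, assuming a slow `build_narrative` callable that does the summary queries: the class name, the link path and the in-memory dict are all hypothetical stand-ins. A real version would store the text in the database, which would also allow it to be reused in verifier emails as David suggested.

```python
from typing import Callable, Dict


class NarrativeService:
    """Two-stage reply: instant thanks, narrative built only on request."""

    def __init__(self, build_narrative: Callable[[int], str]):
        self._build = build_narrative      # assumed slow: runs summary queries
        self._stored: Dict[int, str] = {}  # stand-in for a database table

    def thanks(self, record_id: int) -> str:
        # Stage 1: no summary queries, so submission is not delayed.
        # The link path below is a made-up example.
        return ("Thanks for your record. Find out more about your record: "
                f"/record/{record_id}/narrative")

    def narrative(self, record_id: int) -> str:
        # Stage 2: generate on first click, then reuse the stored text.
        if record_id not in self._stored:
            self._stored[record_id] = self._build(record_id)
        return self._stored[record_id]
```

Because the narrative is only built when the link is followed, recorders who never click it cost nothing beyond the existing thanks message.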

Comment from Peter Brown:
That sounds like a very good suggestion (i.e. to have this as a two-stage 
process). And Helen I like your choice of wording below. It'll be excellent to 
give recorders the opportunity to learn so much more about their record (if 
they want to...) and is bound to encourage further recording.

Original comment by johnvanb...@gmail.com on 19 Nov 2013 at 6:49

GoogleCodeExporter commented 9 years ago
I just got back from a conference which was largely about natural language 
generation so I thought I might give my 2 cents.

The examples you give are interesting and likely to interest recorders 
(or at least those new to recording). However, they don't make the most of the 
potential of NLG. We should aim not just to interest the recorder but to 
educate them, focus them and challenge them so that instead of getting more 
records, we get more records of higher quality.

One example from the conference was BeeWatch 
(http://homepages.abdn.ac.uk/wpn003/beewatch/index.php?r=user/auth). Here, a 
part of the response message reads something like:

'You submitted your record as <species a> but it is in fact <species b>. You 
correctly identified <trait 1>, <trait 2> and <trait 3>, which are shared by 
these species. The traits you need to look out for to distinguish these are 
<trait 4> and <trait 5>. For <trait 4>, <species a> is <attribute 4a> and 
<species b> is <attribute 4b>. As for <trait 5>, <species a> is 
<attribute 5a> and <species b> is <attribute 5b>.'

The researchers showed that this (as opposed to just a thank you message) 
resulted in a significant improvement in ID skill and a big improvement in 
volunteer retention. You also mention issues of extracting data to fill in the 
blanks. Using this method requires very little data, simply a table of traits 
for each species in your group of interest (this would be a big hurdle for some 
groups e.g. Diptera, but easy for others e.g. Ladybirds).
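
To illustrate how little data this needs, here is a hedged sketch that builds the message from a trait table alone. It is not BeeWatch's implementation; the table layout, the function names and the placeholder species and traits are all assumptions for illustration.

```python
from typing import Dict, List

TraitTable = Dict[str, Dict[str, str]]  # species -> {trait: attribute}

# Placeholder data only; a real table would hold traits for each species.
EXAMPLE_TRAITS: TraitTable = {
    "species a": {"trait 1": "x", "trait 2": "y", "trait 4": "attribute 4a"},
    "species b": {"trait 1": "x", "trait 2": "y", "trait 4": "attribute 4b"},
}


def join_list(items: List[str]) -> str:
    """Join items as 'a', 'a and b' or 'a, b and c'."""
    if len(items) <= 1:
        return "".join(items)
    return ", ".join(items[:-1]) + " and " + items[-1]


def id_feedback(submitted: str, actual: str, traits: TraitTable) -> str:
    """Explain a misidentification using shared and differing traits."""
    a, b = traits[submitted], traits[actual]
    common = [t for t in a if t in b]              # keeps table order
    shared = [t for t in common if a[t] == b[t]]
    differing = [t for t in common if a[t] != b[t]]
    parts = [f"You submitted your record as {submitted} "
             f"but it is in fact {actual}."]
    if shared:
        parts.append(f"You correctly identified {join_list(shared)}, "
                     "which are shared by these species.")
    if differing:
        parts.append("The traits you need to look out for to distinguish "
                     f"these are {join_list(differing)}.")
        for t in differing:
            parts.append(f"For {t}, {submitted} is {a[t]} "
                         f"and {actual} is {b[t]}.")
    return " ".join(parts)
```

Calling `id_feedback("species a", "species b", EXAMPLE_TRAITS)` gives a message in the shape quoted above, with no queries beyond the trait lookup.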

While this improves ID skills we can also improve spatial coverage by 
motivating recorders to record where we need it most:

'Thanks for your record of <species a> from <location>. It is likely that this 
species is also in <other location> (<link to map of other location>), but we 
have not got any records from <other location>. If you are able to send in a 
record of <species a> from <other location> that would really help our 
research.'

'Thanks for your record of <species a> from <location>. We also think <species 
b> (<link to info on species b>) is likely to be in <location>, so please keep 
an eye out for it next time you're out.'

Some people like a challenge, or positive reinforcement, so you could think of 
messages like:

'You just submitted your longest list: <length>. Complete lists, where you 
record everything you see, are great for answering research questions, thanks!'

'You just recorded your <milestone number> species, congratulations! Here's to 
the next <milestone number>.'

'Your record for <location> is important because <location> is poorly recorded 
(it is in the bottom <location percentile>). Recording in these areas really 
helps improve the quality of our data'
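
The milestone and coverage messages only need simple counts. A minimal sketch follows, assuming per-location record counts are available as a dict. The milestone values, the 25% cut-off and the definition of "bottom percentile" (share of locations with the same or fewer records) are my assumptions, not anything agreed here.

```python
from typing import Dict, Optional

# Assumed milestones; all of these happen to take the "th" suffix.
MILESTONES = (10, 50, 100, 500)


def bottom_percent(location: str, record_counts: Dict[str, int]) -> int:
    """Percentage of locations with the same or fewer records than this one."""
    mine = record_counts[location]
    at_or_below = sum(1 for c in record_counts.values() if c <= mine)
    return round(100 * at_or_below / len(record_counts))


def coverage_message(location: str, record_counts: Dict[str, int],
                     threshold: int = 25) -> Optional[str]:
    """Return a message only for poorly recorded locations, else None."""
    pct = bottom_percent(location, record_counts)
    if pct > threshold:
        return None
    return (f"Your record for {location} is important because {location} is "
            f"poorly recorded (it is in the bottom {pct}%). Recording in "
            "these areas really helps improve the quality of our data.")


def milestone_message(species_count: int) -> Optional[str]:
    """Return a congratulation when the recorder's species total hits a milestone."""
    if species_count not in MILESTONES:
        return None
    return (f"You just recorded your {species_count}th species, "
            f"congratulations! Here's to the next {species_count}.")
```

Returning `None` when no message applies keeps the reply short for recorders who have not hit a milestone or who are recording in well-covered areas.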

I think that there are definitely technical challenges in doing this type of 
work, but I would argue that designing responses that make the most of the 
opportunity NLG affords us will be just as difficult.

I should also say that I spoke to the guys at Aberdeen (from the computer 
sciences department) who are behind a lot of their NLG work and they are keen 
to foster collaborations, and this might be best achieved by taking on one of 
their masters students whose projects start in January.

Original comment by tomaugus...@googlemail.com on 13 Jun 2014 at 9:15

GoogleCodeExporter commented 9 years ago
If you develop this, there must be an opt-out button.  I really appreciate 
getting messages from verifiers, but would definitely not want to receive 
automated messages.
Be wary of including statements like 'this is a new county record' or 'this is 
a new 10km record', because presumably this would only be based on the data held 
on the NBN database and Indicia warehouse.  There have been recent instances of 
recorders publicly claiming they've got 'firsts' because there are no records 
on the NBN Gateway, only to have it pointed out to them that there are records 
in the literature.
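
Both points could be handled in the message builder. This is a sketch under assumptions: the function names, the `opted_out` flag and the exact wording are hypothetical, and the wording is only one way of scoping the claim to the data actually held.

```python
from typing import Optional


def first_record_note(scope: str, existing_count: int,
                      sources: str) -> Optional[str]:
    """Scoped 'first record' wording that avoids claiming an outright first.

    scope is e.g. 'county' or '10km square'; sources names the datasets checked.
    """
    if existing_count > 0:
        return None
    return (f"We hold no earlier records of this species for this {scope} in "
            f"{sources}. Records may exist elsewhere, for example in the "
            "literature.")


def automated_reply(opted_out: bool, narrative: str) -> str:
    """Send only the plain thanks to recorders who opted out of narratives."""
    base = "Thank you for your record."
    return base if opted_out else f"{base} {narrative}"
```

For example, `first_record_note("county", 0, "the Indicia warehouse")` yields the scoped statement, while any non-zero count yields no statement at all.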

Original comment by PaulaNBN@gmail.com on 5 Sep 2014 at 6:38