Genus level data collation

johnvanbreda commented 8 years ago

Raised by Paul: I’m not sure how much more work this will entail, but is it possible to group together individual species recorded on the same night to give genus totals as well? This will give us a much larger dataset for comparison if users stratify at the genus level. For example, data entered:

P. pip 1/1/2016 14 passes
P. pyg 1/1/2016 12 passes
Pipistrellus 1/1/2016 3 passes (where bats couldn’t be identified down to species level)

However, for storing on our database the Pipistrellus row should be classified as 29 passes (rather than 3).

johnvanbreda commented 8 years ago

Is the intention here that if I put Pipistrellus into the manually specified parameters for analysis that it would be doing a comparison of all Pipistrellus records including the species as well as the genus?

johnvanbreda commented 8 years ago

Response from Paul: Yes, I’m envisaging the following:

1) User inputs data, for example 1 night of recording where the CSV file contains 4 rows of data:

P. pipistrellus 12 passes
P. pygmaeus 14 passes
P. nathusii 1 pass
Pipistrellus 3 passes (where the user couldn’t be certain which pipistrelle species they had identified)

2) The species data is automatically uploaded to our database, but rather than storing 3 passes for Pipistrellus (at the genus level) the system automatically registers that there were 30 Pipistrellus passes and stores that instead. So the final rows in the database would be:

P. pipistrellus 12 passes
P. pygmaeus 14 passes
P. nathusii 1 pass
Pipistrellus 30 passes (12+14+1+3) (genus level total)

3) During analysis a user manually enters Pipistrellus as the specified parameter and we search for the genus level total in the database.
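The arithmetic in step 2 can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual import code: the `GENUS_OF` lookup and the `with_genus_totals` function are hypothetical names, and the real system would take the species-to-genus relationship from the species list.

```python
from collections import defaultdict

# Hypothetical lookup from the labels used in the example to their genus.
GENUS_OF = {
    "P. pipistrellus": "Pipistrellus",
    "P. pygmaeus": "Pipistrellus",
    "P. nathusii": "Pipistrellus",
    "Pipistrellus": "Pipistrellus",  # a genus-level row belongs to itself
}

def with_genus_totals(rows):
    """Return the rows to store: species rows unchanged, genus rows as totals."""
    totals = defaultdict(int)
    for taxon, passes in rows:
        totals[GENUS_OF[taxon]] += passes
    # Keep species rows exactly as uploaded.
    stored = {taxon: passes for taxon, passes in rows if GENUS_OF[taxon] != taxon}
    # Replace (or create) the genus row with the aggregated total.
    stored.update(totals)
    return stored

uploaded = [
    ("P. pipistrellus", 12),
    ("P. pygmaeus", 14),
    ("P. nathusii", 1),
    ("Pipistrellus", 3),
]
stored = with_genus_totals(uploaded)
print(stored["Pipistrellus"])  # 30
```

The species rows are stored as uploaded and only the genus row changes, so a species-level comparison is unaffected by the aggregation.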

johnvanbreda commented 8 years ago

Also I’d want to consider what happens on the import comparison when the import has both genera and species data - would you in effect get the genera nightlies or summary outputs with the species nightlies and summaries nested within?

johnvanbreda commented 8 years ago

Response from Paul: An example is probably the easiest way to clarify this.

1) If a species is entered then this is easy to deal with, i.e. Pipistrellus pipistrellus would provide a comparison and output with just Pipistrellus pipistrellus from the dataset.

So there are 2 options if Pipistrellus (genus) is entered:

a) A night only has bats identified to genus level, in which case the comparison and output is made with the genus level total in the database.

b) A night has some bats identified to species level and some only identified to genus level. As explained above, the total number of passes from that genus is calculated and this total is compared with the genus level total in the database.

For example: in one night, a user records 4 Noctule passes and 2 Nyctalus passes and we want to provide a reference range output for both of these.

4 Noctule passes would be compared with just Noctule passes from the dataset
6 Nyctalus passes would be compared with the Nyctalus genus level total in the database.

johnvanbreda commented 8 years ago

Having spent a while thinking about this today and yesterday and fiddling around with getting the reports to automatically aggregate the passes for a taxon into the output for a genus I think that your idea is the best way forward – we should actually store the totalled passes in the reference database table. I would actually propose that we store 2 values for the number of passes – firstly the number of passes as uploaded and secondly the number of passes calculated for use in the report output. But, I think it could be quite complex and there are quite a lot of cases to consider.

So, let’s work it through with a simple example. If I upload a dataset that contains:

Nyctalus 3 passes on 25/08/2016 at SU123456
Noctule 5 passes on 25/08/2016 at SU123456
Common Pip 7 passes on 25/08/2016 at SU123456

So, I’ll process this so that in the database we get:

Nyctalus 8 passes on 25/08/2016 at SU123456
Noctule 5 passes on 25/08/2016 at SU123456
Common Pip 7 passes on 25/08/2016 at SU123456

Agree

If I do a comparison of a single night for Nyctalus, it’s compared against the 8 passes; for Noctule it’s compared against the 5 passes.

Agree

If I then upload a 2nd dataset, this time just at species level:

Noctule 9 passes on 26/08/2016 at SU123456
Common Pip 11 passes on 26/08/2016 at SU123456

I have to then auto-generate an extra entry in the reference set for:

Nyctalus 9 passes on 26/08/2016 at SU123456

Otherwise the comparison for Nyctalus won’t get the full set of data.

If I do a comparison using the imports as the input (rather than specifying the parameters manually) then for the first import I’ll be comparing the following:

8 passes of Nyctalus against a reference set of [8,9]
5 passes of Noctule against a reference set of [5,9]
7 passes of Common Pip against a reference set of [7,11]
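As a rough check of the reference sets above, here is a Python sketch that processes both imports and collects the stored passes per taxon. The `GENUS_OF` lookup, `process_import` and the tuple layout are assumptions made for illustration only; they are not the warehouse schema. Under this sketch a Pipistrellus genus entry is auto-generated for the Common Pip rows as well.

```python
from collections import defaultdict

# Hypothetical species-to-genus lookup for the taxa in the example.
GENUS_OF = {"Noctule": "Nyctalus", "Common Pip": "Pipistrellus"}

def process_import(rows):
    """Turn uploaded (taxon, passes, date, place) rows into stored passes."""
    stored = {}
    genus_totals = defaultdict(int)
    for taxon, passes, date, place in rows:
        genus = GENUS_OF.get(taxon, taxon)  # genus-level rows map to themselves
        genus_totals[(genus, date, place)] += passes
        if genus != taxon:
            stored[(taxon, date, place)] = passes  # species rows kept as uploaded
    stored.update(genus_totals)  # genus rows hold the aggregated total
    return stored

import_1 = [
    ("Nyctalus", 3, "25/08/2016", "SU123456"),
    ("Noctule", 5, "25/08/2016", "SU123456"),
    ("Common Pip", 7, "25/08/2016", "SU123456"),
]
import_2 = [
    ("Noctule", 9, "26/08/2016", "SU123456"),
    ("Common Pip", 11, "26/08/2016", "SU123456"),
]

# Build the reference set: one list of nightly pass counts per taxon.
reference = defaultdict(list)
for rows in (import_1, import_2):
    for (taxon, _date, _place), passes in process_import(rows).items():
        reference[taxon].append(passes)

print(reference["Nyctalus"])    # [8, 9]
print(reference["Noctule"])     # [5, 9]
print(reference["Common Pip"])  # [7, 11]
```

The Nyctalus entry for 26/08/2016 exists only because it was auto-generated from the Noctule row, which is what makes the genus reference set complete.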

So far so good I think. Where it starts to get more complex:

1) If I do a reference analysis using the 2nd import as the input, do I include the auto-generated record of Nyctalus as if it were part of the import? I.e. should there be any output for the genus level, even though the import did not contain genera?

2) Further to the above, if you had an import with, say, 1 Pipistrellus species and 1 Rhinolophus species (ID at species level) then the reference output does not need to include the genus information as it will just be duplicated. But if you had 2 pip species in the import and used this as the input to the analysis, then having the aggregated genus level output included makes sense.

3) If I (or someone else) uploads another nightly, say 15 passes of a Leisler’s bat on 25/08/2016 at SU123456, then we need to consider whether this gets combined into the existing total for Nyctalus on that night.

If the answer to 3 is yes, then what happens if the same thing happens but at a more precise grid reference which is contained within the same square (SU12394569 for example)?

I am thinking that perhaps this aggregation to total up the passes for a genus should only happen when the information is otherwise identical – i.e. the exact same location and recorder. Otherwise it just starts to get too complex.

johnvanbreda commented 8 years ago

Response from Paul:

1) I hope I’ve got this point, but I think I would be happy with only providing species-species or genus-genus comparisons, i.e. if you upload noctule then you only get a comparison regarding noctule in return. My only concern here (and hopefully it wouldn’t happen) is that people might upload Noctule data (i.e. at species level) and then re-enter the entire dataset but change Noctule to Nyctalus. I don’t envisage this impacting the actual analysis, but we would be entering replicates into the dataset. I’m not sure if there’s an obvious way around it; maybe it’s something I could manually check a few months down the line after we’ve started accepting data? Or we just point them towards the ‘previous uploads’ tab if they wish to edit their uploads.

2) Just checking I’ve got this right: if we had 1 P. pipistrellus record uploaded then the reference output would just be for P. pipistrellus? But if P. pipistrellus and P. pygmaeus were uploaded then we would give individual outputs for P. pipistrellus and P. pygmaeus as well as a genus analysis (Pipistrellus)? I don’t think this is essential, but if it was an easy outcome to produce (given that we’ll be capturing genus data anyway) then it would be a nice addition.

3) I think this relates to my answer to 1), i.e. how we distinguish between separate surveys (which could easily happen with 2 detectors placed near to each other). The measurement we are using is passes per detector per night, so trying to group together multiple recordings to essentially give a ‘passes per grid square’ is risky I think. I therefore think we treat them separately, and both species and genus calculations are kept separate. I think the only potential problem would be what I pointed out in 1), but monitoring the dataset early on should spot this.

4) No, but I agree that we should only capture genus data within a single upload, when all the ‘essential’ information is identical, and we treat all separate uploads as different detectors unless we have specific info to treat them otherwise.

johnvanbreda commented 8 years ago

Ok, trying to summarise all this into a list of technical requirements/changes:

Within a single upload, we will create extra reference set entries automatically for all the genera present in the import that don't have their own row. A genus is created for each unique combination of date and place within the import.
For all the genus entries in the reference dataset, we'll calculate the number of passes as the number of passes provided in the import at genus level (if present) plus the sum of all the passes for the species within the genus.
If the genus level entries simply contain a single species with the same number of passes then they are used in the reference dataset when comparing against that genus. But, when you do a comparison against the import containing that genus entry there is no need to include it in the output as the information is duplicating the species. The genus is only included in the output if it adds new information.

Regarding the first point, there will be no attempt made to add any further intelligence to this, e.g. matching more precise grid squares into larger grid squares, or matching across different imports when aggregating the genus data.
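To make the three requirements concrete, here is a small Python sketch under stated assumptions. The `GenusEntry` class, `build_genus_entries` function, the `genus_of` lookup and the sample rows are all hypothetical and only illustrate the rules; they do not reflect how the warehouse actually stores the data. Grid references are matched as exact strings, so no attempt is made to merge a more precise square into a larger one.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class GenusEntry:
    passes: int = 0          # total used in the reference dataset
    uploaded: bool = False   # True if the import had its own genus-level row
    species_count: int = 0   # number of species rows that contributed

    @property
    def show_in_output(self):
        # Only reported when it adds information beyond a single species row.
        return self.uploaded or self.species_count > 1

def build_genus_entries(rows, genus_of):
    """Aggregate one import into genus entries keyed by (genus, date, place)."""
    entries = defaultdict(GenusEntry)
    for taxon, passes, date, place in rows:
        genus = genus_of.get(taxon, taxon)  # genus-level rows map to themselves
        entry = entries[(genus, date, place)]
        entry.passes += passes
        if genus == taxon:
            entry.uploaded = True
        else:
            entry.species_count += 1
    return dict(entries)

# Hypothetical lookup and sample data for illustration.
genus_of = {
    "Noctule": "Nyctalus",
    "Leisler's": "Nyctalus",
    "Common Pip": "Pipistrellus",
    "Soprano Pip": "Pipistrellus",
}
rows = [
    ("Noctule", 9, "26/08/2016", "SU123456"),
    ("Common Pip", 11, "26/08/2016", "SU123456"),
    ("Soprano Pip", 4, "26/08/2016", "SU123456"),
    ("Leisler's", 15, "26/08/2016", "SU12394569"),  # more precise grid ref, kept separate
]
entries = build_genus_entries(rows, genus_of)
for key, entry in entries.items():
    print(key, entry.passes, entry.show_in_output)
```

In this sample the Pipistrellus entry (15 passes from two species) would appear in the output, while each Nyctalus entry duplicates a single species row and would only be used in the reference dataset.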

PaulLintott commented 8 years ago

I'm happy with the technical requirements/changes above. One final point worth considering is whether we want to capture data under the Nyctaloid classification. For example, if we had one night which contained:
5 passes of noctule
2 passes of leislers
1 pass of serotine
then this would be captured at the genus level as 7 passes of Nyctalus; however, we could also capture it as 8 passes at the Nyctaloid level.

Alternatively, if we just recorded noctule (10) and leislers (5) then this could be classified as 15 passes at both the Nyctalus and Nyctaloid level.

I think there is a slight problem in that serotines only have a southern distribution, so we might have Nyctaloid records (comprised of noctule/leisler) from northerly areas which aren't true records of Nyctaloid activity. However, assuming users stratify by location, this shouldn't be a problem as they would only be comparing Nyctaloid records from a similar region where Nyctaloids were known to be present. It'll mean we have erroneous records in the dataset (i.e. northern records of Nyctaloids) but these shouldn't come into play?

johnvanbreda commented 8 years ago

I've implemented most of this in terms of the summing of passes at genus level - please can you check the analysis output now before I put things live, as it was quite a big change. For the Nyctaloid output, I've coded it so that if we make Nyctaloid a parent of both Nyctalus and Eptesicus in the species list data on the warehouse, then it should be output as you suggest in the first 2 paragraphs of the comment above. If you have some noctule records in the south, then we will autogenerate a Nyctaloid reference entry whether or not there are serotines present. This is necessary to allow a comparison against a Nyctaloid level analysis. I don't see this as any different to doing the same for northerly records where serotines definitely won't be present. You also have to bear in mind that you might want to analyse against Nyctaloid using a point in the north of the distribution of serotines as your analysis point - in this case you need to pick up the Nyctaloids (i.e. Nyctalus) north of the dividing line.
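The parent-based roll-up described here can be sketched as follows. The `PARENT` mapping, `ancestors` and `rollup` are hypothetical names standing in for the species list hierarchy on the warehouse, and the sketch only handles rows uploaded at species level.

```python
from collections import defaultdict

# Hypothetical parent links, mirroring the idea of making Nyctaloid the
# parent of both Nyctalus and Eptesicus in the species list.
PARENT = {
    "Noctule": "Nyctalus",
    "Leisler's": "Nyctalus",
    "Serotine": "Eptesicus",
    "Nyctalus": "Nyctaloid",
    "Eptesicus": "Nyctaloid",
}

def ancestors(taxon):
    """Yield each parent of a taxon, walking up to the top of the hierarchy."""
    while taxon in PARENT:
        taxon = PARENT[taxon]
        yield taxon

def rollup(night):
    """Sum species-level passes into every ancestor group for one night."""
    totals = defaultdict(int)
    for taxon, passes in night:
        for ancestor in ancestors(taxon):
            totals[ancestor] += passes
    return dict(totals)

first = rollup([("Noctule", 5), ("Leisler's", 2), ("Serotine", 1)])
second = rollup([("Noctule", 10), ("Leisler's", 5)])
print(first["Nyctalus"], first["Nyctaloid"])    # 7 8
print(second["Nyctalus"], second["Nyctaloid"])  # 15 15
```

Because the Nyctaloid total is generated from whatever children are present, a night with only noctule and Leisler's records still produces a Nyctaloid entry, which matches the behaviour described for records north of the serotine's range.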

PaulLintott commented 8 years ago

Hi John, it looks like we have quite a few glitches after the updates. For example, I've just uploaded some test data for both Nyctaloid and Pipistrellus pipistrellus. The output we received when analysing the common pip (with all levels of stratification turned off) is:

"There were 14 nights of surveying, all of which were classed as moderate/high activity with 0 passes and a percentile of 72. This was calculated by comparing this import with 90 records of nightly activity all recorded and using the same pass definition."

We noticed that when you added the option of stratifying by make yesterday it took a while to upload, so we'll keep checking and running data through to see if it fixes itself.

Thanks

johnvanbreda commented 8 years ago

Hi Paul, I think what happened is that when you visited the http://192.171.199.237/index.php/scheduled_tasks?tasks=ecobat link, there was a glitch in my new code so the import was not properly processed. I've now fixed that and run the link properly, so the output is now working OK I think.

PaulLintott commented 8 years ago

Hi, that looks good - thanks. The output for common pipistrelle looks good. It still looks a bit stunted for Nyctaloid, but that could be a product of a low sample size?

"There were 6 nights of surveying, all of which were classed as moderate/high activity with 21 passes and a percentile of 67. This was calculated by comparing this import with 45 records of nightly activity all recorded and using the same pass definition."

Is it missing a few terms (i.e. median?)

Thanks

johnvanbreda commented 8 years ago

I've fixed that - it was treating it the way it would if all the activity levels were identical which is not the case.

PaulLintott commented 8 years ago

Thanks - happy with all of the output we are now producing

Indicia-Team / ecobat

Genus level data collation #3