Addition of records not in eBird dataset

Louis-Backstrom commented 5 years ago

@dbl3raf and I have been discussing what we should do with some of our specimen records. eBird protocol technically forbids dead birds, so a lot of our specimen records have been (or will be) invalidated and won't turn up in the dataset when we download it, unless we can convince the moderators to allow them through.

One possibility we had considered is emulating the eBird data format and creating a list of extra records to append to the download each time we update it, so that the Atlas reads them as eBird data. This would probably be pretty easy; the only concern I have is with trying to actually edit the 300MB text file.

The other possibility, which probably requires more back-end work, is to create a new data file (same format etc. as the main one) with the additional records, and just have the Atlas read both of them. Would this be easy to do @jeffreyhanson ?

jeffreyhanson commented 5 years ago

Hmm, I'm not sure what the best option is. I think it probably makes sense to keep the eBird records and the specimen records (hereafter, I'll call this "specimen dataset", but let me know if there's a better name?) in separate files so it's easier to keep track of things. But I'm not sure if it makes sense to force the specimen dataset into the same format as the eBird dataset - I imagine a lot of the eBird columns will have missing values in the specimen dataset? Also, are there columns that should be in the specimen dataset that are not present in the eBird dataset?

Also, how should the atlas display the specimen data? I suppose these records could simply be merged with the eBird data after the two datasets are imported in the code, but then I think this would make interpreting the maps and graphs more difficult (e.g. adding specimen records will change what "reporting rate" actually means). Or should the specimen data be shown in its own set of graphs and/or maps?

Louis-Backstrom commented 5 years ago

The specimen dataset certainly doesn't have to be in the same format as eBird - I just figured that would be easier for you to add it in that way. Most eBird entries don't use all the columns anyway.

I can't think of any columns that should be in the specimen dataset that aren't present in the eBird dataset, but @dbl3raf might have some ideas.

I think it's best if they're all just treated as normal data. For pretty much every species we're actually putting in specimen records for, the specimen will be the only record of that species and it will be treated as a vagrant, so graphs etc. won't show anyway. Plus, all of them will be treated as non-complete data (i.e. incidental), so even if we have graphs etc. they will be handled like the rest of the data that are not complete. Hope that makes sense.

jeffreyhanson commented 5 years ago

Ah ok. If I understand correctly, the specimen data just need to be appended to the eBird data as Incidental records, and should appear in all tables/maps/graphs that use Incidental records? If so, then this should be relatively straightforward. I can have the code automatically insert extra columns as needed to merge with the eBird data internally - that's really easy - so I think it might be easier to just have the specimen data contain only the columns which contain useful information (i.e. so we avoid columns with NA/missing data values)?
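The merge described above can be sketched roughly as follows. This is only a minimal illustration, not the atlas code: the column names, the `append_specimens` helper, and the example rows are all assumptions made up for the sketch.

```python
# Assumed subset of eBird-style columns; the real download has many more.
EBIRD_COLUMNS = [
    "COMMON NAME", "OBSERVATION DATE", "LATITUDE", "LONGITUDE",
    "PROTOCOL TYPE", "ALL SPECIES REPORTED",
]


def append_specimens(ebird_rows, specimen_rows, columns=EBIRD_COLUMNS):
    """Return the eBird rows followed by specimen rows padded to `columns`."""
    merged = [dict(row) for row in ebird_rows]
    for row in specimen_rows:
        # Insert any column the specimen file leaves out as an empty value.
        padded = {col: row.get(col, "") for col in columns}
        # Treat every specimen as an incidental, non-complete record.
        padded["PROTOCOL TYPE"] = "Incidental"
        padded["ALL SPECIES REPORTED"] = "0"
        merged.append(padded)
    return merged


# Made-up rows, purely for illustration.
ebird = [{
    "COMMON NAME": "Species A", "OBSERVATION DATE": "2019-05-01",
    "LATITUDE": "-27.47", "LONGITUDE": "153.02",
    "PROTOCOL TYPE": "Traveling", "ALL SPECIES REPORTED": "1",
}]
specimens = [{"COMMON NAME": "Species B", "OBSERVATION DATE": "1975-03-12"}]

merged = append_specimens(ebird, specimens)
```

With this approach the specimen file only needs the columns that carry information; everything else is filled in at merge time.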

Do you have any specimen data ready so that I could try updating the atlas code? If so, how big is it? If it's <10 MB, could you please add it to the data folder, under a new sub-folder?

Louis-Backstrom commented 5 years ago

Yes, that's essentially right - treating them as incidental records would be fine. At the moment they're technically entered under eBird's Historical protocol (with the "complete checklist" box unchecked) - but I don't know what the functional difference to us is.

I don't have any data ready (other than what's already been entered into eBird, but that won't turn up in the dataset for a month or two) - I can get a small play set ready over the next few days. At the moment though, it seems the records have all been revalidated so maybe this won't be a problem. It would be good to know for the future though, so we could easily incorporate other datasets that we're not allowed to enter into eBird, or that would be painful to enter fully (e.g. QWSG data, government datasets, etc).

How does all of that sound? I can probably get a 25 line dataset ready for the weekend.

dbl3raf commented 5 years ago

This will be very useful functionality - I can envisage several uses for such an appended dataset that we maintain outside of eBird, but join with the eBird import each month.

jeffreyhanson commented 5 years ago

Ok - yeah, that sounds great. @Louis-Backstrom, if you can get a small example dataset together that would be fantastic.

Louis-Backstrom commented 5 years ago

Alright, no problem.

As an experiment, I've just submitted a data request via ALA for all the QM's specimen records from within the Brisbane LGA. Perhaps we can try and get that to work as a start? I'm anticipating a number of problems, namely taxonomic (ALA taxonomy seems to be a complete mess, and certainly isn't the same as eBird). It's about 3000 entries.

I'll work on a smaller mini-dataset that's based off eBird's (probably using the same specimen dataset but just for a dozen or so entries) over this week too.

Edit: There's now a second .zip file in the data\records folder which has the ALA dataset mentioned above. If you want to try getting all of those in (as incidental records) that would be awesome @jeffreyhanson . A couple of them will possibly be duplicates of records already in the eBird dataset, but that's not a problem at the moment.
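If the duplicates ever do become a problem, one rough way to flag them is to match on species, date, and rounded coordinates. The sketch below is an assumption-laden illustration only: the column names, the `duplicate_key` and `flag_duplicates` helpers, the rounding precision, and the example rows are all hypothetical, and it ignores the taxonomy differences between ALA and eBird mentioned above.

```python
def duplicate_key(row, precision=3):
    """Build a comparison key from species, date, and rounded coordinates."""
    return (
        row["COMMON NAME"].strip().lower(),
        row["OBSERVATION DATE"],
        round(float(row["LATITUDE"]), precision),
        round(float(row["LONGITUDE"]), precision),
    )


def flag_duplicates(ebird_rows, extra_rows, precision=3):
    """Return the extra rows whose key already occurs in the eBird rows."""
    seen = {duplicate_key(row, precision) for row in ebird_rows}
    return [row for row in extra_rows if duplicate_key(row, precision) in seen]


# Made-up rows, purely for illustration.
ebird = [{
    "COMMON NAME": "Species A", "OBSERVATION DATE": "1975-03-12",
    "LATITUDE": "-27.4701", "LONGITUDE": "153.0211",
}]
extra = [
    {"COMMON NAME": "species a ", "OBSERVATION DATE": "1975-03-12",
     "LATITUDE": "-27.4699", "LONGITUDE": "153.0209"},
    {"COMMON NAME": "Species B", "OBSERVATION DATE": "1975-03-12",
     "LATITUDE": "-27.4699", "LONGITUDE": "153.0209"},
]

duplicates = flag_duplicates(ebird, extra)
```

Rounding to three decimal places is an arbitrary choice here; flagged rows would still need a manual check before being dropped.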

(This appears to have broken the build - I'm assuming that's kind of expected)

Louis-Backstrom commented 5 years ago

Right, I've made a tiny (1 entry so far, I might add more if I have time) additional dataset in fd43c82. It follows the general eBird format just with a whole bunch of extraneous columns removed.

I'm not really a fan of having a separate dataset, for a few reasons.

1: I think it will be confusing to users to have not everything coming straight off eBird (especially in the case of the current record in the file, Stejneger's Petrel currently does not have a record in Brisbane - the Mooloolaba record is "inaccurately" put on the pelagic hotspot, in Sunshine Coast waters), so it will be hard to trace records back to their origin.

2: eBird is just so useful to us, so I think we want as many records as possible (ideally all) on that platform.

3: eBird is a much easier (imo) platform to enter records onto - the manual creation of new records for a file such as this is painstaking and will probably result in errors. For other instances where we're just adding an existing database to the project (e.g. the ALA one added earlier this week) that's less of an issue, but still it would be much nicer if we could just bulk import the database into eBird (but that would be tricky).

Thoughts?

dbl3raf commented 5 years ago

On reflection @Louis-Backstrom I think you're right - if we can't get records accepted into eBird then perhaps we should just consider them unproven. A big import is relatively easy, but personally I think we should only add significant records to eBird, and not unquestioningly import large numbers of records from, e.g. QM specimens. Many of these could have been picked up dead, we don't know how accurate the coordinates are, etc.

Perhaps then we should hibernate this issue? We can keep discussing it, but Jeff doesn't need to work on it

Louis-Backstrom commented 5 years ago

Yep, agreed. I guess the only concern is if eBird decides to suddenly invalidate all our notable specimen records or something, then we need a backup database.

I agree that we should have eBird cover all the main distribution data etc. and just have notable records added if needed - not whole databases, as it's hard for us to vet them all. Even something like QWSG's database (should we wish to use that), while presumably at an equally high if not higher standard than eBird, would mess up our reporting rate info as it would unfairly boost shorebirds. Currently, if I understand correctly, the assumption (however wrong) is made that reporting effort is distributed evenly across the region, so the overall reporting rate is a half-decent approximation for commonness.
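To make the shorebird concern concrete, here is a small sketch. It assumes reporting rate means the fraction of complete checklists that include a species, which may not be exactly how the atlas computes it; the `reporting_rate` helper and all the checklists are made up for illustration.

```python
def reporting_rate(checklists, species):
    """Fraction of complete checklists that report `species`."""
    complete = [c for c in checklists if c["complete"]]
    if not complete:
        return 0.0
    hits = sum(1 for c in complete if species in c["species"])
    return hits / len(complete)


# Made-up general birding checklists: the shorebird is on one of four.
general = [
    {"complete": True, "species": {"Shorebird X", "Species A"}},
    {"complete": True, "species": {"Species A"}},
    {"complete": True, "species": {"Species A", "Species B"}},
    {"complete": True, "species": {"Species B"}},
]
# Made-up targeted shorebird counts: the shorebird is on every checklist.
shorebird_surveys = [
    {"complete": True, "species": {"Shorebird X"}} for _ in range(4)
]
# A made-up incidental (non-complete) record, like the specimen records.
incidental = [{"complete": False, "species": {"Shorebird X"}}]

rate_general = reporting_rate(general, "Shorebird X")
rate_with_surveys = reporting_rate(general + shorebird_surveys, "Shorebird X")
rate_with_incidental = reporting_rate(general + incidental, "Shorebird X")
```

Under these assumptions the targeted surveys raise the shorebird's rate from a quarter to well over half, while the incidental record leaves it unchanged, which is why a few notable incidental records are much less disruptive than a whole survey database.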

Anyway, I think our priority should be to get as many records onto eBird as possible - especially for notable species - and work from there, and only try and get Jeff to work out secondary datasets should we actually require it. I'll hibernate this issue for now.

bird-team / brisbane-bird-atlas #126