SnapshotSerengetiScienceTeam / DataManagement

Scripts and issues to manage the SnapshotSerengeti images and metadata.
GNU General Public License v2.0

Prepping S8 #24

Closed mkosmala closed 9 years ago

mkosmala commented 9 years ago

From the email thread:

@palme516: I'm looking at the S8 cleaning output file. The columns seem to be different from those in S7. The one from S7 is in shared/TimeStampCleaning/CleanedCaptures. Specifically, after "path", in previous seasons, there was "newtime","invalid","include". I think your "NA" should likely be "newtime". "invalid" is correct. "sr" is redundant and doesn't need to be output for this final file. "timestamp" and "timez" should also not be output for this file. @aliburchard, do you remember what the "include" column was supposed to indicate? Was it just whether "invalid" was zero or not?

It turns out I haven't yet written a script to put a cleaned capture file into the database -- because we didn't have clean captures for S1-6! It will be straightforward to write one for S7 and S8, but it will be helpful to have the file formats be the same, so the same script can be used on both.

I count 521 rolls represented in the S8 cleaning file. Does that mesh with your count of roll directories, @palme516? It's good to double-check that the scripts got everything.

A couple other things, for both @palme516 and @aliburchard :

1) I wanted to sanity check some of the images marked as invalid, and so I looked at the first one that had a lot of images: R13_R1. And while something weird clearly happened around image 691, I'm not sure why these are marked as invalid. The timestamps look reasonable, and spot-checking images after this one, it looks like the images are fine (if not the best lighting quality). Can we get the script to output the reason images are marked as invalid? Or does this already happen in a log file produced by the scripts?

2) I also sanity checked the number of images per capture and invalid statuses. All captures with more than 3 images have invalid status of 0 -- including 23 captures that have >10 images, most of which are on the same roll (L10_R1). Do the cleaning scripts check for these? Maybe they should?

And here's the summary of captures from this file:

invalid code num captures
0 359,987
1 1,873
2 12
3 352
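The two checks described above (captures per invalid code, and captures with more than 3 images) could be sketched in Python roughly as follows. The column names `site`, `roll`, `capture`, and `invalid` are assumptions for illustration; the real header of the cleaned file is not shown in this thread, and the sample rows are made up.

```python
import csv
import io
from collections import Counter


def summarize_captures(rows, max_images=3):
    """Return (captures per invalid code, oversized captures).

    rows: iterable of dicts, one per image, with the hypothetical keys
    'site', 'roll', 'capture', and 'invalid'.
    """
    images_per_capture = Counter()
    code_of_capture = {}
    for row in rows:
        key = (row["site"], row["roll"], row["capture"])
        images_per_capture[key] += 1
        # Assumption: the first image's code stands for the whole capture.
        code_of_capture.setdefault(key, row["invalid"])
    per_code = Counter(code_of_capture.values())
    oversized = {k: n for k, n in images_per_capture.items() if n > max_images}
    return per_code, oversized


# Tiny made-up sample, not real S8 data.
sample = io.StringIO(
    "site,roll,capture,invalid\n"
    "L10,R1,1,0\nL10,R1,1,0\nL10,R1,1,0\nL10,R1,1,0\n"
    "R13,R1,7,1\n"
)
per_code, oversized = summarize_captures(csv.DictReader(sample))
print(per_code)   # captures per invalid code
print(oversized)  # captures with more than max_images images
```

Counting per capture rather than per image keeps the summary comparable to the table above, which lists captures.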

meredithspalmer commented 9 years ago

Off the top of my head, I don't recall what the "include" column was supposed to contain. Apologies - I didn't have a copy of the S7 output (and couldn't find it on S7) to compare the S8 output to before sending it along to you, but I'll crawl through the script again to try and figure out why the columns are different and get back to you. Could you send me a copy of S7, or let me know where it is on MSI? Also -- the 7th should be fine for cleaning up MSI. I'll be working all day, so buzz me whenever you have time. I'll also recount the rolls and look at what's going on in R13_R1.

mkosmala commented 9 years ago

The one for S7 is at shared/TimeStampCleaning/CleanedCaptures.

I'm pretty sure I asked @aliburchard to add an "include" column when we were working on S1-6, but I don't remember why. I'm hoping she does!

meredithspalmer commented 9 years ago

Perfect, got it!

aliburchard commented 9 years ago

Heya - The "include" column was whether or not images should be sent to Zooniverse for identification. Anything that wasn't an "invalid = 1" would have received an affirmative include value. It was just to simplify the dataframe that I sent your way.

As far as the other bits: The flagging script should indicate why (roughly) captures are flagged as invalid - timestamp out of range, or negative time lag (e.g. image 2 is timestamped before image 1). We could probably tweak the scripts to carry this through to the final dataframe.
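For illustration only, here is a small Python sketch of how the two flag reasons named above could be carried alongside each image. This is not the existing flagging script; the function name, the reason strings, and the season bounds in the example are all assumptions.

```python
from datetime import datetime


def flag_reasons(timestamps, season_start, season_end):
    """Return one reason string (or None) per image, in roll order."""
    reasons = []
    previous = None
    for ts in timestamps:
        if not (season_start <= ts <= season_end):
            reasons.append("timestamp out of range")
        elif previous is not None and ts < previous:
            # This image is stamped earlier than the one before it.
            reasons.append("negative time lag")
        else:
            reasons.append(None)
        previous = ts
    return reasons


# Made-up roll: the second image is stamped before the first,
# and the third falls outside the (assumed) season window.
roll = [
    datetime(2013, 10, 1, 8, 0, 0),
    datetime(2013, 10, 1, 7, 59, 50),
    datetime(2020, 1, 1, 0, 0, 0),
]
reasons = flag_reasons(roll, datetime(2013, 9, 1), datetime(2014, 7, 31))
print(reasons)
```

Writing the reason next to the invalid code would let the final dataframe answer "why is this invalid?" without going back to a separate log.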

@palme516 my memory is a bit fuzzy on this, but I believe the flagging script outputs a file of flagged events that you then use as a spreadsheet to address each flagged event and record a set of actions. Do you have that (or another record of the various things that were flagged, comments on what was going on, and why you decided to do what)? If not, we should definitely be doing that...

@mkosmala the original flagging script yelled about captures >3 images. Since the capture event creation script could accommodate captures >3 images, I either deleted that or downgraded it so that it was a sort of ignorable error, because there were a handful of captures with >3 images that simply were the result of 2 triggers within our 15 second time-delay window. @palme516 -- does the flagging script still flag captures with >3 images? If not, do you think you can add that to the script?

Note, however, that it might not be worth investing a whole lot of time into the script other than as an exercise in coding. It might be best to rewrite in Python. Or, depending on future funding (i.e. if there is no S10), deal with the idiosyncrasies manually for S8 and S9.

mkosmala commented 9 years ago

Great, thanks, @aliburchard. So about the captures >3 images, I don't care too much about ones up to about 10 images, but it kinda breaks the SS interface to have 50+ images in a capture. I haven't looked at the images to see if they really do belong together (they have the same timestamp), but I'm guessing it should at least be looked at. @palme516 Can you spot-check some of the >3 image captures with tons of images to see what's up?

And one thought on R13_R1, before I've had a chance to look at the flagging notes: I noticed that some images appeared to be missing. Very possibly it's because they were videos. If that's the case, then, @aliburchard, does the flagging script invalidate all images after a video/missing image?

Okay, so @palme516:

  1. No need for an "include" column. It was useful for S7, but I won't need it for S8.
  2. The outputs from the various flagging scripts and whatever to-do spreadsheets that exist should end up at MSI so that we have them as reference. (Where? We can talk about that next week... Meanwhile maybe you can email them to me?)

meredithspalmer commented 9 years ago

@mkosmala Apologies for taking so long to get these files to you -- my internet connection this week has been absolutely pathetic. Here's what I've got so far:

I'll go through and spot-check the 3+ images, particularly the R13_R1 roll, and let you know later this week when I have a better connection.

mkosmala commented 9 years ago

Hey, @palme516 and @aliburchard, where are we on this? Feels like S8 is a bit stalled... Is anyone waiting on me for something? Are we almost there?

aliburchard commented 9 years ago

Hiya - Sorry, I've also been out of the loop the last couple weeks and am now just catching up.

I think we are waiting on @palme516 to sanity check some of the funky invalidations? From this thread, it looks like there were a number of really large capture events that went through flagging and Meredith was going to spot-check those.

@mkosmala you noticed a number of images that appeared to be missing, but the flagging output hadn't yet been uploaded to see why.

Also, @mkosmala you asked me whether all the images after video files are invalidated in the flagging script. They shouldn't be (though @palme516 can you check the script? Maybe there is a bug. But there shouldn't be.)

I can't get my VPN to work at the moment, so I can't get onto MSI and check the flagging script just now. I'll try again tomorrow at the office. We're getting a lot of interest from users (and Darren McRoy is super persistent about asking when S8 is going live), so let me know if there is anything waiting on me too.

Cheers, ali

aliburchard commented 9 years ago

Actually, @palme516 @mkosmala - think we could promise the launch by the end of January? I'll check with Michael, but pretty sure image ingestion will go pretty quickly - so as long as we can get our end sorted by, say, Jan 20th or so, we should be able to get them online by Jan 31.

meredithspalmer commented 9 years ago

@aliburchard, @mkosmala -- the reasons for the funky invalidations are listed in the S8 Action List, which I uploaded to MSI last week. I did go back just now and give them another look over (apologies for not getting this to you right away), but what seems to be happening is what's mentioned in the Action List: there's just a TON of video, especially in R13_R1, that had to be removed. I don't think there's anything you can do about that.

I don't believe the flagging script marks images after videos as invalid. But, for example, in R13_R1, when the camera started taking video, there were only 3 pictures interspersed in all that non-data (and they were misfires) - so I marked that section (which continued until the end of the roll) as invalid because I didn't think we could legitimately count the camera as "on" (functional for data collection).

meredithspalmer commented 9 years ago

Okay, I'm looking at the cleaned captures and going through the hard-drive for L10_R1 -- you're right, @mkosmala - there's some large captures in there. Looking at the images, the photos are taken in bursts of three, but the timestamp is sticking for multiple capture events in a row. From looking at the pictures, the "stick" that occurs at 1675 significantly messes up the timestamps for the subsequent images in the roll. I'm not sure if we can fix any of the subsequent capture events, but I do believe that most of this roll should be marked as invalid.

aliburchard commented 9 years ago

@palme516 Awesome, thanks for that.

I'm fine with marking the three R13_R1 misfires as invalid, and I agree that the camera doesn't really count as functional during that time (unless @mkosmala has any objections?).

Two thoughts: 1) Cameras that get triggered to video still take stills at night if they trigger. So we could in theory have gotten nighttime captures during that time, but not daytime (because they are video). But because of that bias, I think it's fine to just call the camera quits there.

2) I just realized that there's an excellent chance that in the next year we could get these videos put up on a Snapshot Serengeti: video edition. However, since we calculate camera activity dates as the dates on the first and last subjects, if we wanted to integrate video, we could add them into the DB (marked as video images) and write the "activity dates" query to include those.

mkosmala commented 9 years ago

@palme516 and @aliburchard : I checked R13_R1 just now, too. There are 67 captures marked as invalid (not 3, Meredith, how did you get that number?) They all have nighttime timestamps and are single-image captures. So there you go. I'm fine with them being "invalid" since calculating camera trap effort for this time would be incorrect since it was technically only operational at night. (Although, I do hope we don't have anything awesome in there!)

@palme516 : Okay, can you check the other >3 image captures by hand and see if there are other things that need to be invalidated? Also, can you both figure out why the scripts missed these large captures?

@aliburchard : yes, I think we could promise by end of January. I think it's just these weird large captures that need attention and then we're ready to go.

aliburchard commented 9 years ago

Follow up:

I was able to get on MSI finally. On the upside, I see there are very few actions for S8! That's fantastic.

@palme516 , I'm pretty sure that the "First" column (that contains text in the format "S8_C06_R") is an error -- that it should be IMAGE numbers of the first image in the bad image sets -- but because we renamed image files to include season, site, and roll, the name is being split up and reporting that. Doesn't look like it was too much of an issue for the errors you had to figure out, but can you review the scripts to make sure they are updated to handle the new filename structure? I'll open a new issue on GitHub and assign you. I'd recommend checking any functions in the scripts that call image filenames...
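As a rough illustration of the kind of fix meant here, the Python sketch below pulls the image number from the end of a filename instead of splitting on the first separator, so a season/site/roll prefix no longer gets in the way. The naming scheme `S<season>_<site>_R<roll>_<camera name>` and the example filenames are hypothetical; the actual renamed format is not spelled out in this thread.

```python
import os
import re

# Hypothetical prefix added by the rename: S<season>_<site>_R<roll>_
PREFIX_RE = re.compile(r"^S\d+_[A-Z]\d{2}_R\d+_")


def image_number(filename):
    """Return the trailing image number from an old- or new-style name."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    stem = PREFIX_RE.sub("", stem)  # drop the prefix if it is present
    digits = re.findall(r"\d+", stem)
    if not digits:
        raise ValueError(f"no image number in {filename!r}")
    return int(digits[-1])


# Made-up filenames in the assumed new and old styles.
print(image_number("S8_C06_R1_IMAG0691.JPG"))
print(image_number("IMAG0691.JPG"))
```

Keeping this logic in one helper means any script that reads image filenames can call the same function rather than re-implementing the split.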

Also, I notice that for J06_R3, you said that you don't have the date that roll was collected in the field data. Why not? Did you or Norbert not record it? Or did you not download the field entry DB when you returned to MN? If it's the latter, can you just email Norbert to find the date? It's a pretty quick fix -- no reason to send the images to invalid purgatory if there is a quick fix. (Given that Invalids 2 and 3 will likely never be addressed because they are really really HARD. But we are keeping them in case.)

Another suggestion would be to do some simple visualizations and sanity checking after you've run the corrections -- which is how I suspect Margaret discovered the massive L10_R1 captures. For example, look at the resultant capture data frame, look at the min and max dates, look at max(captures), etc etc. I don't know why L10_R1 wasn't caught, but it's for those reasons that it's good to sanity check manually.

In sanity checking L10_R1, I see there are 16 captures with 4-12 images, and 18 captures with more than 12 images in them. We should try and figure out why these didn't get flagged as bad captures, but @palme516 can you go through these specific capture events and see what's up?
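A minimal Python sketch of these quick checks, assuming each image is available as a `(capture_id, timestamp)` pair; that input shape, the function name, and the size bins are illustrative choices, not part of the existing scripts.

```python
from collections import Counter
from datetime import datetime


def sanity_summary(images):
    """images: list of (capture_id, timestamp) tuples, one per image."""
    sizes = Counter(capture_id for capture_id, _ in images)
    stamps = [ts for _, ts in images]
    return {
        "min_date": min(stamps),
        "max_date": max(stamps),
        "max_images_per_capture": max(sizes.values()),
        "captures_4_to_12": sum(1 for n in sizes.values() if 4 <= n <= 12),
        "captures_over_12": sum(1 for n in sizes.values() if n > 12),
    }


# Made-up data: capture 1 has 3 images, capture 2 has 5.
images = [(1, datetime(2013, 9, 15, 6, 0, s)) for s in range(3)]
images += [(2, datetime(2014, 7, 1, 18, 30, s)) for s in range(5)]
summary = sanity_summary(images)
print(summary)
```

A glance at the date range and the largest capture is usually enough to tell whether something like a stuck timestamp slipped through.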

aliburchard commented 9 years ago

@palme516 @mkosmala - what if we marked R13 images as Invalid2? That way they will still be sent to Zooniverse. They'll still be discarded from analyses and calculations of camera effort, but leave open the possibility of using later. This is actually a problem for all rolls that go to video, so might be worth devoting some effort down the line -- say, images from these cameras could be used for nighttime-only analyses...

mkosmala commented 9 years ago

Sounds good to me.

meredithspalmer commented 9 years ago

Margaret is right about R13_R1 - my apologies.

I think roll L10_R1 is shot, and that images starting with 168 should be marked INVALID1. L10_R1 is the roll with multiple bad captures of extremely long lengths, throwing the timestamps off for the entire roll multiple times.

Regarding L10_R1 and the other bad (>3) captures, I'm emailing out a table with the image paths, their current invalid status, my suggested invalid status after looking at the images, and comments on why. Fortunately, there's very few bad captures outside of roll L10_R1, but they all come from the camera at L10 (R2 and R3). I might email Norbert and see if he can pull the camera down and replace it with one that isn't having these issues. With each of these three rolls, there's a point where the bad captures mess up the timestamp for the entire roll. I've commented on the specifics in the file I'll send out - I think we can estimate the number of hours to add to some pictures for R2 or R3, but I don't think there's any way to know for certain.

Today I'll go through the flagging scripts to see why L10_R1 didn't pop up on the ActionList (and fix the "first" column, if I can).

As for L06_R3, I did bring back a new database from Serengeti, and this roll didn't have an entry. I've been asking Daniel to resend me the updated database for a month or two now, but when I did get an "updated" version, it was the same one I had entered several previously. Daniel's now on vacation and hasn't answered any of my emails for the last week or so. I'll try emailing Norbert to see if he can figure out how to get me a copy of the new database. Also, I did get the following email from Norbert about several rolls I had questioned him about:

Analysed from data base = 30/1/14 PO4_R1
                        = 16/04/14 PO4_R2
                        = 21/06/14 P0_R3

from data sheet         = 30/01/14 PO4_R1
                        = 16/04/14 PO4_R2
                        = 21/04/14 P04_R3

from extenal driver     = 30/01/14 PO4_R2 =R1
                        = 16/04/14 PO4_R3=R2
                        = 21/06/14 PO4_R4=R3

mistake when rename sorry for that meredith

from the data base      = 7/02/14 MO7_R1 in this round ct &cd card was broken couldn't upload&copy
                        = 24/02/14 MO7_ replaced new
                        = 01/05/14 MO7_R2
                        = 10/07/14 MO7_R3

from data sheet         = 07/02/14 M07_R1cd card&ct was broken
                        = 24/02/14 MO7_New ct
                        = 01/05/14 M07_R2
                        = 10/07/14 M07_R3

from external driver    = 01/05/14 M07_R3=R2
                        = 10/07/14 M07_R3 = did mistake when rename&copy

From this, it looks like we'll have to rename a couple of rolls.

aliburchard commented 9 years ago

I'll work on adding a check for O vs 0 to a rename script. @palme516 Have you now renamed these rolls?
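One possible shape for such a check, sketched in Python. It assumes a site code is always one letter followed by two digits (as in P04 or M07), so a letter "O" in a digit position is taken to be a typo for zero; that assumption, the regular expression, and the function name are illustrative and not taken from any existing rename script.

```python
import re

# Assumption: roll names look like <letter><two digits>_R<roll>, e.g. P04_R1.
# The digit positions may contain a mistyped letter O.
ROLL_RE = re.compile(r"^([A-Z])([0-9O]{2})_R(\d+)$")


def normalize_roll_name(name):
    """Replace a letter O in the digit positions of the site code with 0."""
    m = ROLL_RE.match(name)
    if not m:
        # Anything that does not fit the pattern is left for a human to check.
        return name
    letter, digits, roll = m.groups()
    return f"{letter}{digits.replace('O', '0')}_R{roll}"


for raw in ["PO4_R1", "MO7_R2", "P04_R3", "P0_R3"]:
    print(raw, "->", normalize_roll_name(raw))
```

Names that do not match the pattern at all (such as `P0_R3`) are returned unchanged on purpose, since guessing the missing digit would be riskier than flagging it for review.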

I'd also suggest keeping as close an eye as possible on the camera trap data while you're actually in the field (especially since presumably you're doing most of the sites and Norbert/Daniel are only doing them when you're not there). As you can see, it's incredibly hard to figure out what's happened by emailing Daniel/Norbert.

I know it's really hard to maintain, but I'd suggest having a weekly or bi-weekly phone call with Norbert and Daniel to keep on top of things and ensure back-ups are happening, issues are being addressed, etc.

meredithspalmer commented 9 years ago

Okay, just got around to this this morning -- I've renamed the directories and file names for all folders/images in P04_R2, 3, 4, (-> P04_R1, 2, 3) and M07_R3 (-> R2) on both MSI and in the "S8_cleaned.csv" document. @mkosmala, I've re-uploaded this file onto MSI.

meredithspalmer commented 9 years ago

Yeah @aliburchard, that's a good idea. I'm supposed to call Daniel this week regarding another issue, and I'll set up some time with him to talk at least a few times a month...

mkosmala commented 9 years ago

Looking through the updated S8_cleaned.csv file now...

1) @palme516, there are blank lines here and there. And some come in the middle of captures. Why is that? What's going on?

2) @aliburchard, did you ever get a chance to see if captures with two different invalid codes appeared in S1-6? I ask, because the metadata database only assigns invalid codes to captures, not images. If some images are marked 0, and others are marked 3, they'll still all be put in one capture and that capture might be marked 0 or 3, depending on whether my script used the first image's validation code or that of the last image. I think it would be best to split large captures into two captures, one of which is 0 and one of which is 3. I suppose this can be done in my script that imports into the database, but it means that capture numbers won't match the "cleaned" files. Can we do it somewhere in the cleaning scripts?

Other than that, I can't find any remaining problems. Here's a summary of the invalid images. @palme516, does this feel right to you?

site num images
C06 934
E13 3
J06 30
L02 2
L06 1
L10 5739
L11 1
Q05 1
R07 1
R13 67
S07 6
S08 5191
Total 11976

aliburchard commented 9 years ago

@mkosmala damn. There are a small number of capture events that have multiple invalid codes. I don't know how you made your nice table, but the ugly one below shows sites, roll, capture ID with >1 invalid type, the total number of types (countinvalids, all = 2), the total pictures in that capture, and then the number in each invalid type (pix0, 1, 2, and 3).

I'm not sure what the best approach is, since all but S7 will already be in the DB. I don't love the idea of splitting the captures in the cleaning script, because the capture_creation script is supposed to create the definitive capture event ID, and the R scripts are just so fiddly. For past seasons, we might just get away with deleting these images (not ideal, I know, but if they are the ground or the inside of the car, not so bad)...I don't know what the best solution is for S8 and future seasons though.

  season site roll capture countinval totalpix pix0 pix1 pix2 pix3
1      4  G12    1       5          2        3    1    2    0    0
2      5  C04    1     476          2        2    0    1    1    0
3      5  F01    2     185          2        3    2    0    1    0
4      5  U13    3     197          2        3    1    0    2    0
5      5  U13    3     229          2        3    1    0    2    0
6      6  V10    2      34          2        3    2    0    0    1
7      7  L10    1       3          2       42    3   39    0    0
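For reference, a table like this one could be produced with a short Python sketch along the following lines. The input shape (one `(site, roll, capture, invalid_code)` tuple per image) and the function name are assumptions; the output keys mirror the column names in the table.

```python
from collections import Counter, defaultdict


def mixed_invalid_captures(images):
    """images: iterable of (site, roll, capture, invalid_code), one per image.

    Returns one dict per capture that carries more than one invalid code.
    """
    counts = defaultdict(Counter)
    for site, roll, capture, code in images:
        counts[(site, roll, capture)][code] += 1
    report = []
    for key, per_code in sorted(counts.items()):
        if len(per_code) > 1:
            report.append({
                "capture": key,
                "countinval": len(per_code),
                "totalpix": sum(per_code.values()),
                # pix0, pix1, pix2, pix3 in that order
                "pix": [per_code[c] for c in (0, 1, 2, 3)],
            })
    return report


# Made-up images reproducing the shape of the first row of the table.
images = [
    ("G12", 1, 5, 0),
    ("G12", 1, 5, 1),
    ("G12", 1, 5, 1),
    ("G12", 1, 6, 0),
]
report = mixed_invalid_captures(images)
print(report)
```

Running a check like this on each cleaned file before import would surface mixed-code captures while they can still be fixed by hand.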

meredithspalmer commented 9 years ago

@mkosmala, those seem about right except for E13 and S07... off the top of my head, I don't remember invalidating those. I'd go back and check, but as I'm re-running everything at the moment, I'll go through the output of this next run and make sure that the total invalidated images match up.

aliburchard commented 9 years ago

@palme516 @mkosmala I've had a quick glance at the newly outputted S8 captures.

All the valid dates are within Sep 2013 thru July 2014 -- does that sound about right?

These three Site/Roll/Captures have more than one invalid type, which we've realized the DB can't accommodate. How did we decide to fix these?

site roll capture
S07 2 83
L02 2 1
L06 2 1003

There are 178 captures with a duration > 2 seconds. Can you maybe just spot-check these? They are probably fine, but just to check.
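A duration check of this kind could look roughly like the Python sketch below. It is not the SanityCheck.R script mentioned in this thread; the input shape (a dict from capture ID to image timestamps) and the function name are assumptions.

```python
from datetime import datetime, timedelta


def long_captures(captures, limit=timedelta(seconds=2)):
    """captures: dict mapping capture_id -> list of image timestamps.

    Returns {capture_id: duration} for captures longer than `limit`.
    """
    result = {}
    for capture_id, stamps in captures.items():
        if not stamps:
            continue
        duration = max(stamps) - min(stamps)
        if duration > limit:
            result[capture_id] = duration
    return result


# Made-up captures: the first spans 1 second, the second spans 14 seconds.
captures = {
    "E13_R2_681": [datetime(2014, 3, 1, 12, 0, 0), datetime(2014, 3, 1, 12, 0, 1)],
    "E13_R2_682": [datetime(2014, 3, 1, 12, 5, 0), datetime(2014, 3, 1, 12, 5, 14)],
}
flagged = long_captures(captures)
print(flagged)
```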

Looks like only one completely valid capture has > 3 images - E13, R2, capture 682. It only has 4 images, but Meredith can you check this capture if you haven't already?

Otherwise, things look good for the quick check. I'm uploading my R script to check the captures in case you want to look at it; under R-Scripts/TimeStamp-Cleaning/SanityCheck.R

meredithspalmer commented 9 years ago

@aliburchard @mkosmala Blaargh, for E13, R2, the second image of capture 681 is listed as being in capture 682. Thanks for catching that, I'll fix that up in R and reupload the cleaned caps to MSI just now.

mkosmala commented 9 years ago

@aliburchard @palme516: I think that for now, let's just go ahead and ignore the captures that have multiple invalid codes. Let's not slow down import to Zooniverse because of this. I've opened a new issue ( #43 ) to address the database import, which we can figure out later.

Meanwhile, what do we do for Zooniverse import? We want Zooniverse to slurp up images with invalid codes of 2 and 3, right? Then we should mark images in large captures that we don't want Zooniverse to slurp up as invalid code 1. If this feels bad, then maybe we need to make an invalid code 4? Is there any reason to have an invalid code of 2 or 3 on these surplus images in captures?

aliburchard commented 9 years ago

Agreed on ignoring this for now - we should try to get the images and manifest to Chris Snyder by the end of the week. No idea how long the image transfer and then ingestion will take.

@mkosmala the general approaches for invalid codes hold - we want all invalid codes 0, 2, 3 to be classified. We don't want images with invalid code 1 to be classified.

Invalid codes 2 and 3 are for images that contain information (e.g., are not 1000 photos of the ground) and might possibly be recoverable one day. Thus images in large captures that were marked as 2 or 3 are still worth identifying and we still want Zooniverse to pull them in.
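Stated as code, the rule is small. The Python sketch below filters a manifest by invalid code; the constant name, function names, and the `(path, invalid_code)` row shape are assumptions made for illustration.

```python
# Codes whose images should be sent for classification, per the rule above.
CLASSIFY_CODES = {0, 2, 3}


def include_for_zooniverse(invalid_code):
    """True if an image with this invalid code should go to Zooniverse."""
    return invalid_code in CLASSIFY_CODES


def build_manifest(images):
    """images: iterable of (path, invalid_code); keep only classifiable ones."""
    return [path for path, code in images if include_for_zooniverse(code)]


# Made-up paths for illustration.
manifest = build_manifest([
    ("a.JPG", 0),
    ("b.JPG", 1),
    ("c.JPG", 2),
    ("d.JPG", 3),
])
print(manifest)
```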

mkosmala commented 9 years ago

@aliburchard, yes, but there's still an issue we need to resolve before sending along the manifest. If a capture has images that have invalid codes like so:

image1 - invalid 0
image2 - invalid 0
image3 - invalid 0
image4 - invalid 3
image5 - invalid 3
...
image84 - invalid 3

then all those 84 images are going to be slurped up by Zooniverse and we're going to have a problem with that capture on Snapshot Serengeti -- both displaying it and interpreting the results.

I think for captures with >3 images, all images after the third must be labeled with invalid code 1 so that they don't go to Zooniverse. If we think we actually do want them to go to Zooniverse, we have to split them into multiple real captures. Otherwise we break the Snapshot Serengeti interface. If we can't split them into multiple real captures, then all images after the third should be invalid=1.
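A minimal Python sketch of this fallback rule (relabel everything after the third image as invalid 1), assuming the invalid codes of one capture are available as a list in image order. The function name is hypothetical, and the sketch does not attempt the alternative of splitting a capture into several real captures.

```python
def cap_capture(invalid_codes, max_images=3):
    """Return new codes where images beyond the first max_images become 1."""
    return [
        code if index < max_images else 1
        for index, code in enumerate(invalid_codes)
    ]


# The 84-image example from above: three valid images, then 81 marked invalid 3.
original = [0, 0, 0] + [3] * 81
capped = cap_capture(original)
print(capped[:5], len(capped))
```

Captures with three or fewer images pass through unchanged, so the function can be applied to every capture without a separate size check.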

Make sense?

aliburchard commented 9 years ago

@mkosmala hmmm good point. excellent point. damn.

So this is the only remaining hold up for sending to Zooniverse, right? And we don't have a good way to split those captures...@palme516 can you see if those images have anything in them worth keeping and doing the manual splitting? If not, we can just mark them as 1 and proceed to the Zooniverse...otherwise, perhaps as Margaret suggests, give them an Invalid Code 4 for now, which means that we will address them manually later but not import to Zooniverse now?

meredithspalmer commented 9 years ago

I've just checked through the images and it looks like it should be a fairly simple fix:

S07_R2_C83 is 12 images that contain wildebeest, but they're pretty obviously split into different capture events, so I'll go through R and manually change these each to their own CE, with the subsequent CEs being given tags of INVALID 3

L02_R1_C1 is three images, the first two of which are INVALID 1. There's nothing in the third one either, so I'll go ahead and switch that to INVALID 1 as well.

L06_R2_1003 is four images, where the extra pic is already INVALID 1, which we decided was okay, correct?

I'll have a fresh database up on MSI in just a few minutes

mkosmala commented 9 years ago

@palme516 Have you also looked at any >3 image captures that are all marked invalid 2 or invalid 3? We also don't want to have a set of 84 images that are all invalid=3...

meredithspalmer commented 9 years ago

Okay, went through and either separated out the captures or, if the extras were just misfires, marked them as invalid 1. There is now no capture event >3 where the images are marked as anything other than okay and invalid 1.

mkosmala commented 9 years ago

Yay! @aliburchard Let's get the images uploaded!

aliburchard commented 9 years ago

yay! alrighty, I'll ping Chris S to see how he's coming along with access to MSI...

aliburchard commented 9 years ago

@palme516 -- Michael Parrish noticed that large sections of E11 (rolls 1, 2, and 3) are solid black. I'm wondering if the flash is broken? Can you check that roll out quickly to see if it should be invalidated? Also, if it is the flash, I'd mark that site for replacing the camera trap.

I don't know what the best approach is for such cameras. Maybe we should consider adding a code to the roll to indicate if daytime/nighttime images are okay. For example, on rolls with no flash, daytime images are still alright. On rolls that have switched to video, nighttime images are still okay.... So if you were comparing daytime capture events across sites, rolls with no flash would still contribute valuable data....

meredithspalmer commented 9 years ago

@mkosmala Guh, yup, the flash is definitely bust. It looks like about a third or just a bit less of each of those rolls is solid black. The problem is that the camera is consistently misfiring at absolutely nothing, which is why there are so many. The daytime pictures are absolutely fine (although there's very little in them, because of the constant misfiring). Having a code for this kind of thing might be good. I've been looking at several sites' worth of data from Season 9 from my playback this last summer, and it looks like there may be a couple cameras with broken flashes. I'll spend some time today taking a look at the CT pictures from S8 to see which sites have broken flashes and compile a list for Daniel to replace, if he can.

meredithspalmer commented 9 years ago

@mkosmala @aliburchard I just had some undergrads run through pictures from each site, and there appear to be ~15 sites where the night flashes just don't work. There seem to be enough to warrant an additional Invalid code for day pics only.

There was also a roll where the night pictures were just fine, but the day pictures didn't work. I know you mentioned that videos only mess up day pictures -- would a code for night only pics be overkill?

Anyhow, I'll email Daniel the list of messed-up cameras to see if he has the resources to replace them...