SnapshotSerengetiScienceTeam / DataManagement

Scripts and issues to manage the SnapshotSerengeti images and metadata.
GNU General Public License v2.0

Import S9 and S10 into database #54

Closed mkosmala closed 8 years ago

mkosmala commented 8 years ago

@palme516 Did you rerun the cleaning scripts on S9 and S10 (in addition to S8) after you fixed them?

meredithspalmer commented 8 years ago

@mkosmala The issues we had with the S8 CEs were ones that I had a separate script for in R (the first time I had to deal with separating the CEs, so I wanted to make sure I knew what I was coding). This is where the issue you just caught/I just fixed cropped up and where it was debugged.

Seasons 9 and 10 didn't use these scripts: I modified the cleaning scripts themselves after I figured out what to do with long CEs and ran fixes from the Action List. That is to say, I updated our scripts earlier (the ones now uploaded on GitHub and MSI) and S9 and S10 were run off those and should be good.

mkosmala commented 8 years ago

@palme516 Great! Then can you create season files for S9 and S10? I made a ref doc here: https://github.com/SnapshotSerengetiScienceTeam/DataManagement/blob/master/How-to/create_season_metadata_file.txt

meredithspalmer commented 8 years ago

@mkosmala Alright! These are up for S9 and S10 now.

Note: I caught an obvious timechange error that hadn't popped in the field data or cleaning, so that is now fixed in the S9 Action List and S9_cleaned files and these have been reuploaded to MSI.

mkosmala commented 8 years ago

@palme516 Your cleaning scripts are creating two different output formats.

In the cleaned files, for Seasons 7 and 9, there is a column at the beginning with no name that numbers each line. For Seasons 8 and 10, this column is missing.

I don't really care if that column is there or not, but the format needs to stay consistent. I modified my script between seasons 7 and 8 to read files that don't have that column. So my preference would be to not have that column. But if the column is important to you for some reason, then keep it for ALL cleaned files, and I'll change my script back.

Let me know if you are going to change the cleaned season 9 to remove this column or if you are going to add the column to season 10.

meredithspalmer commented 8 years ago

I'm sorry - this is the difference between keeping or removing row numbers when I save the file from R. I don't need these, and can delete them from S9 and reupload. Doing now, should be done in a sec.
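For reference, R's `write.csv(..., row.names = FALSE)` omits the row-number column; with the default `row.names = TRUE`, the header starts with an empty column name. For a file that has already been saved, a minimal Python sketch of stripping that column could look like the following. The column names `roll` and `capture` and the function name are made up for illustration; the real cleaned files have other columns, and the actual import script is not shown in this thread.

```python
import csv
import io


def strip_row_number_column(text):
    """Drop a leading unnamed row-number column, if the header has one."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return text
    header = rows[0]
    # R writes an empty string as the name of the row-number column.
    if header and header[0] == "":
        rows = [row[1:] for row in rows]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


# Hypothetical two-column examples of the two output formats.
with_index = '"","roll","capture"\n"1","I13_R2","188"\n'
without_index = "roll,capture\nI13_R2,188\n"

cleaned = strip_row_number_column(with_index)
unchanged = strip_row_number_column(without_index)
```

Both inputs come out in the same format, so a reader that expects no row-number column can handle either file.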

meredithspalmer commented 8 years ago

Okay, cleaned S9 has been reuploaded.

mkosmala commented 8 years ago

S9 and S10 have been imported. Note that invalid code 3 wasn't used at all for either season. Seems a bit weird? @palme516 When you get the chance can you check these summaries from the database to make sure they look reasonable?

S9
478 rolls
372,617 capture events
989,761 images

369,564 capture events ready to go to Zooniverse (99.18%)
By invalid code:
0: 369,564
1: 3,053
2: 0
3: 0

S10
326 rolls
253,200 capture events
686,123 images

253,197 capture events ready to go to Zooniverse (all but 3!)
By invalid code:
0: 253,046
1: 3
2: 151
3: 0

meredithspalmer commented 8 years ago

@mkosmala In Africa right now, will check it out first thing when I get back next week.

meredithspalmer commented 8 years ago

This is correct --- I had no invalid3's for either season.

mkosmala commented 8 years ago

And how do the summaries look? (Are there no invalid 3's because the error-checking scripts are better than they used to be? We had invalid 3's in every previous season...)

meredithspalmer commented 8 years ago

I'm getting very slightly different numbers:

S9: 369,566 CE sent to Zooniverse S10: 253199 CE total, 253199 CE sent to Zooniverse

But same number of images, rolls, etc. -- could you send me an extended summary so I can look at where the differences occur?

According to the notes in my ActionLists, the majority of the invalids in these seasons are misfires with no animals present and bad/stuck timestamps, so they wouldn't provide any information if kept as Invalid 3's (which are only kept for non-temporal data).

meredithspalmer commented 8 years ago

Thanks for sending me that data, @mkosmala -- here's what I found:

S9: I get the same number of rolls, images, and capture events overall and for INVALID 0, 2, & 3.
Issue 1) I come up with 37 more capture events for INVALID 1 (MSP: 3,090 vs. MK: 3,053)

Issue 1) You seem to have INVALID 1 capture events only in rolls E13_R1, G01_R2, I13_R1, L10_R1, & S13_R3; I find INVALID 1 capture events in my data for E13_R1, G01_R2, I13_R1, L10_R1, S13_R3, F04_R2, & I13_R2

S10: I get the same number of rolls, images, and capture events for INVALID 0, 2, & 3.
Issue 1) I come up with 1 less overall capture event (253,199 vs. 253,200)
Issue 2) I come up with 5 more INVALID 1 capture events (8 vs. 3)

Issue 1) There appears to be an extra capture event in your dataset in H06_R2

Issue 2) You have INVALID 1 capture events only in rolls B04_R1, H06_R2, & M09_3; I find INVALID 1 capture events in B04_R1, H06_R2, M09_R3, I13_R2, L10_2, O12_R3, & P11_R1
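The roll-by-roll comparison above amounts to a set difference. A small Python sketch using the S9 roll names quoted in this comment (the function name is illustrative, not part of any existing script):

```python
def roll_differences(mine, theirs):
    """Return (rolls only in mine, rolls only in theirs), each sorted."""
    mine, theirs = set(mine), set(theirs)
    return sorted(mine - theirs), sorted(theirs - mine)


# S9 rolls with INVALID 1 capture events, as listed in this comment.
s9_msp = ["E13_R1", "G01_R2", "I13_R1", "L10_R1", "S13_R3", "F04_R2", "I13_R2"]
s9_mk = ["E13_R1", "G01_R2", "I13_R1", "L10_R1", "S13_R3"]

only_msp, only_mk = roll_differences(s9_msp, s9_mk)
```

For S9 this isolates F04_R2 and I13_R2 as the rolls to inspect.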

mkosmala commented 8 years ago

For S9, Issue 1:

For S10, Issue 1:

For S10, Issue 2:

Also, please explain how you can have the same number of overall capture events, Invalid=0, Invalid=2, and Invalid=3 capture events, but have a different number of Invalid=1 capture events. The math doesn't add up. If you have a different number of Invalid=1 capture events, but the same number of overall capture events, then Invalid=0, 2, or 3 should be changing too...
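To spell out the arithmetic with the S9 figures from this thread, here is a quick Python check. It assumes, as reported above, that only the Invalid=1 count differs between the two tallies; the variable names are illustrative.

```python
s9_total = 372_617  # overall S9 capture events from the database summary

# Per-code counts from the database summary (MK).
mk_counts = {0: 369_564, 1: 3_053, 2: 0, 3: 0}

# Same counts, except Invalid=1 as reported by MSP.
msp_counts = {**mk_counts, 1: 3_090}

mk_sum = sum(mk_counts.values())
msp_sum = sum(msp_counts.values())
excess = msp_sum - s9_total
```

The database counts sum to the overall total, while the second tally overshoots it by 37, so those 37 capture events must also be counted under another code or belong to captures that are only partly invalid.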

mkosmala commented 8 years ago

Hey, I just looked at S10 I13_R2 in the cleaning scripts. Capture event number 188 has four associated images:
S10/I13/I13_R2/S10_I13_R2_IMAG0449.JPG
S10/I13/I13_R2/S10_I13_R2_IMAG0450.JPG
S10/I13/I13_R2/S10_I13_R2_IMAG0451.JPG
S10/I13/I13_R2/S10_I13_R2_IMAG0452.JPG

Only the last of these four images is marked invalid. Can you see what's going on? Is that one image (out of the entire roll) really supposed to be invalid?! If so, should it be invalidating the entire capture, or are the other three images in the capture good?

I think we need the cleaning script to address this sort of issue, rather than the uploading script. The database considers captures valid or invalid -- not images. So if the last image here is really invalid, we'd need it assigned to its own capture event, which would then be marked invalid. (Or we could just not upload its existence to the database at all.) @aliburchard opinion?
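One way the cleaning step could flag such cases is to look for capture events whose images disagree on validity. A rough Python sketch, assuming per-image rows of (roll, capture number, image, invalid code); this row layout, the invalid code 1 on the last image, and capture 187 are assumptions for illustration, not the real cleaned-file format:

```python
from collections import defaultdict


def mixed_validity_captures(image_rows):
    """Return (roll, capture) pairs containing both valid and invalid images."""
    seen = defaultdict(set)
    for roll, capture, _image, invalid in image_rows:
        seen[(roll, capture)].add(invalid != 0)
    return sorted(key for key, flags in seen.items() if len(flags) == 2)


rows = [
    ("I13_R2", 187, "IMAG0446", 0),  # hypothetical fully valid capture
    ("I13_R2", 187, "IMAG0447", 0),
    ("I13_R2", 187, "IMAG0448", 0),
    ("I13_R2", 188, "IMAG0449", 0),
    ("I13_R2", 188, "IMAG0450", 0),
    ("I13_R2", 188, "IMAG0451", 0),
    ("I13_R2", 188, "IMAG0452", 1),  # only the last image marked invalid (code assumed)
]

flagged = mixed_validity_captures(rows)
```

Each flagged capture would then need a decision: split off the invalid image, or invalidate the whole capture.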

aliburchard commented 8 years ago

Hmm, I do remember this situation popping up on occasion, and I'm ambivalent. Manually creating a capture event would be a pain because it would be out of order or require re-numbering subsequent captures. On the other hand, while I'm not madly in love with having images that have no record of existing in the database, if it's a relatively rare occurrence and we document it somewhere, that might be way easier. It just means that numbers won't align perfectly for easy cross-checking of file counts vs. number of image records...

meredithspalmer commented 8 years ago

We had a big to-do last time about capture events having more than three images in them. For these seasons, if the images had nothing in them (sticking misfires), I invalidated the >3 images in the capture event so the first three images are fine and the rest are invalid (which I'm almost 100% sure is the solution we came to last time). If the capture event had >3 images that were obviously multiple capture events stuck together, I separated them and renumbered all capture events in the roll. It would be a really big pain to make the extra images into their own capture events just to invalidate them, because of all the roll renumbering that has to happen, but we could write that into the cleaning script for future seasons -- let me know if this is the best solution.

So this seems to be the primary reason for our discrepancies:

For S10 H06_2 (S10 issue 1), according to my notes in the action list, the first image (0061) of the 4 image CE was corrupted, so I invalidated that image, made it its own CE, and renumbered the other images (0062-0064) to be 1-3 of their own CE. (The action lists are posted on MSI and outline all invalidations made and the reasoning behind them -- it might save back-and-forth time if we both consulted these first with questions?) It looks like the issue arose because both these CEs have the same CE number (23). We could renumber all the CEs in the roll if that would solve this problem? I must have not done that in the first cleaning.

mkosmala commented 8 years ago

Okay, so if I understand it correctly, all these "extra" images aren't "wrong" per se, there are just too many images per capture, right? And all the affected captures are no-animals-present captures, also right? If so, then I don't think it makes sense to spend a lot of time removing these images from their capture events and invalidating them separately. They haven't gone to Zooniverse, which is fine, but they're not important either way for image classification. So I'm just going to leave them in their current capture events in the database. Shouldn't cause any problems (other than non-matching verification counts!)

For S10 H06_2 (S10 issue 1), renumbering the whole roll would be a headache now that it's all in the database, so let's not. I'll simply move the corrupted image to a new capture event (numbering at the end of the roll) and invalidate it. I think the current cleaned file at MSI should cause no problem for exporting to Zooniverse.
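As a rough illustration of that fix, here is a Python sketch that moves one image to a new capture event numbered after the last capture in the roll, without renumbering anything else. The in-memory dict stands in for the database, whose schema is not shown in this thread; image 0065, capture 24, and the invalid code 1 are assumptions.

```python
def move_image_to_new_capture(roll_images, image_id, invalid_code):
    """Reassign one image to a new capture event at the end of the roll."""
    new_capture = max(rec["capture"] for rec in roll_images.values()) + 1
    roll_images[image_id] = {"capture": new_capture, "invalid": invalid_code}
    return new_capture


# Hypothetical slice of the S10 H06 roll: image id -> capture number and invalid code.
h06_roll = {
    "0061": {"capture": 23, "invalid": 0},  # the corrupted image
    "0062": {"capture": 23, "invalid": 0},
    "0063": {"capture": 23, "invalid": 0},
    "0064": {"capture": 23, "invalid": 0},
    "0065": {"capture": 24, "invalid": 0},  # assumed last capture in the roll
}

new_capture = move_image_to_new_capture(h06_roll, "0061", invalid_code=1)
```

Capture 23 keeps images 0062-0064, and the corrupted image ends up alone in an invalid capture at the end of the roll.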

Still need to do this last step, so I'll keep this issue open for now.

meredithspalmer commented 8 years ago

Sounds good!