drphilmarshall / SpaceWarps

Science Team Website Development and Analysis
MIT License
12 stars 18 forks

How important are the volunteers' agents? #26

Closed: anupreeta27 closed this issue 10 years ago

anupreeta27 commented 10 years ago

Question: what happens if we do not use the agents for the volunteers and instead give equal weight to every volunteer's classifications? How different is our assessment of the P value for a given subject? Is our completeness or efficiency for finding lenses affected significantly?
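For concreteness, the agents enter through a Bayesian update of each subject's lens probability. A minimal sketch of that kind of update (the standard two-class crowd-sourcing form, hedged rather than lifted from the SWAP source), where PL and PD are an agent's probabilities of correctly classifying a lens and a dud:

```python
def update_subject(P, PL, PD, said_lens):
    """One Bayesian update of a subject's lens probability P after a
    classification by a volunteer with skills PL = Pr("LENS" | lens)
    and PD = Pr("NOT" | dud). "Equal weights" would mean using the same
    fixed PL = PD for every volunteer, so P depends only on vote counts."""
    if said_lens:
        return PL * P / (PL * P + (1.0 - PD) * (1.0 - P))
    return (1.0 - PL) * P / ((1.0 - PL) * P + PD * (1.0 - P))
```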

drphilmarshall commented 10 years ago

I did some of these tests early on (hence the agents_willing_to_learn keyword in the config file). The main thing that happens is that the false negative rate goes up: lenses tend to be missed if they are not seen by a skilled volunteer. We could potentially re-run SWAP on CFHTLS stage 1 with PD=PL=0.7 (say) and not have the agents learn, to give a control to compare against.
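A minimal sketch of what the agents_willing_to_learn switch amounts to (illustrative names and bookkeeping, not the actual agent.py): with learning off, each agent's PL and PD stay pinned at their initial values, which is exactly the fixed PD = PL = 0.7 control described above.

```python
class ToyAgent:
    """Toy volunteer agent with skills PL = Pr("LENS" | lens) and
    PD = Pr("NOT" | dud), estimated from running (right, seen) counts."""

    def __init__(self, initialPL=0.5, initialPD=0.5, willing_to_learn=True):
        self.willing_to_learn = willing_to_learn
        # Seed the counts so the starting estimates equal the initial values:
        self.NL_right, self.NL_seen = initialPL, 1.0
        self.ND_right, self.ND_seen = initialPD, 1.0

    @property
    def PL(self):
        return self.NL_right / self.NL_seen

    @property
    def PD(self):
        return self.ND_right / self.ND_seen

    def heard_training(self, said_lens, truly_lens):
        """Skill update after classifying a training subject with known
        truth. An unwilling agent keeps PL and PD frozen forever."""
        if not self.willing_to_learn:
            return
        if truly_lens:
            self.NL_seen += 1.0
            self.NL_right += 1.0 if said_lens else 0.0
        else:
            self.ND_seen += 1.0
            self.ND_right += 0.0 if said_lens else 1.0
```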


anupreeta27 commented 10 years ago

Yes, a comparison plot will be nice to show in the paper, demonstrating that agent learning is clearly better (even though it may be obvious).


cpadavis commented 10 years ago

Howdy,

I am currently rerunning on stage1 and stage2 with supervised and agents_willing_to_learn on and off. The only other settings I modified were to set initialPL and initialPD for the unwilling-agent / unsupervised runs to 0.75.

"Unsupervised" in SWAP means that updating PL and PD is done ONLY on the test images, using the current mean_probability at time of evaluation. Currently I don't think the code gives an option to let you use both the test and training images.

This contrasts with the default behavior of my offline system, which uses both. In the ROC curves below I also include what happens if the offline system likewise ignores the true values of the training images when updating its estimates of PL and PD.
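Roughly, and with illustrative names rather than the real SWAP interfaces (subject is assumed to carry kind, truly_lens and mean_probability attributes; heard_training is from the sketch above), the supervised switch routes the skill update like this, with the test-image update treating the subject's current mean probability as a fractional label:

```python
def heard_test(agent, said_lens, p):
    """Unsupervised skill update on a test subject whose current
    mean_probability is p: saying "LENS" counts as p correct toward PL
    and (1 - p) incorrect toward PD, and vice versa for "NOT"."""
    agent.NL_seen += p                         # lens side, weighted by p
    agent.NL_right += p if said_lens else 0.0
    agent.ND_seen += 1.0 - p                   # dud side, weighted by 1 - p
    agent.ND_right += 0.0 if said_lens else (1.0 - p)


def skill_update(agent, subject, said_lens, supervised=True):
    # Hypothetical routing implied by the description above: only one
    # kind of subject ever teaches the agent, depending on the switch.
    if supervised and subject.kind == "training":
        agent.heard_training(said_lens, subject.truly_lens)
    elif not supervised and subject.kind == "test":
        heard_test(agent, said_lens, subject.mean_probability)
```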

I was somewhat weirded out that the stage1 unsupervised run did *better*, so I reran it with initialPL and initialPD back at 0.5 to see what happens in both stages. Stage2 is finished (see the red line on the stage2-only ROC curve), but the stage1 unsupervised, initially-confused run is still ongoing.

The stars mark the point where P = 0.95

[Figures: roc_stage1, roc_stage2, roc_stageall — ROC curves for stage 1, stage 2, and both stages combined]
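The stars can be read as single-threshold points on each curve. A minimal sketch of how such a point is computed, assuming arrays p of posterior probabilities and y of true labels rather than anything from the actual plotting code:

```python
import numpy as np

def roc_point(p, y, threshold=0.95):
    """(FPR, TPR) of the rule "flag as lens if P >= threshold", given
    posterior probabilities p and true labels y (1 = lens, 0 = dud)."""
    flagged = p >= threshold
    TPR = np.sum(flagged & (y == 1)) / np.sum(y == 1)
    FPR = np.sum(flagged & (y == 0)) / np.sum(y == 0)
    return FPR, TPR
```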

drphilmarshall commented 10 years ago

Good stuff - lots of interesting things here, for sure! First, can you confirm the initial PD, PL and willing_to_learn flag for each run please, as well as whether it's using training only, test only, or training+test? At least one of my points below needs this info.

- Corollary is that (I think) iPD = iPL = 0.5 unsupervised ("confused" is not the best term, I think - new?) will have a worse TPR, maybe even worse than standard SWAP, because there will be so little information about lenses there, but let's see.
- Did you do an online SWAP run with iPD = iPL = 0.75, supervised (training only)? My early tests showed this to have a higher false negative rate, so we didn't use it - but it'd be good to double check...
- Why are the stage 1 unwilling and stage 1 unwilling-and-unsupervised curves different? If the agents are not learning, it shouldn't matter whether they are supervised or not... Is there a bug? This is what made me think it would be worth double checking all the settings and writing them out in a table...
- What's your conclusion about offline analysis at stage 1? Worth doing in the next project, or not? The gain in TPR at high FPR is interesting, but at stage 1 we mainly care about maximizing TPR (completeness).
- Are the stars at the 0.95 point on each curve? If so, it's good to see that they come out to be at approximately the same place on each curve (on the vertical, just before the knee). I guess we only need to show these at stage 1, because this is the only time we used a threshold in P (to define the stage 2 sample).

Can you now please make one plot with:

Stage 1:

- Online learning (M_0 = 0.5) [i.e. standard SWAP, the one we actually ran]
- No learning (M_0 = 0.75) [i.e. simple voting, no user weighting - does significantly worse]
- Online learning (M_0 = 0.75) [standard SWAP but more optimistic about the crowd]
- Offline learning (M_0 = 0.5) [offline analysis, not much gained at this stage]

Stage 2: The same categories (and matching line styles)

i.e. none of the unsupervised curves - they should all go in the follow-up paper. The point of this plot is to justify the choices we made in the SW CFHTLS project, and it's the one I think we want to put in the paper. Thanks Chris!


cpadavis commented 10 years ago

Hi Phil &c,

> Good stuff - lots of interesting things here, for sure! First, can you confirm the initial PD, PL and willing_to_learn flag for each run please, as well as whether it's using training only, test only, or training+test? At least one of my points below needs this info.

For the onlines:

| line | initial PD/PL | willing_to_learn | supervised | PD/PL updated from |
|------|---------------|------------------|------------|--------------------|
| solid | 0.5 | True | True | training only |
| dotted | 0.75 | True | False | test only |
| dashed | 0.75 | False | True | training only |
| dot-dash | 0.75 | False | False | test only |

Offline:

- dashed: initial PD/PL = 0.75, equivalent to willing_to_learn = True, uses training+test
- dotted: initial PD/PL = 0.75, equivalent to willing_to_learn = True, treats the trainings as equivalent to test info (does NOT discard them like above, but treats them as "unknowns" like the test images)

Yes, I realize this means that for issue #22 (I think that's the number) one of the things to be done is to run with the trainings totally excluded and the tests totally excluded, in order to make apples-to-apples comparisons. What's one more pickle, I guess (:
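One common way to realize such an offline pass, sketched in a Dawid-Skene spirit and reusing the update_subject and heard_test sketches above; reset_counts is a hypothetical helper that re-seeds an agent's (right, seen) counts, and none of this is the actual offline code:

```python
def offline_analysis(history, agents, n_sweeps=10, prior=2e-4):
    """history: the full list of (agent, subject, said_lens) records.
    Alternate between re-deriving subject probabilities from the current
    skills and re-deriving the skills from the resulting soft labels."""
    P = {}
    for _ in range(n_sweeps):
        # Pass 1: rebuild every subject probability from the prior,
        # replaying all classifications with the current skill estimates.
        P = {}
        for agent, subject, said_lens in history:
            p = P.get(subject, prior)
            P[subject] = update_subject(p, agent.PL, agent.PD, said_lens)
        # Pass 2: rebuild every agent's skills against those soft labels
        # (trainings can enter with p = 0 or 1, or be treated as unknowns
        # like the test images, as in the dotted offline run).
        for agent in agents:
            agent.reset_counts()  # hypothetical: re-seed the counts
        for agent, subject, said_lens in history:
            heard_test(agent, said_lens, P[subject])
    return P
```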

> Corollary is that (I think) iPD = iPL = 0.5 unsupervised ("confused" is not the best term, I think - new?) will have a worse TPR, maybe even worse than standard SWAP, because there will be so little information about lenses there, but let's see.

I hope so (I keep getting a memory error even after making more than enough room in my user space for this run!).

Maybe a better term than confused is "suspicious"?

> Did you do an online SWAP run with iPD = iPL = 0.75, supervised (training only)? My early tests showed this to have a higher false negative rate, so we didn't use it - but it'd be good to double check...

I have not yet -- do you want both stage1 and stage2, or would stage2 suffice?

> Why are the stage 1 unwilling and stage 1 unwilling-and-unsupervised curves different? If the agents are not learning, it shouldn't matter whether they are supervised or not... Is there a bug? This is what made me think it would be worth double checking all the settings and writing them out in a table...

Good catch! I may have run these before the update to agent.py; I'll do that tonight to be sure.

> What's your conclusion about offline analysis at stage 1? Worth doing in the next project, or not? The gain in TPR at high FPR is interesting, but at stage 1 we mainly care about maximizing TPR (completeness).

I think offline analysis benefits from a few things versus the original online system:

(I feel like I had a third point in mind.) I think with these two it's a bit of a wash as far as clear-cut benefits go, especially since the concern is retiring images as quickly as possible so attention can turn to useful lenses. To that end, maybe the right thing to do is to rerun the offline analysis each time you go through a "retirement" phase. It will probably find the same set of images that should be retired / promoted, plus some extra set in each group depending on your thresholding. The ones that aren't in both sets (or at least the ones in the online set but not the offline one) are the ones worth saving for future examination.

If in stage1 we care about maximizing the TPR, then isn't it OK that you also get a high FPR? -- stage2 can then be used to cull the FPs...

> Are the stars at the 0.95 point on each curve? If so, it's good to see that they come out to be at approximately the same place on each curve (on the vertical, just before the knee). I guess we only need to show these at stage 1, because this is the only time we used a threshold in P (to define the stage 2 sample).

Yes you're right.

> Can you now please make one plot with:
>
> Stage 1:
>
> - Online learning (M_0 = 0.5) [i.e. standard SWAP, the one we actually ran]
> - No learning (M_0 = 0.75) [i.e. simple voting, no user weighting - does significantly worse]
> - Online learning (M_0 = 0.75) [standard SWAP but more optimistic about the crowd]
> - Offline learning (M_0 = 0.5) [offline analysis, not much gained at this stage]
>
> Stage 2: The same categories (and matching line styles)
>
> i.e. none of the unsupervised curves - they should all go in the follow-up paper. The point of this plot is to justify the choices we made in the SW CFHTLS project, and it's the one I think we want to put in the paper. Thanks Chris!

on it!

Oh, should I also put the resulting catalogs from the offline analysis on the GitHub? Or just move them someplace I can link to? They're currently on the SLAC ki-ls machines at /nfs/slac/g/ki/ki18/cpd/swap/pickles/ but I don't think I have enough space to share them in my public afs space...

---chris

cpadavis commented 10 years ago

(I'm at, I think, 18 different runs now, and trying to keep track of the different permutations we wanted to check out, so if there seems to be an extra line or two in some of the plots below, that is probably why! The new terminology here: suspicious -> initial PL, PD = 0.5; optimist -> initial PL, PD = 0.75.)

So on the difference between unwilling and unwilling_unsupervised:

I reran after pulling again, and there were still small differences. Could the differences be due to the different NT/ND/NL values between looking at test vs training? (For the unsupervised runs, the incrementing of ND etc. depends on the current mean_probability.) I also wondered if maybe I was messing up with the random_state.pickle file, so I reran the unwilling_unsupervised again for stage2, and it made a very small difference:

[Figure: stage2_all — stage 2 ROC curves for all runs]

and for stage1:

[Figure: stage1_all — stage 1 ROC curves for all runs]

(I seem to have fixed my old problems with convergence from the initial starting PL and PD values too; offline_suspicious is the offline run with PL and PD at 0.5. I'm not really sure why or how, but I'm going to just roll with it for now.

EDIT: that actually isn't quite right. Initialization matters if you aren't using the known true values from the training images -- see issue #22.)

Here are the plots you requested earlier:

[Figures: stage1_request, stage2_request — the requested stage 1 and stage 2 ROC comparisons]

drphilmarshall commented 10 years ago

Great, thanks Chris! I added the two bottom plots to the paper along with some text; please check my conclusions. You can refine the numbers if you like.

I think the next thing to do is plot completeness against purity (I explain how in the text), so we can draw conclusions that astronomers understand. Purity in particular is important, as it's quite different from FPR. I think we might show this with both stage 1 and stage 2 on the same plot; what do you think?
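Definitions only, not tied to the SWAP code, with made-up numbers in the comment: completeness is the TPR, TP / (TP + FN), while purity is TP / (TP + FP); because non-lenses vastly outnumber lenses, even a small FPR can mean low purity.

```python
def completeness_and_purity(TP, FP, FN):
    completeness = TP / (TP + FN)  # fraction of true lenses recovered (TPR)
    purity = TP / (TP + FP)        # fraction of candidates that are real
    return completeness, purity

# Toy illustration of why purity differs from FPR: with 100 lenses and
# 100,000 duds, TPR = 0.9 and FPR = 0.01 give TP = 90 and FP = 1000,
# so purity is only 90 / 1090, about 8%.
print(completeness_and_purity(TP=90, FP=1000, FN=10))  # (0.9, ~0.083)
```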

So: offline stage 2 seems to give a very pure sample, no? 85% TPR at 0% FPR... Interested to see how this translates to the test subjects!

Re: your other plots: it seems like the big (well, 5%) gain in stage 1 TPR comes from switching to "unsupervised". Once again, can you remind me exactly what this means, please? I guess you might start putting together the plots you'd want to show in the follow-up study too. Nice work!


cpadavis commented 10 years ago

Excellent, I will take a look at the paper.

Re: unsupervised. This simply means I set the "supervised" parameter in the config to 0 (so this is entirely within the online system). Here is my understanding of what that does: with supervised set to 1, updates to a user's PL and PD are made every time the user assesses a training image. When it is set to 0, updates to a user's PL and PD are made with each test image instead, but not with the training images; in the case where we assume that feedback does nothing (so a user comes in fully educated), it's as if the training images don't exist and the survey was done entirely with the unknown data. This update is based on the current mean_probability of an image, so the updates to PD and PL are weighted by this mean_probability (for a 70% likely lens, an evaluation of "lens" is treated as being 70% correct and 30% incorrect).
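A toy check of that description, using the ToyAgent and heard_test sketches from earlier in this thread (the exact numbers follow from the toy seeding, not from SWAP itself):

```python
a = ToyAgent(initialPL=0.75, initialPD=0.75)
heard_test(a, said_lens=True, p=0.7)   # calls a 70%-likely lens "LENS"
print(round(a.PL, 3), round(a.PD, 3))  # 0.853 0.577: PL credited 70%,
                                       # PD debited the remaining 30%
```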


drphilmarshall commented 10 years ago

OK, got it. And for the curve you showed, was that with initial PD and PL of 0.5, or 0.75?


cpadavis commented 10 years ago

The stage1_unsupervised run started at initialPL, initialPD = 0.75. For some reason the stage1 run with PL and PD at 0.5 keeps failing at the pickle-saving stage, complaining of a memory allocation error (I'll rerun it again and copy the failure when it happens).


cpadavis commented 10 years ago

I reran with initialPL and PD at 0.5 and it is nearly indistinguishable from the stage1_unsupervised line. I think that settles the discussion here; further discussion continues in #62.