drphilmarshall / SpaceWarps

Science Team Website Development and Analysis
MIT License
12 stars 18 forks

How important are the volunteers' agents? #26

Closed: anupreeta27 closed this issue 10 years ago

anupreeta27 commented 10 years ago

Question: what happens if we do not use the agents for the volunteers and instead give equal weight to every volunteer's classifications? How different is our assessment of the P value for a given subject? Is our completeness or efficiency for finding lenses affected significantly?
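For concreteness, the agents enter through a Bayesian update of each subject's lens probability. A minimal sketch of that kind of update (the standard two-class crowd-sourcing form, hedged rather than lifted from the SWAP source), where PL and PD are an agent's probabilities of correctly classifying a lens and a dud:

```python
def update_subject(P, PL, PD, said_lens):
    """One Bayesian update of a subject's lens probability P after a
    classification by a volunteer with skills PL = Pr("LENS" | lens)
    and PD = Pr("NOT" | dud). "Equal weights" would mean using the same
    fixed PL = PD for every volunteer, so P depends only on vote counts."""
    if said_lens:
        return PL * P / (PL * P + (1.0 - PD) * (1.0 - P))
    return (1.0 - PL) * P / ((1.0 - PL) * P + PD * (1.0 - P))
```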

drphilmarshall commented 10 years ago

I did some of these tests early on (hence the agents_willing_to_learn keyword in the config file). The main thing that happens is that the false negative rate goes up: lenses tend to be missed if they are not seen by a skilled volunteer. We could potentially re-run SWAP on CFHTLS stage 1 with PD=PL=0.7 (say) and not have the agents learn, to give a control to compare against.
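A minimal sketch of what the agents_willing_to_learn switch amounts to (illustrative names and bookkeeping, not the actual agent.py): with learning off, each agent's PL and PD stay pinned at their initial values, which is exactly the fixed PD = PL = 0.7 control described above.

```python
class ToyAgent:
    """Toy volunteer agent with skills PL = Pr("LENS" | lens) and
    PD = Pr("NOT" | dud), estimated from running (right, seen) counts."""

    def __init__(self, initialPL=0.5, initialPD=0.5, willing_to_learn=True):
        self.willing_to_learn = willing_to_learn
        # Seed the counts so the starting estimates equal the initial values:
        self.NL_right, self.NL_seen = initialPL, 1.0
        self.ND_right, self.ND_seen = initialPD, 1.0

    @property
    def PL(self):
        return self.NL_right / self.NL_seen

    @property
    def PD(self):
        return self.ND_right / self.ND_seen

    def heard_training(self, said_lens, truly_lens):
        """Skill update after classifying a training subject with known
        truth. An unwilling agent keeps PL and PD frozen forever."""
        if not self.willing_to_learn:
            return
        if truly_lens:
            self.NL_seen += 1.0
            self.NL_right += 1.0 if said_lens else 0.0
        else:
            self.ND_seen += 1.0
            self.ND_right += 0.0 if said_lens else 1.0
```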


anupreeta27 commented 10 years ago

Yes, a comparison plot will be nice to show in the paper, demonstrating that agent learning is clearly better (even though it may be obvious).


cpadavis commented 10 years ago

Howdy,

I am currently rerunning on stage1 and stage2 with supervised and agents_willing_to_learn on and off. The only other settings I modified were to set initialPL and initialPD for the unwilling-agent / unsupervised runs to 0.75.

"Unsupervised" in SWAP means that updating PL and PD is done ONLY on the test images, using the current mean_probability at time of evaluation. Currently I don't think the code gives an option to let you use both the test and training images.

This contrasts with the default behavior of my offline system, which uses both. In the ROC curves below I also include what happens if the offline system likewise ignores the true values of the training images when updating its estimates of PL and PD.
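Roughly, and with illustrative names rather than the real SWAP interfaces (subject is assumed to carry kind, truly_lens and mean_probability attributes; heard_training is from the sketch above), the supervised switch routes the skill update like this, with the test-image update treating the subject's current mean probability as a fractional label:

```python
def heard_test(agent, said_lens, p):
    """Unsupervised skill update on a test subject whose current
    mean_probability is p: saying "LENS" counts as p correct toward PL
    and (1 - p) incorrect toward PD, and vice versa for "NOT"."""
    agent.NL_seen += p                         # lens side, weighted by p
    agent.NL_right += p if said_lens else 0.0
    agent.ND_seen += 1.0 - p                   # dud side, weighted by 1 - p
    agent.ND_right += 0.0 if said_lens else (1.0 - p)


def skill_update(agent, subject, said_lens, supervised=True):
    # Hypothetical routing implied by the description above: only one
    # kind of subject ever teaches the agent, depending on the switch.
    if supervised and subject.kind == "training":
        agent.heard_training(said_lens, subject.truly_lens)
    elif not supervised and subject.kind == "test":
        heard_test(agent, said_lens, subject.mean_probability)
```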

I was somewhat weirded out that the stage1 unsupervised run did *better*, so I reran it with initialPL and initialPD back at 0.5 to see what happens in both stages. Stage2 is finished (see the red line on the stage2-only ROC curve), but the stage1 unsupervised, initially-confused run is still ongoing.

The stars mark the point where P = 0.95

[Figures: roc_stage1, roc_stage2, roc_stageall — ROC curves for stage 1, stage 2, and both stages combined]
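The stars can be read as single-threshold points on each curve. A minimal sketch of how such a point is computed, assuming arrays p of posterior probabilities and y of true labels rather than anything from the actual plotting code:

```python
import numpy as np

def roc_point(p, y, threshold=0.95):
    """(FPR, TPR) of the rule "flag as lens if P >= threshold", given
    posterior probabilities p and true labels y (1 = lens, 0 = dud)."""
    flagged = p >= threshold
    TPR = np.sum(flagged & (y == 1)) / np.sum(y == 1)
    FPR = np.sum(flagged & (y == 0)) / np.sum(y == 0)
    return FPR, TPR
```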

drphilmarshall commented 10 years ago

Good stuff - lots of interesting things here, for sure! First, can you confirm the initial PD, PL and willing_to_learn flag for each run please, as well as whether it's using training only, test only, or training+test? At least one of my points below needs this info.

- Corollary is that (I think) iPD = iPL = 0.5 unsupervised ("confused" is not the best term, I think - new?) will have a worse TPR, maybe even worse than standard SWAP, because there will be so little information about lenses there, but let's see.
- Did you do an online SWAP run with iPD = iPL = 0.75, supervised (training only)? My early tests showed this to have a higher false negative rate, so we didn't use it - but it'd be good to double check...
- Why are the stage 1 unwilling and stage 1 unwilling-and-unsupervised curves different? If the agents are not learning, it shouldn't matter whether they are supervised or not... Is there a bug? This is what made me think it would be worth double checking all the settings and writing them out in a table...
- What's your conclusion about offline analysis at stage 1? Worth doing in the next project, or not? The gain in TPR at high FPR is interesting, but at stage 1 we mainly care about maximizing TPR (completeness).
- Are the stars at the 0.95 point on each curve? If so, it's good to see that they come out to be at approximately the same place on each curve (on the vertical, just before the knee). I guess we only need to show these at stage 1, because this is the only time we used a threshold in P (to define the stage 2 sample).

Can you now please make one plot with:

Stage 1:

- Online learning (M_0 = 0.5) [i.e. standard SWAP, the one we actually ran]
- No learning (M_0 = 0.75) [i.e. simple voting, no user weighting - does significantly worse]
- Online learning (M_0 = 0.75) [standard SWAP but more optimistic about the crowd]
- Offline learning (M_0 = 0.5) [offline analysis, not much gained at this stage]

Stage 2: The same categories (and matching line styles)

i.e. none of the unsupervised curves - they should all go in the follow-up paper. The point of this plot is to justify the choices we made in the SW CFHTLS project, and it's the one I think we want to put in the paper. Thanks Chris!


cpadavis commented 10 years ago

Hi Phil &c,

> Good stuff - lots of interesting things here, for sure! First, can you confirm the initial PD, PL and willing_to_learn flag for each run please, as well as whether it's using training only, test only, or training+test? At least one of my points below needs this info.

For the onlines:

| line | initial PD/PL | willing_to_learn | supervised | PD/PL updated from |
|------|---------------|------------------|------------|--------------------|
| solid | 0.5 | True | True | training only |
| dotted | 0.75 | True | False | test only |
| dashed | 0.75 | False | True | training only |
| dot-dash | 0.75 | False | False | test only |

Offline:

- dashed: initial PD/PL = 0.75, equivalent to willing_to_learn = True, uses training+test
- dotted: initial PD/PL = 0.75, equivalent to willing_to_learn = True, treats the trainings as equivalent to test info (does NOT discard them like above, but treats them as "unknowns" like the test images)

Yes, I realize this means that for issue #22 (I think that's the number) one of the things to be done is to run with the trainings totally excluded and the tests totally excluded, in order to make apples-to-apples comparisons. What's one more pickle, I guess (:
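One common way to realize such an offline pass, sketched in a Dawid-Skene spirit and reusing the update_subject and heard_test sketches above; reset_counts is a hypothetical helper that re-seeds an agent's (right, seen) counts, and none of this is the actual offline code:

```python
def offline_analysis(history, agents, n_sweeps=10, prior=2e-4):
    """history: the full list of (agent, subject, said_lens) records.
    Alternate between re-deriving subject probabilities from the current
    skills and re-deriving the skills from the resulting soft labels."""
    P = {}
    for _ in range(n_sweeps):
        # Pass 1: rebuild every subject probability from the prior,
        # replaying all classifications with the current skill estimates.
        P = {}
        for agent, subject, said_lens in history:
            p = P.get(subject, prior)
            P[subject] = update_subject(p, agent.PL, agent.PD, said_lens)
        # Pass 2: rebuild every agent's skills against those soft labels
        # (trainings can enter with p = 0 or 1, or be treated as unknowns
        # like the test images, as in the dotted offline run).
        for agent in agents:
            agent.reset_counts()  # hypothetical: re-seed the counts
        for agent, subject, said_lens in history:
            heard_test(agent, said_lens, P[subject])
    return P
```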

> Corollary is that (I think) iPD = iPL = 0.5 unsupervised ("confused" is not the best term, I think - new?) will have a worse TPR, maybe even worse than standard SWAP, because there will be so little information about lenses there, but let's see.

I hope so (I keep getting a memory error even after making more than enough room in my user space for this run!).

Maybe a better term than confused is "suspicious"?

> Did you do an online SWAP run with iPD = iPL = 0.75, supervised (training only)? My early tests showed this to have a higher false negative rate, so we didn't use it - but it'd be good to double check...

I have not yet -- do you want both stage1 and stage2, or would stage2 suffice?

> Why are the stage 1 unwilling and stage 1 unwilling-and-unsupervised curves different? If the agents are not learning, it shouldn't matter whether they are supervised or not... Is there a bug? This is what made me think it would be worth double checking all the settings and writing them out in a table...

Good catch! I may have run these before the update to agent.py; I'll do that tonight to be sure.

> What's your conclusion about offline analysis at stage 1? Worth doing in the next project, or not? The gain in TPR at high FPR is interesting, but at stage 1 we mainly care about maximizing TPR (completeness).

I think offline analysis benefits from a few things versus the original online system:

(I feel like I had a third point in mind.) I think with these two it's a bit of a wash as far as clear-cut benefits go, especially since the concern is retiring images as quickly as possible so attention can turn to useful lenses. To that end, maybe the right thing to do is to rerun the offline analysis each time you go through a "retirement" phase. It will probably find the same set of images that should be retired / promoted, plus some extra set in each group depending on your thresholding. The ones that aren't in both sets (or at least the ones in the online set but not the offline one) are the ones worth saving for future examination.

If in stage1 we care about maximizing the TPR, then isn't it OK that you also get a high FPR? -- stage2 can then be used to cull the FPs...

> Are the stars at the 0.95 point on each curve? If so, it's good to see that they come out to be at approximately the same place on each curve (on the vertical, just before the knee). I guess we only need to show these at stage 1, because this is the only time we used a threshold in P (to define the stage 2 sample).

Yes you're right.

> Can you now please make one plot with:
>
> Stage 1:
>
> - Online learning (M_0 = 0.5) [i.e. standard SWAP, the one we actually ran]
> - No learning (M_0 = 0.75) [i.e. simple voting, no user weighting - does significantly worse]
> - Online learning (M_0 = 0.75) [standard SWAP but more optimistic about the crowd]
> - Offline learning (M_0 = 0.5) [offline analysis, not much gained at this stage]
>
> Stage 2: The same categories (and matching line styles)
>
> i.e. none of the unsupervised curves - they should all go in the follow-up paper. The point of this plot is to justify the choices we made in the SW CFHTLS project, and it's the one I think we want to put in the paper. Thanks Chris!

on it!

Oh, should I also put the resulting catalogs from the offline analysis on the GitHub? Or just move them someplace I can link to? They're currently on the SLAC ki-ls machines at /nfs/slac/g/ki/ki18/cpd/swap/pickles/ but I don't think I have enough space to share them in my public afs space...

---chris

cpadavis commented 10 years ago

(I'm at, I think, 18 different runs now, and trying to keep track of the different permutations we wanted to check out, so if there seems to be an extra line or two in some of the plots below, that is probably why! The new terminology here: suspicious -> initial PL, PD = 0.5; optimist -> initial PL, PD = 0.75.)

So on the difference between unwilling and unwilling_unsupervised:

I reran after pulling again, and there were still small differences. Could the differences be due to the different NT/ND/NL values between looking at test vs training? (For the unsupervised runs, the incrementing of ND etc. depends on the current mean_probability.) I also wondered if maybe I was messing up with the random_state.pickle file, so I reran the unwilling_unsupervised again for stage2, and it made a very small difference:

[Figure: stage2_all — stage 2 ROC curves for all runs]

and for stage1:

[Figure: stage1_all — stage 1 ROC curves for all runs]

(I seem to have fixed my old problems with convergence from the initial starting PL and PD values too; offline_suspicious is the offline run with PL and PD at 0.5. I'm not really sure why or how, but I'm going to just roll with it for now.

EDIT: that actually isn't quite right. Initialization matters if you aren't using the known true values from the training images -- see issue #22.)

Here are the plots you requested earlier:

[Figures: stage1_request, stage2_request — the requested stage 1 and stage 2 ROC comparisons]

drphilmarshall commented 10 years ago

Great, thanks Chris! I added the two bottom plots to the paper along with some text; please check my conclusions. You can refine the numbers if you like.

I think the next thing to do is plot completeness against purity (I explain how in the text), so we can draw conclusions that astronomers understand. Purity in particular is important, as it's quite different from FPR. I think we might show this with both stage 1 and stage 2 on the same plot; what do you think?
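Definitions only, not tied to the SWAP code, with made-up numbers in the comment: completeness is the TPR, TP / (TP + FN), while purity is TP / (TP + FP); because non-lenses vastly outnumber lenses, even a small FPR can mean low purity.

```python
def completeness_and_purity(TP, FP, FN):
    completeness = TP / (TP + FN)  # fraction of true lenses recovered (TPR)
    purity = TP / (TP + FP)        # fraction of candidates that are real
    return completeness, purity

# Toy illustration of why purity differs from FPR: with 100 lenses and
# 100,000 duds, TPR = 0.9 and FPR = 0.01 give TP = 90 and FP = 1000,
# so purity is only 90 / 1090, about 8%.
print(completeness_and_purity(TP=90, FP=1000, FN=10))  # (0.9, ~0.083)
```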

So: offline stage 2 seems to give a very pure sample, no? 85% TPR at 0% FPR... Interested to see how this translates to the test subjects!

Re: your other plots: it seems like the big (well, 5%) gain in stage 1 TPR comes from switching to "unsupervised". Once again, can you remind me exactly what this means, please? I guess you might start putting together the plots you'd want to show in the follow-up study too. Nice work!


cpadavis commented 10 years ago

Excellent, I will take a look at the paper.

Re: unsupervised. This simply means I set the "supervised" parameter in the config to 0 (so this is entirely within the online system). Here is my understanding of what that does: with supervised set to 1, updates to a user's PL and PD are made every time the user assesses a training image. When it is set to 0, updates to a user's PL and PD are made with each test image instead, but not with the training images; in the case where we assume that feedback does nothing (so a user comes in fully educated), it's as if the training images don't exist and the survey was done entirely with the unknown data. This update is based on the current mean_probability of an image, so the updates to PD and PL are weighted by this mean_probability (for a 70% likely lens, an evaluation of "lens" is treated as being 70% correct and 30% incorrect).
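A toy check of that description, using the ToyAgent and heard_test sketches from earlier in this thread (the exact numbers follow from the toy seeding, not from SWAP itself):

```python
a = ToyAgent(initialPL=0.75, initialPD=0.75)
heard_test(a, said_lens=True, p=0.7)   # calls a 70%-likely lens "LENS"
print(round(a.PL, 3), round(a.PD, 3))  # 0.853 0.577: PL credited 70%,
                                       # PD debited the remaining 30%
```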


drphilmarshall commented 10 years ago

OK, got it. And for the curve you showed, was that with initial PD and PL of 0.5, or 0.75?


cpadavis commented 10 years ago

The stage1_unsupervised run started at initialPL, initialPD = 0.75. For some reason the stage1 run with PL and PD at 0.5 keeps failing at the pickle-saving stage, complaining of a memory allocation error (I'll rerun it again and copy the failure when it happens).


cpadavis commented 10 years ago

I reran with initialPL and PD at 0.5 and it is nearly indistinguishable from the stage1_unsupervised line. I think that settles the discussion here; further discussion continues in #62.