davidskalinder / mpeds-coder

MPEDS Annotation Interface
MIT License

Fully understand article assignment and assign first batch #41

Closed davidskalinder closed 4 years ago

davidskalinder commented 4 years ago

Copying from emails:

@davidskalinder sez:

Having thought more about this I wanted to check something. I think the three things that we want the assignment-to-triples strategy to accomplish are:

  • Randomize which coders get which articles (within sampling pools, such as years, if we want)
  • Randomize which coders get the same articles as other coders
  • Ensure that each article gets three coders

So I'm wondering what the reasons are not to use a simpler strategy such as merely randomly assigning articles from a pool to coders who need them and keeping articles in the pool until they have been covered by three coders? It seems like that would accomplish the goals above, since (I think) each article would just get three random draws of coders without replacement? And it seems like it's actually more flexible, since coder dropouts can then simply be handled by another round of the process for available coders? The only caution would be that we'd need to ensure that each sampling pool contains enough articles that the random draws aren't constrained (for example, assigning 100 articles from a pool of 100 assigns them all to the first coder with certainty). I'm not sure exactly what the minimum pool size is, but I think it's no more than a*c/3, where a is the number of articles to assign and c is the number of coders (since c coders each taking a articles produces a*c assignments, and triple coverage means each article absorbs three of them)...

Having said all that, I keep hearing Myra's voice in the back of my head saying "Randomization is hard! Beware naive solutions!"; so I wanted to check to make sure I'm not ignoring some nuance here that could ruin everything. Am I missing something? (If so, maybe there's a reading somewhere about how to do this without overlooking whatever I'm overlooking?)

@olderwoman replies:

The question, I think, is how to ensure that in any given week we've had multiple coders on the same articles so we can check them against each other. We don't want to wait until two years into the project to see how the coders are doing. I'm not 100% sure I understand what you are thinking, but I think this could be accomplished by something like what you are saying:

At time N, get a pool of articles to be assigned next. When a coder is ready for more articles, randomly assign them n from the pool, depending on their pace, and mark the number of coders each article has been assigned to. Remove an article from the assignment pool once 3 coders have been randomly assigned to it. I'm not sure how easy such an algorithm would be to implement.

At some point I’d get down to 2 coders per article and possibly even one.

@davidskalinder replies:

Yes, I think that's what I'm thinking as well. I think the trick to ensuring that we'd get lots of overlaps relatively quickly would be to keep the selection pools relatively small (so, to assign 50 articles to a coder, use a pool of like 400 instead of like 6,000). But if you don't see any problem with the approach below, then I think that's the logic I'll aim for in whatever I set up? I think the only implementation difficulty will be keeping track of how many times an article has been assigned, which I don't think should be too tough...
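A minimal sketch of the pool-and-counter bookkeeping described in this exchange (every name here is hypothetical, not MAI's actual API):

    import random

    def assign_from_pool(pool, coder, n, assigned, max_coverage=3):
        """Randomly draw up to n articles from the pool for one coder.

        `pool` is the list of article IDs still needing coders; `assigned`
        maps article ID -> set of coders already holding it. Hypothetical
        sketch only, not MAI's actual code.
        """
        # Never hand a coder the same article twice.
        eligible = [a for a in pool if coder not in assigned.setdefault(a, set())]
        picks = random.sample(eligible, min(n, len(eligible)))
        for article_id in picks:
            assigned[article_id].add(coder)
            # Retire the article once it has its third coder.
            if len(assigned[article_id]) >= max_coverage:
                pool.remove(article_id)
        return picks

    # Example: 50 articles for one coder from a pool of 400.
    pool, assigned = list(range(400)), {}
    queue = assign_from_pool(pool, 'coder_1', 50, assigned)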

davidskalinder commented 4 years ago

The more I look at this, the more I think I should leave the current setup as it is. There's a lot of code in MAI to handle this (a whole library, in fact! Not that that's all that unusual in general, but it is for MAI), and it looks like it should be working the same as it ever did.

So I think a better plan will be to review the code to make sure I understand what it's doing and then to properly enable sorting queues by date: first by fixing #39, then by sticking an ORDER BY clause into whichever query actually inserts the queue items...
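The sorting step itself should be trivial once #39 is fixed; a sketch under assumed names (`pub_date` is a guess at the metadata field, and the real fix belongs in MAI's own insert query rather than in Python):

    def sort_queue_chronologically(sampled_articles):
        """Order a coder's sampled articles by publication date before
        they're inserted into the queue. `pub_date` is an assumed field
        name, not necessarily MAI's schema."""
        return sorted(sampled_articles, key=lambda article: article['pub_date'])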

davidskalinder commented 4 years ago

All right, I think I understand how everything works now. Here's the backbone of the code, with comments:

            bins       = assign_lib.createBins(user_ids, group_size)

This creates all possible combinations ("bins") of size group_size chosen from the group of selected user_ids. The order of the bins is randomized.
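In other words, something like this sketch (a reconstruction of the described behavior, not the library's actual code):

    import itertools
    import random

    def create_bins(user_ids, group_size):
        """Every possible coder group ("bin") of size group_size, shuffled.
        Sketch of createBins as described above."""
        bins = list(itertools.combinations(user_ids, group_size))
        random.shuffle(bins)
        return bins

    # 12 coders with triple coverage -> C(12,3) = 220 bins:
    len(create_bins(range(12), 3))  # 220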

            num_sample = assign_lib.generateSampleNumberForBins(num, len(user_ids), group_size)

This finds the smallest number of articles necessary to evenly cover all the bins, given how many articles are assigned to each coder. For example, 12 coders with triple coverage means 220 bins; to cover 220 bins evenly requires at least 55 articles per coder. So if the admin user requests between 0 and 54 articles per user, num_sample will be 0 and nothing will be assigned; if the admin requests 55-109, num_sample will be 220 and 220 articles will be assigned; if the admin requests 110-164, it'll be 440; and so on.
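The arithmetic, reconstructed as a sketch (my reading of the described behavior, not the library's code):

    from math import comb

    def generate_sample_number_for_bins(num, num_coders, group_size):
        """Smallest article count that covers every bin evenly given a
        requested per-coder load. With 12 coders and group_size 3 there
        are C(12,3) = 220 bins, and each coder sits in C(11,2) = 55 of
        them, so one full pass over the bins costs each coder 55 articles.
        """
        total_bins = comb(num_coders, group_size)                  # e.g. 220
        per_coder_per_pass = comb(num_coders - 1, group_size - 1)  # e.g. 55
        full_passes = num // per_coder_per_pass                    # 0-54 -> 0, 55-109 -> 1, ...
        return full_passes * total_bins                            # 0, 220, 440, ...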

            if db_name:
                articles = assign_lib.generateSample(num_sample, db_name, pass_number = 'ec')

This is the one we'll probably always use (by specifying which article database we want). This randomly samples num_sample articles from all the articles that aren't in anybody's queue yet and returns those article IDs in a random order.

            else:
                articles = assign_lib.generateSample(num, None, 'ec', pub)

I doubt we'll use this one, but this gets articles when the publication name is specified. NB that this list gets sorted by (MAI) ID for some reason.
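Both branches boil down to something like this sketch (the real function does its filtering in SQL; the names here are illustrative):

    import random

    def generate_sample(num_sample, unassigned_ids, pub=None):
        """Randomly sample articles that aren't in anyone's queue yet,
        returned in random order. Per the note above, the publication
        branch returns its list sorted by (MAI) ID instead."""
        ids = list(unassigned_ids)
        sample = random.sample(ids, min(num_sample, len(ids)))
        return sorted(sample) if pub else sample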

            assign_lib.assignmentToBin(articles, bins, pass_number = 'ec')

This takes the first article sampled above and assigns it to everybody in the first bin, then moves on to the second article and the second bin and does the same, then to the third, and so on. The articles come out even because generateSampleNumberForBins always returns a number that evenly covers all possible bins.
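So the walk over articles and bins looks something like this sketch (not MAI's actual code):

    from itertools import cycle

    def assignment_to_bin(articles, bins, queues):
        """Walk the articles and bins in lockstep, wrapping around the bin
        list; everyone in a bin gets that bin's article. Coverage comes out
        even because len(articles) is a multiple of len(bins).

        `queues` maps coder ID -> list of assigned article IDs.
        """
        for article_id, bin_ in zip(articles, cycle(bins)):
            for coder in bin_:
                queues.setdefault(coder, []).append(article_id)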

davidskalinder commented 4 years ago

So, the big issue here tech-wise is that none of the existing code can handle limiting the pool of articles to choose from: it will always sample evenly from all unassigned articles. So if we want coders to get temporally clustered queues in order to gain expertise, we need to do something new. That leaves us with several implementation options:

  1. I could fix issues #39 and #1, with moderate time/effort. Then, one of the following:
    i. Build in proper functionality to specify a limited article pool, either:
      1. Manually, or
      2. With some clever algorithm to build temporally-clustered pools.
    ii. Work around the clustering issue by only adding to the production database a small range of articles (say, from 1993 only). When we're done with those, we can assign a new temporal block.
  2. I could, as I originally planned, follow the logic of the code above but do it manually in spreadsheets with the article IDs. This would let me manually limit the articles to a certain date range, and let me sort the resulting lists by date.

Option 1i is not achievable this week, so that leaves options 1ii and 2. I think the time needed for each of these is roughly the same, so I'm inclined to do 1ii since some of that will be useful going forward, whereas 2 is a strictly temporary fix. @olderwoman, do you have a strong opinion? Note that I think the whole team needs to consider some larger study-design implications beyond these strictly technical points, so I'm going to make another comment for those and mention everybody.

olderwoman commented 4 years ago

I agree this is a tough one and it is hard to figure out all the pluses and minuses of the different options.


davidskalinder commented 4 years ago

Okay, so @olderwoman, @matthewsmorgan, and @limchaeyoon, I think I'm to the point where I need a team consensus on this one. I've outlined above several tech options for article assignment, but to implement any of them requires decisions that I think are study-design decisions rather than just tech decisions, so I wanted to check with y'all. Here's what I think we need to decide:

  1. Do we indeed want to assign coders to temporally-clustered blocks of articles (e.g., a coder gets a bunch of articles from 1994) as we've discussed? I think this will improve coder comprehension, speed, and life satisfaction, by providing more easily-followed historical context; but on the other hand it will introduce some bias since coders will see certain time periods now when they're less good at coding and will see others later when they're better at coding.
    1. If we do want to assign them to temporal blocks, do we want to assign them all to the same block at the same time, or to different blocks?
      1. If to different blocks, how do we split them? (Bearing in mind that it sounds like we want a fair amount of overlap between coders so we can check their work against each other.)
  2. Do we indeed want to sort coders' queues by date? I think the same issues apply here as apply to assigning them to temporal blocks, but sorting will affect things within blocks rather than across blocks.
  3. What's our plan for handling dropouts? The assignment mechanism will guarantee even and random coverage by all coders who are specified at the time of assignment, but of course if they drop out afterward then we'll have a bunch of articles with non-random partial coverage, and future assignments of course won't include the dropouts. This probably affects how many articles we want to assign at once: assigning a lot would be nice since it guarantees more even coverage across the data, but on the other hand it worsens the problem of within-assignment gaps due to dropouts.
  4. How careful do we need to be about the fact that we're planning to start with triple-coverage and later move to double- or single-coverage of articles? Do we need to plan this into a randomization strategy now?

I realize(/hope?) that the answer to all of these might be "it doesn't matter since we're going to review everything in pass 2 anyway", but I thought it was worth checking since we need to make a call on all of these points before I can start getting articles loaded for the team by Friday.

Of course please let me know if any of that isn't clear...

olderwoman commented 4 years ago

These are hard design issues. Coders WILL GRADUATE and QUIT. We WILL hire new coders. So as I think about it, perfect balancing is impossible in the longer run. Randomly assigning articles to coders eliminates the correlation between coder and event era. Assigning by date builds in a correlation between coder and era. There is just no way around that. We can have one or the other but not both.

The advantage to random is that we are not dependent on one coder’s definition of an event, we are more likely to find “all” events. The disadvantage of random is that it is easier to code events in sequence so you know what was going on.

Besides date we also have newspaper; there are 20 different publications with different numbers of articles. I can print out the table from the data file I have, I think. If so, I’ll send that around.


davidskalinder commented 4 years ago

Oh, and incidentally, I have not considered here the ability to weight coders' assignment probabilities according to how well they've been coding. I'm leaning toward thinking that this is a bad idea anyway; but if we definitely want this then please let me know how we want to do it.

davidskalinder commented 4 years ago

Besides date we also have newspaper; there are 20 different publications with different numbers of articles. I can print out the table from the data file I have, I think. If so, I’ll send that around.

Do we need to consider these when assigning articles? It seems to me like any unwanted within-newspaper correlation would be taken care of by whatever coder-article assignment strategy we use?

davidskalinder commented 4 years ago

So, @olderwoman / @matthewsmorgan / @limchaeyoon, I just talked to @alexhanna about all this and, as usual when I talk to her, learned a great deal. She said that with her team of coders, they have basically abandoned almost any attempt to randomize article assignment and instead organize almost everything to strongly favor narrative continuity for the coders. They have a bunch of pretty distinct campus newspapers and they want two coders on each article, so they will just assign all the articles from a particular newspaper to two coders and then have the coders do them in strict chronological order.

This makes me think that maybe I'm being far too precious about worrying about any kind of randomization, and that the bias introduced by nonrandom coder assignment can be overcome in pass 2 (where it might, in theory, bias the workload of pass 2 coders but perhaps not the quality of the final data too badly). Technically speaking, this would be terrific, because it means we can just sort the article IDs in a spreadsheet and assign them to coders manually in about 15 minutes.

Alex did mention a few things about their project that make the tradeoff between narrative continuity and intercoder bias especially strong: the newspapers have a sharply consistent local focus, and the student journalists aren't the best at providing context. She picked an article at random whose lede simply said that busloads of protesters had arrived and secured a perimeter; with articles like that, coders would have no clue what's going on unless they read what the paper had written in previous days. I don't know if we're going to have quite such strong context effects as that, but it is encouraging to hear that the Toronto team have felt like chronological ordering helps resolve this problem and haven't felt crippled by the loss of randomization.

So, if we followed the Toronto team's approach as closely as possible, I think we'd do something like:

I'm not sure, though, that there's enough within-newspaper continuity in our case for this approach to make much sense?

Other designs I can think of that maximize chronological contiguity:

  1. Divide all our articles into fourish large groups by date (something like 1991-1999, 2000-2005, 2006-2010, and 2011-2015); then assign three coders to every article in the first group, three coders to the second, and so on. I think that's a little clunky since the article groups would be quite large (~1500 articles apiece), and there's no elegant way I can think of to handle issues like dropouts...
  2. Same as above but with much smaller date blocks: so, just split like one year into quarters, assign three coders to the first quarter, three to the second, and so on. Coders would experience a time jump of a year or so when they reach the end of one assignment block and move to the next.
  3. A rolling-overlap design where each coder gets a chronologically contiguous block that overlaps another's (see the sketch after this list). We'd need to decide how big to make the blocks: big blocks could span the whole period more quickly, but too big and it would take a while before we get intercoder overlap.
  4. A version where assignment and overlap are randomized but within as tight a range as possible (165 articles, I think): coders could get gaps between each article, but the gaps wouldn't be too long (about 3.5 days on average, I think).
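For option 3, the rolling overlap might look something like this sketch (the block size and step are purely illustrative and would need tuning):

    def rolling_overlap_blocks(articles_by_date, coders, block_size, step):
        """Give coder i the contiguous block starting at i * step, so
        consecutive coders share (block_size - step) articles. Sketch
        only, with illustrative parameters."""
        return {
            coder: articles_by_date[i * step : i * step + block_size]
            for i, coder in enumerate(coders)
        }

    # Adjacent coders overlap on 100 articles apiece:
    blocks = rolling_overlap_blocks(list(range(6000)), ['a', 'b', 'c'], 300, 200)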

I think making a choice depends, though, on several questions that I don't know the answers to:

Sorry to bombard everybody with questions, but these design decisions are ones I don't think I can resolve myself, and until they are resolved I can't move forward with getting something implemented by Friday. I can work on the date-sorting functionality in the meantime, but otherwise I'll continue to stand by on this until we decide which way we want to go with this stuff...

olderwoman commented 4 years ago

I'm at home and wasn't planning to come into the office again until Friday afternoon, but it sounds like a video chat on this topic needs to happen. I added Morgan and Chaeyoon to the thread because I'm not sure how the mentions are working via GitHub, so you may get this twice. Last night I sent the spreadsheet that shows the newspaper/year breakout. It is also on gdelt in Black Protest Data/ original downloads / black newspapers/ blacknews_articles_protest.xlsx

There are a few newspapers with a LOT more articles than others, and some years with a lot more articles than others.


davidskalinder commented 4 years ago

See emails for meeting scheduling.

davidskalinder commented 4 years ago

Notes from meeting about this (copied from running meeting notes doc):

davidskalinder commented 4 years ago

Okay, I believe I have now assigned these in the production version as @matthewsmorgan's list specified.

Things look good to me, but @olderwoman and/or @matthewsmorgan, it might be a good idea to have a quick check of a few items in each assignment list (you should be able to do this pretty easily in the admin page of the production version) to make sure things are assigned as expected. (I'll email separately a link to the production version!)

davidskalinder commented 4 years ago

Closing this for now; feel free to reopen if we notice something wrong with the first assignment (issues with future assignments should probably get a new ticket).