Closed davidskalinder closed 4 years ago
The more I look at this, the more I think I should leave the current setup as it is. There's a lot of code in MAI to handle this (a whole library, in fact! Not that that's all that unusual in general, but it is for MAI), and it looks like it should be working the same as it ever did.
So I think a better plan will be to review the code to make sure I understand what it's doing and then to properly enable sorting queues by date: first by fixing #39, then by sticking an ORDER BY line into whichever query is actually inserting the queue items...
All right, I think I understand how everything works now. Here's the backbone of the code, with comments:
bins = assign_lib.createBins(user_ids, group_size)
This creates all possible combinations ("bins") of size `group_size` chosen from the group of selected `user_ids`. The order of the bins is randomized.
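Under the hood this is presumably just a shuffled list of combinations; a minimal sketch (the function name below is my own stand-in for `assign_lib.createBins`, not the actual source):

```python
import itertools
import random

def create_bins(user_ids, group_size):
    """All size-`group_size` combinations of coders, in random order."""
    bins = list(itertools.combinations(user_ids, group_size))
    random.shuffle(bins)
    return bins

# 12 coders with triple coverage -> C(12, 3) = 220 bins
bins = create_bins(range(12), 3)
print(len(bins))  # -> 220
```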
num_sample = assign_lib.generateSampleNumberForBins(num, len(user_ids), group_size)
This finds the smallest number of articles necessary to evenly cover all the bins, given how many articles are assigned to each coder. For example, 12 coders with triple coverage means 220 bins; covering 220 bins evenly requires at least 55 articles per coder. So if the admin user requests between 0 and 54 articles per user, num_sample will be 0 and nothing will be assigned; if the admin requests 55-109, num_sample will be 220 and 220 articles will be assigned; if the admin requests 110-164, it'll be 440; and so on.
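The arithmetic above can be sketched as follows (my own reconstruction of the described logic, not the actual `generateSampleNumberForBins` source):

```python
from math import comb

def generate_sample_number_for_bins(num, num_users, group_size):
    """Largest multiple of the bin count that keeps each coder's
    workload at or below `num` articles (sketch of the described logic)."""
    num_bins = comb(num_users, group_size)
    # Each article goes to `group_size` coders, so one full cycle over
    # all bins costs num_bins * group_size / num_users articles per coder
    # (55 per coder in the 12-coder, triple-coverage example).
    per_coder_per_cycle = num_bins * group_size / num_users
    full_cycles = int(num // per_coder_per_cycle)
    return full_cycles * num_bins

print(generate_sample_number_for_bins(54, 12, 3))   # -> 0
print(generate_sample_number_for_bins(55, 12, 3))   # -> 220
print(generate_sample_number_for_bins(110, 12, 3))  # -> 440
```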
if db_name:
    articles = assign_lib.generateSample(num_sample, db_name, pass_number = 'ec')
This is the one we'll probably always use (by specifying which article database we want). This randomly samples `num_sample` articles from all the articles that aren't in anybody's queue yet and returns those article IDs in a random order.
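As a pure-Python sketch of that sampling step (the real `generateSample` does this with a database query; the names here are my own):

```python
import random

def generate_sample(num_sample, all_article_ids, assigned_ids):
    """Sample `num_sample` articles not yet in anyone's queue and
    return them in random order (assumed behavior, per the text above)."""
    assigned = set(assigned_ids)
    pool = [a for a in all_article_ids if a not in assigned]
    return random.sample(pool, num_sample)

picked = generate_sample(5, range(100), {1, 2, 3})
print(len(picked))  # -> 5
```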
else:
    articles = assign_lib.generateSample(num, None, 'ec', pub)
I doubt we'll use this one, but this gets articles when the publication name is specified. NB that this list gets sorted by (MAI) ID for some reason.
assign_lib.assignmentToBin(articles, bins, pass_number = 'ec')
This takes the first article sampled above, assigns it to everybody in the first bin, then moves on to the second article and the second bin and does the same, then to the third, and so on. The articles will come out even because `generateSampleNumberForBins` always returns a number that evenly covers all possible bins.
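The round-robin pairing could look roughly like this (a sketch with my own names, not the actual `assignmentToBin` source):

```python
def assignment_to_bin(articles, bins):
    """Pair article i with bin i (cycling through bins), so every coder
    in that bin gets the article. Because every coder appears in the
    same number of bins, workloads come out even."""
    assignments = {}  # user_id -> list of article ids
    for i, article in enumerate(articles):
        for user in bins[i % len(bins)]:
            assignments.setdefault(user, []).append(article)
    return assignments

# 4 coders, double coverage -> 6 bins; 6 articles -> 3 per coder
bins = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
result = assignment_to_bin([100, 101, 102, 103, 104, 105], bins)
print(result[1])  # -> [100, 101, 102]
```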
So, the big issue here tech-wise is that none of the existing code can handle limiting the pool of articles to choose from: it will always sample evenly from all unassigned articles. So if we want coders to get temporally clustered queues in order to gain expertise, we need to do something new. That leaves us with several implementation options:
Option 1i is not achievable this week, so that leaves options 1ii and 2. I think the time needed for each of these is roughly the same, so I'm inclined to do 1ii since some of that will be useful going forward, whereas 2 is a strictly temporary fix. @olderwoman, do you have a strong opinion? Note that I think the whole team needs to consider some larger study-design implications beyond these strictly technical points, so I'm going to make another comment for those and mention everybody.
I agree this is a tough one and it is hard to figure out all the pluses and minuses of the different options.
Okay, so @olderwoman, @matthewsmorgan, and @limchaeyoon, I think I'm to the point where I need a team consensus on this one. I've outlined above several tech options for article assignment, but to implement any of them requires decisions that I think are study-design decisions rather than just tech decisions, so I wanted to check with y'all. Here's what I think we need to decide:
I realize(/hope?) that the answer to all of these might be "it doesn't matter since we're going to review everything in pass 2 anyway", but I thought it was worth checking since we need to make a call on all of these points before I can start getting articles loaded for the team by Friday.
Of course please let me know if any of that isn't clear...
These are hard design issues. Coders WILL GRADUATE and QUIT. We WILL hire new coders. So as I think about it, perfect balancing is impossible in the longer run. Randomly assigning articles to coders eliminates the correlation between coder and event era. Assigning by date builds in a correlation between coder and era. There is just no way around that. We can have one or the other but not both.
The advantage to random is that we are not dependent on one coder’s definition of an event, we are more likely to find “all” events. The disadvantage of random is that it is easier to code events in sequence so you know what was going on.
Besides date we also have newspaper; there are 20 different publications with different numbers of articles. I can print out the table from the data file I have, I think. If so, I’ll send that around.
Oh, and incidentally, I have not considered here the ability to weight coders' assignment probabilities according to how well they've been coding. I'm leaning toward thinking that this is a bad idea anyway; but if we definitely want this then please let me know how we want to do it.
> Besides date we also have newspaper; there are 20 different publications with different numbers of articles. I can print out the table from the data file I have, I think. If so, I’ll send that around.
Do we need to consider these when assigning articles? It seems to me like any unwanted within-newspaper correlation would be taken care of by whatever coder-article assignment strategy we use?
So, @olderwoman / @matthewsmorgan / @limchaeyoon, I just talked to @alexhanna about all this and, as usual when I talk to her, learned a great deal. She said that with her team of coders, they have basically abandoned almost any attempt to randomize article assignment and instead organize almost everything to strongly favor narrative continuity for the coders. They have a bunch of pretty distinct campus newspapers and they want two coders on each article, so they will just assign all the articles from a particular newspaper to two coders and then have the coders do them in strict chronological order.
This makes me think that maybe I'm being far too precious about worrying about any kind of randomization, and that the bias introduced by nonrandom coder assignment can be overcome in pass 2 (where it might, in theory, bias the workload of pass 2 coders but perhaps not the quality of the final data too badly). Technically speaking, this would be terrific, because it means we can just sort the article IDs in a spreadsheet and assign them to coders manually in about 15 minutes.
Alex did mention a few things about their project that make the tradeoff between narrative continuity and intercoder bias especially strong: the newspapers have a focus that's sharply locally consistent, and the student journalists aren't the best at providing context. She picked an article at random whose lede simply said that busloads of protesters had arrived and secured a perimeter; with articles like that, coders would have no clue what's going on unless they read what the paper had written in previous days. I don't know if we're going to have quite such strong context effects as that, but it is encouraging to hear that the Toronto team have felt like chronological ordering helps resolve this problem and haven't felt crippled by the loss of randomization.
So, if we followed the Toronto team's approach as closely as possible, I think we'd do something like:
I'm not sure, though, that there's enough within-newspaper continuity in our case for this approach to make much sense?
Other designs I can think of that maximize chronological contiguity:
I think making a choice depends on several questions though that I don't know the answers to:
Sorry to bombard everybody with questions, but these design decisions are ones I don't think I can resolve myself, and until they are resolved I can't move forward with getting something implemented by Friday. I can work on the date-sorting functionality in the meantime, but otherwise I'll continue to stand by on this until we decide which way we want to go with this stuff...
I’m at home and wasn’t planning to come into the office again until Friday afternoon, but it sounds like a video chat on this topic needs to happen. I added Morgan and Chaeyoon to the thread because I’m not sure how the mentions are working via github, so you may get this twice. Last night I sent the spreadsheet that shows the newspaper/year breakout. It is also on gdelt in Black Protest Data/ original downloads / black newspapers/ blacknews_articles_protest.xlsx
There are a few newspapers with a LOT more articles than others, and some years with a lot more articles than others.
See emails for meeting scheduling.
Notes from meeting about this (copied from running meeting notes doc):
Okay, I believe I have now assigned these in the production version as @matthewsmorgan's list specified.
Things look good to me, but @olderwoman and/or @matthewsmorgan, it might be a good idea to have a quick check of a few items in each assignment list (you should be able to do this pretty easily in the admin page of the production version) to make sure things are assigned as expected. (I'll email separately a link to the production version!)
Closing this for now; feel free to reopen if we notice something wrong with the first assignment (issues with future assignments should probably get a new ticket).
Copying from emails:
@davidskalinder sez:
@olderwoman replies:
@davidskalinder replies: