WikiEducationFoundation / TopicContribs

Module for analyzing contributions to a topic on Wikipedia.

Troubleshooting #1

Open ragesoss opened 6 years ago

ragesoss commented 6 years ago

@kjschiroo I'm working on this for the fall_2016 and spring_2017 terms, but I don't know what to do on wmflabs in terms of the user_database instructions. I have a CSV file of student user names for a term, and I assume I can use that to make a user database on labs?

Any pointers for how to proceed would be much appreciated.

kjschiroo commented 6 years ago

Let me restate your problem to confirm I understand it. You are trying to create a user database, and aren't sure how to do that on wmflabs, correct? I think this page should offer some helpful advice. Let me know if that wasn't your problem or if you need further help.

ragesoss commented 6 years ago

@kjschiroo aha! that's helpful. I had inferred that user_database was a database of users, rather than an arbitrary database owned by my account.

ragesoss commented 6 years ago

Okay... so I've logged in to a tool account, connected to the enwiki.labsdb database, created a new user database, saved the sql file with that user database, and then ran `source page_project_map.sql` from the mysql command line.

It appears to be running, but it's been about 20 minutes so far and it hasn't returned. Is that expected, and am I going in the right direction?

ragesoss commented 6 years ago

Okay, progress! Documenting this mainly for my own sake... Doing it from within mysql wasn't the right approach because it just printed the output to the terminal, but it did complete after quite a while. Trying it again as `mysql --defaults-file=$HOME/replica.my.cnf -h enwiki.labsdb < page_project_map.sql > page_project_map.csv`, although the output will be tab-separated rather than comma-separated... but I can fix that easily afterwards.
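
A rough sketch of that cleanup step (the filenames are placeholders, and it assumes no field contains embedded tabs):

```python
import csv

# Convert the tab-separated mysql batch output into a comma-separated file.
# Filenames are placeholders; the redirected output above is really TSV
# despite its .csv extension.
with open("page_project_map.tsv", newline="") as tsv_in, \
        open("page_project_map.csv", "w", newline="") as csv_out:
    writer = csv.writer(csv_out)
    writer.writerows(csv.reader(tsv_in, delimiter="\t"))
```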

ragesoss commented 6 years ago

I ran it yesterday; it fired up a bunch of threads and eventually said 'No more items to process' for most of them, but it seemed to hang at that point after getting back down to one process that stopped using CPU, and even overnight it never exited or produced output. I killed the process and I'm trying it again.

ragesoss commented 6 years ago

@kjschiroo I've tried it twice now, and it hangs after all the Mapper threads finish with 'no more items to process'. It's been running overnight after reaching that state yesterday, but it still hasn't exited or produced any output. Any ideas for what's wrong?

kjschiroo commented 6 years ago

What are the details of the machine you are running it on? My first guess would be that it ran out of memory and the kernel silently killed one of the processes (I know this is one of Linux's nasty habits). The parent then just waits forever for a job that will never finish.
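
One generic way to make that failure visible (a sketch, not TopicContribs code) is to check each worker's exit code instead of blocking on an unconditional join:

```python
import multiprocessing as mp

def work(n):
    # Stand-in for a dump-processing job.
    return sum(range(n))

if __name__ == "__main__":
    workers = [mp.Process(target=work, args=(10**7,)) for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join(timeout=600)
        if w.exitcode is None:
            print("worker still running (possible hang)")
        elif w.exitcode < 0:
            # A negative exitcode means the worker died from a signal;
            # -9 (SIGKILL) is what the OOM killer sends.
            print(f"worker killed by signal {-w.exitcode} (OOM killer?)")
```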

We're currently using mwxml.map for processing the dumps; IIRC it will use all available CPU cores by default, so if there are a lot of cores but not much memory there can be a problem. When I was running this I believe the machine had somewhere around 100 GB of RAM.
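
For context, the usage pattern looks roughly like this (a simplified sketch based on mwxml's documented map interface, not the actual TopicContribs code; file paths are placeholders):

```python
import mwxml

paths = ["enwiki-stub-meta-history1.xml.gz", "enwiki-stub-meta-history2.xml.gz"]

def process_dump(dump, path):
    # Each worker handles one dump file and yields whatever we aggregate.
    for page in dump:
        for revision in page:
            yield page.id, revision.id

# mwxml.map fans the dump files out across worker processes (all cores by
# default, IIRC); PR #2 in this repo is meant to add a way to cap the
# worker count, e.g. to 2.
for page_id, rev_id in mwxml.map(process_dump, paths):
    pass  # aggregate results here
```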

Also, my apologies for the late response. If it happens again just hit me with a @kjschiroo to grab my attention.

ragesoss commented 6 years ago

@kjschiroo cool. My machine has a meager 16GB of RAM (and 8 threads). Maybe I should do this on wmflabs instead of locally.

kjschiroo commented 6 years ago

Yeah, 16GB on 8 threads is going to have at least one of them die. wmflabs might be an option; otherwise you could spin up a beefy machine on AWS or Google Cloud Platform for a day for a reasonable price. Alternatively, take a look at pull request #2. It should let you set the number of threads being used. Set it down to like 2, keep an eye on your memory, and let it run for longer. I haven't been able to test it yet though, since I don't have any of the files on hand.

It is honestly one of the things that most bothers me about Unix-based systems: they think you can just kill a process without making it die loudly.

ragesoss commented 6 years ago

Sweet! Giving it a try with 2 threads.

ragesoss commented 6 years ago

@kjschiroo that worked! How do I interpret the results? Is this bytes added by everyone to the topics (general) and bytes added by the input cohort? So I'd just combine those into one dataset to graph students vs. all of Wikipedia?

ragesoss commented 6 years ago

Without any further adjustments beyond running the script with the fall 2016 users, it looks like the numbers are a lot lower than the ~6% during peak period that you found at https://meta.wikimedia.org/wiki/Research:Wiki_Ed_student_editor_contributions_to_sciences_on_Wikipedia

[attached chart: portion fall 2016]

It's more like 1.5% for the most active period.

I'll run it for spring 2016 to make sure I'm getting similar results for that term.

kjschiroo commented 6 years ago

Those look like the figures that I would have expected for overall contribution rate. What topics did you narrow it down to?

ragesoss commented 6 years ago

@kjschiroo I used the same science_projects.csv from the sample inputs.

kjschiroo commented 6 years ago

Hmm... that's interesting. Wiki Ed hasn't reduced its focus on the sciences that much since the Year of Science, has it? Although a 4-fold increase when you were really pushing towards that wouldn't be that weird. What does the plot of general contributions to the area look like? Let me know what the spring 2016 results are; if they are consistent then I'd guess that it is real, if not we have some investigating to do.

kjschiroo commented 6 years ago

Wait, the Year of Science should have still been going on then... that is weird.

kjschiroo commented 6 years ago

I remember there was a push towards labeling all of the articles with a project, which is how they are identified. Did that happen for the fall?

ragesoss commented 6 years ago

No, it didn't happen for the fall. I wonder how many were labeled in that way for spring 2016. I'll ask the team.

kjschiroo commented 6 years ago

I remember Ian and Adam making a concerted effort to get them labeled, although I don't know what portion they needed to label. I could see how that would bring down the figures significantly, though, since it would omit many new articles that would look really good by this metric.

ragesoss commented 6 years ago

@kjschiroo I must have something wrong with the filtering by project, because I get something very similar for spring 2016.

[attached chart: spring_2016]

ragesoss commented 6 years ago

@kjschiroo With some print debugging, I see that it's putting about a million pages (1,024,116) into the 'pages of interest' set. That would mean about 1 in 5 articles is in one of these science WikiProjects... seems a bit high, but maybe that's right?

kjschiroo commented 6 years ago

What does the total bytes added sum to for spring 2016? One potential issue I'm seeing here is that it only takes a couple of people getting aggressive with their project labels to really change things, and those changes end up being applied retroactively since there is no timestamp associated with them. What is the distribution of articles by project? Are there a couple of projects that decided to go on a labeling spree?
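
A rough sketch of that check (the column layout of page_project_map.csv is an assumption; adjust it to the real header):

```python
import csv
from collections import Counter

# Count pages per WikiProject in page_project_map.csv. Assumes the project
# name is the second column.
counts = Counter()
with open("page_project_map.csv", newline="") as f:
    rows = csv.reader(f)
    next(rows, None)  # skip a header row if there is one
    for row in rows:
        if len(row) >= 2:
            counts[row[1]] += 1

for project, n in counts.most_common(20):
    print(f"{n:>8}  {project}")
```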

Also, could you attach the results file?

I'm curious now. This could be an interesting thing about Wikipedia. It could be that when we were analyzing the new articles immediately after they were written, we ended up getting a biased result because of our labeling efforts. If most new content is being added to new articles and those new articles take a while to get labeled, then there could have been a bunch of work happening that we couldn't count as relevant to our goal because it hadn't gotten a label yet. However, after a few months go by, those articles slowly get project labels applied to them, and then they end up counting. That's just a theory.

Let me go take a look at my labs account. I might be able to find the old project-page list that I used.

ragesoss commented 6 years ago

@kjschiroo My results files...

fall and spring 2016 results.zip

kjschiroo commented 6 years ago

Something is going on here and I'm not sure what. I'm doing some basic sanity checks right now. Validating against your dashboard: in Spring 2016 there were 3.73 million words added in total. According to this data set, in Spring 2016 207,886,463 bytes were added in science alone; IIRC there are about 5 bytes per word, so roughly 41 million words.
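
Spelled out (numbers copied from above; 5 bytes per word is only a rule of thumb):

```python
science_bytes = 207_886_463   # bytes added in science, Spring 2016, per this data set
bytes_per_word = 5            # rough rule of thumb
dashboard_words = 3_730_000   # total words added per the dashboard

implied_words = science_bytes / bytes_per_word
print(f"{implied_words / 1e6:.1f} million words")              # ~41.6 million
print(f"~{implied_words / dashboard_words:.0f}x the dashboard total")
```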

You don't happen to have multiple copies of dump files sitting around do you?

I've got to run right now, but I'll upload the files I've been referencing from spring 2016 later.

kjschiroo commented 6 years ago

Here are my files. wikied.zip

I've also included the pages that were labeled as science at the time. It is around 670,000. So 1,000,000 is higher than I'd expect, but not totally unbelievable.

ragesoss commented 6 years ago

@kjschiroo I have all the gz files, including both stub-meta-history and stub-meta-current. Maybe that is a problem? Will try without the -current ones.
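
The filter would just be something like this (a sketch; the directory and file-naming pattern are assumptions), so the -current stubs don't get counted on top of the -history ones:

```python
import glob

# Keep only the stub-meta-history dumps; the stub-meta-current files repeat
# the latest revision of every page and could be double-counted.
paths = sorted(
    p for p in glob.glob("dumps/enwiki-*.gz")
    if "stub-meta-history" in p
)
print(f"{len(paths)} dump files selected")
```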

ragesoss commented 6 years ago

Extra dumps don't appear to be the problem. I got the same output when I tried after deleting the -current dumps.

I'm now trying to use a modified version of this program to get the overall portion of content contributed by students... which I think I can do just by handling the case of no page maps by setting pages to None, in which case it should process all mainspace pages... if I am understanding it correctly.
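
Roughly what I mean (a hypothetical sketch; the names are illustrative, not the module's actual API):

```python
def is_page_of_interest(page, pages_of_interest=None):
    # Only count main-namespace articles.
    if page.namespace != 0:
        return False
    # With no page/project map supplied, count every mainspace page;
    # otherwise restrict to the mapped "pages of interest".
    if pages_of_interest is None:
        return True
    return page.id in pages_of_interest
```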

kjschiroo commented 6 years ago

I'd be concerned that there is a deeper issue going on here. The total counts should reflect what we observe on the dashboard.

kjschiroo commented 6 years ago

@ragesoss I'm looking into this and am having trouble with the mysql connection timing out. Would you be able to save me a bit of trouble and post your page_project_map.csv?

ragesoss commented 6 years ago

@kjschiroo I can get that to you on Monday. Don't have access to the file this weekend.

ragesoss commented 6 years ago

@kjschiroo I shared a dropbox folder that now has the file.