Document this module and make it easier for others to re-run

ragesoss commented 3 years ago

This module hasn't been used to recreate the original analysis Kevin did in several years. Try to make it work, document any problems found, and add documentation and/or fixes to make it practical to do similar analyses on a regular basis.

tab1tha commented 3 years ago

I'm on it. How should I submit the documentation, and what format would you prefer? Also, could you please link me to Kevin's original analysis? I have not come across it yet.

ragesoss commented 3 years ago

I believe this is Kevin's analysis based on this module: https://meta.wikimedia.org/wiki/Research:Wiki_Ed_student_editor_contributions_to_sciences_on_Wikipedia

ragesoss commented 3 years ago

As for the format... I guess the best option would probably be to add a new markdown file with details on how to use it, along with inline comments for anything within the code that you think should be clarified.

tab1tha commented 3 years ago

okay. Thank you

tab1tha commented 3 years ago

[Help needed] The main issue I have been having for days now is that the command mwdumps --wiki=enwiki --verbose /home/tab1tha/Documents runs for hours but never completes. I have had to keep interrupting it with Ctrl+C and rerunning it so that a few more files are downloaded. Do I need all the files? What arguments can I use to select only the relevant ones?

I am planning to replicate Kevin's research using topic contribution data for the year 2020.

This is how far the command has gotten so far: https://pastebin.com/SFVSXKwp

ragesoss commented 3 years ago

Thanks for the update! Hmm... I suspect that Kevin may have done this from Toolforge, and even if he didn't, that's probably the best way around the problem you're facing. https://wikitech.wikimedia.org/wiki/Help:Cloud_Services_Introduction

I suggest going through the process to get Toolforge access and trying to do it from there, since using a server within the same cloud environment should make the dump downloads much faster and more reliable.

tab1tha commented 3 years ago

Thank you. I'll go through the guide, set it up and give it another try.

tab1tha commented 3 years ago

I have requested Toolforge access, and it says here (https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Quickstart) that I have to wait a week for it to be granted.

Are there any other related tasks that I can be working on until then?

ragesoss commented 3 years ago

Hopefully it will be less than a week, but here's another related analysis module you could look at: https://github.com/WikiEducationFoundation/academic_classification

Similarly to this one, it's from the work Kevin was doing several years ago and we'd love to be able to easily re-run similar analyses on more recent data, so documenting where the bottlenecks and problems are will be helpful.

tab1tha commented 3 years ago

Okay. I'm checking it out now

tab1tha commented 3 years ago

I have received Toolforge access, and it's taking me surprisingly long to understand how to use it. I apologize for my speed so far; I am doing my best to make a substantial contribution before the 30th.

ragesoss commented 3 years ago

Thanks @tab1tha! Sorry I couldn't provide a more clear-cut way to dive in.

tab1tha commented 3 years ago

I have a few questions. Do I need to create a Toolforge tool? Where do I run commands like mwdumps --wiki=enwiki --verbose /home/tab1tha/Documents? Is it in the Toolforge shell? I can't find the Toolforge shell. I have, however, managed to access the dumps from PAWS using the command ls /public/dumps/public

ragesoss commented 3 years ago

Yes, creating a tool might be the best way to go.

The 'toolforge' shell probably just means the terminal once you've logged on to toolforge via SSH. If you can get to the PAWS dumps, I think that means you're in the toolforge shell already.

tab1tha commented 3 years ago

Ohh. This is helpful. Thank you

tab1tha commented 3 years ago

[Update: help needed] This commit shows the work I have done so far. I am at the point where I cannot write the output files to a folder that I created. This might be because of some restrictions on the Toolforge platform. The error log is pasted here.

I receive Error [13], which says that I do not have file permissions, but when I check, it shows that I do have all the file permissions for that folder. Trying to use sudo with the command fails too, with this error.

ragesoss commented 3 years ago

@tab1tha it looks like out=/demo_results specifies an absolute path, rather than a path relative to your home directory or your tool's directory. Maybe that's why you're getting a permissions error?
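
A minimal sketch of the distinction, using only the standard library. The helper name `resolve_output_dir` is hypothetical and not part of TopicContribs; it only illustrates how a leading slash makes `/demo_results` resolve to the filesystem root, where a regular user usually cannot create directories, while a bare `demo_results` resolves under a chosen base directory.

```python
from pathlib import Path


def resolve_output_dir(out, base=None):
    """Resolve `out` against `base` (default: the user's home directory).

    Hypothetical helper for illustration; not part of TopicContribs.
    """
    base = Path(base) if base is not None else Path.home()
    # Joining with an absolute path discards `base`, so "/demo_results"
    # stays at the filesystem root, while "demo_results" lands under `base`.
    return base / out


if __name__ == "__main__":
    for candidate in ("/demo_results", "demo_results"):
        print(candidate, "->", resolve_output_dir(candidate))
```

If the second form prints a path inside your home or tool directory, passing that path as the output location should avoid the root-level permission problem, assuming the permission error comes from the path and not from something else.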

tab1tha commented 3 years ago

Using the path under my home directory, /home/tambetabitha/demo_results, yields this instead (https://pastebin.com/DhMSriMh). It now says that enwiki-20201001-stub-meta-history9.xml.gz is not a directory.

ragesoss commented 3 years ago

That seems like progress, perhaps. I don't know why it would be trying to treat that gzip file as a directory, though.

tab1tha commented 3 years ago

I have been trying to figure that out too. I'm looking at the code now.

tab1tha commented 3 years ago

I think it fails because the regex in topics.cmdline._get_files_to_work_on specifies that the filename must end with .xml. However, mine is still gzipped and ends in .gz.

```python
def _get_files_to_work_on(input_dir):
    raw_files = [join(input_dir, f) for f in listdir(input_dir) if isfile(join(input_dir, f))]
    dump_files = [f for f in raw_files if re.match('.*stub-meta-history(\d+).xml', f)]
    return dump_files
```

It is therefore necessary to unzip the file before passing it as a command line argument. Alternatively, we could adjust the regex in the topics.cmdline module so that it accepts both zipped and unzipped files and, when a file is zipped, unzips it using gzip.
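
One way the second option could look, as a sketch and not the module's actual code: the pattern below accepts an optional `.gz` suffix and is anchored at the end of the file name. Note that `re.match` anchors only at the start of the string, so a pattern without a trailing `$` also matches names that carry extra suffixes; anchoring explicitly makes the accepted names unambiguous. `DUMP_FILE_PATTERN` and `get_dump_files` are hypothetical names chosen for illustration.

```python
import re
from os import listdir
from os.path import isfile, join

# Matches e.g. "enwiki-20201001-stub-meta-history9.xml" and the same name
# with a ".gz" suffix. The dots are escaped and "$" anchors the end.
DUMP_FILE_PATTERN = re.compile(r".*stub-meta-history(\d+)\.xml(\.gz)?$")


def get_dump_files(input_dir):
    """Return paths of plain or gzipped stub-meta-history dumps in input_dir."""
    paths = (join(input_dir, name) for name in sorted(listdir(input_dir)))
    return [p for p in paths if isfile(p) and DUMP_FILE_PATTERN.match(p)]
```

This only selects the files; whatever reads them afterwards would still need to handle the gzipped ones.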

tab1tha commented 3 years ago

In the meantime, considering that the commit of Demonstration.md is part of Pull request 5, I have changed the pull request name to a more appropriate one.

Also, am I on track with respect to the content and format of Demonstration.md so far? Is there something else that you expected or would want me to add?

tab1tha commented 3 years ago

To enable handling of .gz files, I have considered adding a try-except clause to the topics.cmdline._get_files_to_work_on function as such:

```python
def _get_files_to_work_on(input_dir):
    raw_files = [join(input_dir, f) for f in listdir(input_dir) if isfile(join(input_dir, f))]
    try:
        dump_files = [f for f in raw_files if re.match('.*stub-meta-history(\d+).xml', f)]
    except expression as identifier:
        dump_files = [gzip.open(f) for f in raw_files if re.match('.*stub-meta-history(\d+).xml.*', f)]
    return dump_files
```

Is this okay? What would you prefer?
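
A list comprehension that finds no matching names returns an empty list and does not raise, so the `except` branch of a try-except wrapped around it would likely never run. A sketch of an alternative that needs no exception handling is to pick the opener from the file extension. `open_dump` is a hypothetical helper, and how TopicContribs actually reads each file is not shown in this thread, so wiring it in is an assumption.

```python
import gzip


def open_dump(path):
    """Open a dump file for text reading, decompressing .gz transparently.

    Hypothetical helper for illustration; not part of TopicContribs.
    """
    if path.endswith(".gz"):
        # "rt" makes gzip.open return decoded text, like the built-in open().
        return gzip.open(path, mode="rt", encoding="utf-8")
    return open(path, mode="r", encoding="utf-8")
```

It can be used as a context manager in either case, for example `with open_dump(path) as f:` followed by a loop over the lines of `f`.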

ragesoss commented 3 years ago

Anything that works is fine with me! I don't have much of a sense for what is the most Pythonic way to do things, so use your best judgment.

tab1tha commented 3 years ago

Okay. I'm on it !

WikiEducationFoundation / TopicContribs

Document this module and make it easier for others to re-run #3