Okcubot 2.0 - Githubissues

onbjerg commented 8 years ago

Rationale

The current bot is broken and does not get all of the information we want. Some features are missing, and some of them are crucial:

Profile text is not scraped (#5)
The bot does not automatically answer questions
Duplicates are saved in the dataset, meaning post-processing is needed (#4)
Multiple scrapers can not run at once without further post-processing (#3)

The codebase is also old. It uses an outdated version of Scrapy (#2) and it is not written optimally.

A way to fix all of this would be to create 2 scripts: a worker bot that scrapes OKCupid and a collector that aggregates all of the collected data and stores it in a file. The collector would also ensure no duplicates are added to the dataset, as well as a version key, in case the profile is updated.

Python has poor to no streaming support, while Node.js is built from the ground up on streaming philosophy. Node.js is trivial to read if one knows Javascript.

Additionally, using streams will make the modules less coupled. The bots do not care about the format of the dataset, they do not care about the means of transportation. The format is the only thing the collector cares about, and the transport protocol is up for the user to decide.

The installation process would also be painless, as one would only need to type npm install -g okcubot, whereafter the command okcubot is available globally.

Sidenote on updates The scraped data is sent to the collector along with a hash of the data. If the hash differs from the stored row (on duplicates), it will add it to the dataset and increment the version of that particular row.

For example: The user Sally has an ID of 1. It is scraped once on Wednesday, and again the following month. The collector sees that the hash differs, and the row is added (even though it is a duplicate) and increment the metadata column (meta-column?) for versions by 1. The dataset now looks like this(*):

ID        Name          Orientation              Version
1         Sally         Straight                 1
2         Bob           Straight                 1
1         Sally         Asexual                  2

Checklist

Rewrite

[ ] Use streams and Node.js
[ ] Scraper
[ ] Collector
[ ] Deduplication module
[ ] Formats (.csv, .tsv, .json)
New features
[ ] Scrape profile text (#5)
[ ] Scrape profile pictures (#6)
[ ] The bot answers questions in order to see a users response
[ ] Sync between scrapers - multiple scrapers will be painless to set up (allowing for faster scraping, #3)
Backwards-compatibility
[ ] Convert the old dataset to the new format
[ ] Scrape the missing information from the old dataset
Meta
[ ] Find a cooler name (maybe in the UNIX spirit?)
[ ] Determine what metadata columns should be present, e.g. version and scrape date (@Deleetdk)

onbjerg commented 8 years ago

Good idea/bad idea/needed? @Deleetdk

Deleetdk commented 8 years ago

For example: The user Sally has an ID of 1. It is scraped once on Wednesday, and again the following month. The collector sees that the hash differs, and the row is added (even though it is a duplicate) and increment the metadata column (meta-column?) for versions by 1. The dataset now looks like this(*):

A good idea is to add a time stamp for the last scrape for that user.

Find a cooler name (maybe in the UNIX spirit?)

Depends if you want to make it public. My worry is that 1) we could get into legal trouble (maybe), 2) if it becomes public, then OKCupid may add anti-crawler security preventing us from gathering data or making it tedious. If you don't want to make it public, then the name isn't so important.

What about handling of users that have deleted their profiles? What about users that reset their question/answers?

onbjerg commented 8 years ago

What about handling of users that have deleted their profiles? What about users that reset their question/answers?

We could periodically scrape the users we have in the dataset and check if they still exist.

onbjerg commented 8 years ago

By the way, we probably want this to be somewhat anonymous. We could use a hash of the username instead of the actual username.

Deleetdk commented 8 years ago

I don't see a particular reason for doing that. It makes it impossible to verify the data for 3rd parties and the information we have is public, so there is no privacy violation. We have merely aggregated it in a more useful format.

onbjerg commented 8 years ago

Right.

Deleetdk / OKCubot

Okcubot 2.0 #7

Rationale

Checklist

Rewrite

New features

Backwards-compatibility

Meta