colonelpanic8 / okcupyd

A library that enables programmatic interaction with okcupid.com, using okcupid.com's private JSON API and HTML scraping when necessary.
MIT License

Profile.contacted, etc not working #76

Open kfred opened 8 years ago

kfred commented 8 years ago

I'm loving this library, despite most features not working. But the search alone, as limited as it is, is pretty awesome.

One thing I would like to point out is the broken profile.contacted feature. I'll admit that I have not looked much at all into your code base, but assume that feature was using HTML to retrieve that assignment (prior to JSON API, I assume). However, it appears JSON now returns a last_contacted value, a preferable substitute.

The reason why this feature is so important is that I can easily implement some auto-message logic from either the essay or question answers, and I don't want to re-message girls I've already messaged.

I really am lost as to where to start to help fix this feature. But I'll volunteer to work on it (with some direction first, if you wouldn't mind sharing).

Even just being able to access the JSON object returned from search would allow me to create a workaround, using the last_contacted value.

Thanks again and awesome work on this library.

colonelpanic8 commented 8 years ago

Yeah, as you've noticed, the UI rehaul the site has recently undergone has left okcupyd in a pretty broken state.

It seems like you are talking about populating a JSON value returned from search. This is fine, a good idea even, but you should realize that the canonical way to retrieve that value for a profile is not from search results but from the user's profile page. Its value is defined here: https://github.com/IvanMalison/okcupyd/blob/345e9cc72bab800b201d10b447324ea193140cef/okcupyd/profile.py#L169

I suspect that we need to change the way we fetch that value from the HTML, which should not be particularly hard.

Returning to the idea of optionally populating the value from search: in the past, okcupyd has optionally populated certain values from search to avoid loading profile pages to get basic data. We could populate last_contacted in this way (the place to do it would be https://github.com/IvanMalison/okcupyd/blob/345e9cc72bab800b201d10b447324ea193140cef/okcupyd/json_search.py#L133, although right now no such values are populated there).

It would probably be as simple as something like

yield Profile(self._session, profile_info["username"],
              contacted=bool(profile_info["last_contacted"]))
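
A minimal sketch of the `bool(...)` conversion above, for illustration only. `contacted_from_search` is a hypothetical helper, not part of okcupyd, and the entries are made up. It also assumes (unverified) that `last_contacted` is missing, null, or zero for users who have never been messaged.

```python
def contacted_from_search(profile_info):
    """Return True if a search entry suggests the user was already contacted.

    Assumption (not confirmed by the thread): `last_contacted` is absent,
    None, 0, or empty for users who have never been messaged.
    """
    # .get() avoids a KeyError when the key is missing from the entry.
    return bool(profile_info.get("last_contacted"))


# Made-up entries shaped like the search JSON discussed above.
entries = [
    {"username": "alice", "last_contacted": 1457000000},
    {"username": "bob", "last_contacted": 0},
    {"username": "carol"},
]
already_messaged = [e["username"] for e in entries if contacted_from_search(e)]
print(already_messaged)
```

Using `.get()` rather than indexing is a deliberate choice here: it keeps the search loop from failing if the site ever drops the key from some entries.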
colonelpanic8 commented 8 years ago

> Even just being able to access the JSON object returned from search would allow me to create a workaround, using the last_contacted value.

Yeah if you wanted to add a feature that added the search metadata to the profile, I wouldn't necessarily object to that. That would be as simple as

yield Profile(self._session, profile_info["username"], search_data=profile_info)
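
To show how a caller might use the attached metadata, here is a self-contained sketch. `SearchProfile` is a hypothetical stand-in for okcupyd's `Profile` (whose real constructor may differ). The `search_data` keyword is the proposed addition, not an existing parameter, and the result entries are made up.

```python
class SearchProfile:
    """Hypothetical stand-in for okcupyd's Profile; illustration only."""

    def __init__(self, session, username, search_data=None):
        self._session = session
        self.username = username
        # Raw dict for this user from the search response, if there is one.
        self.search_data = search_data or {}

    @property
    def last_contacted(self):
        # None when the profile did not come from search or the key is absent.
        return self.search_data.get("last_contacted")


def profiles_from_search(session, results):
    """Yield one SearchProfile per entry in a (made-up) search result list."""
    for profile_info in results:
        yield SearchProfile(session, profile_info["username"],
                            search_data=profile_info)


results = [
    {"username": "alice", "last_contacted": 1457000000},
    {"username": "bob"},
]
not_yet_messaged = [p.username for p in profiles_from_search(None, results)
                    if not p.last_contacted]
print(not_yet_messaged)
```

Keeping the raw dict on the object means a workaround like the `last_contacted` check needs no further library changes when other search fields become useful.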
kfred commented 8 years ago

Cool. Thanks for the background. I will do some more research into the areas you have highlighted in my nightly downtime. And will focus on updating the profile.contacted code. I might stick with the HTML variant for now, but am interested in learning how to achieve it both ways.

I definitely agree on using HTML to fetch much of this data, particularly essay data, which hopefully should be straightforward to update selector references as they change. However, some of these popover HTML data points, from my past experience using Web scraping mechanisms on OKC, change more rapidly and sometimes in tricky ways.

That's why I'm fond of getting as much as possible from the JSON search results.

Once I get this profile.contacted boolean sorted out, I might also give a go at a separate feature that returns the last contacted date--which I believe has merit in addition to the boolean. I can, at least, think of ways it can further assist in downstream auto-messaging logic.

Also, full disclosure--I'm a github n00b. I am a developer-hobbyist and generally work lone wolf. So I am still learning the get/pull/etc lingo, concepts and git-style collaborative workflow.

colonelpanic8 commented 8 years ago

> Cool. Thanks for the background. I will do some more research into the areas you have highlighted in my nightly downtime. And will focus on updating the profile.contacted code. I might stick with the HTML variant for now, but am interested in learning how to achieve it both ways.

Yeah, this is the better way to do it, since it also works if, for instance, you obtain the profile by looking through an incoming message...

> I definitely agree on using HTML to fetch much of this data, particularly essay data, which hopefully should be straightforward to update selector references as they change. However, some of these popover HTML data points, from my past experience using Web scraping mechanisms on OKC, change more rapidly and sometimes in tricky ways.

In my experience they change about biannually, so we should be able to get a decent bit of mileage out of updating the selectors that we're using.

> That's why I'm fond of getting as much as possible from the JSON search results.

This is also subject to change though, and again, only works for search results.

> Once I get this profile.contacted boolean sorted out, I might also give a go at a separate feature that returns the last contacted date--which I believe has merit in addition to the boolean. I can, at least, think of ways it can further assist in downstream auto-messaging logic.

> Also, full disclosure--I'm a github n00b. I am a developer-hobbyist and generally work lone wolf. So I am still learning the get/pull/etc lingo, concepts and git-style collaborative workflow.

Hah, no worries. Don't pull; fetch and merge instead. This seems like a decent explanation of why: http://longair.net/blog/2009/04/16/git-fetch-and-merge/

kfred commented 8 years ago

I just submitted a pull request that fixes the following profile features:

profile.contacted, profile.age, profile.liked, profile.match_percentage, profile.enemy_percentage

I encountered an issue with profile.location. Not sure what's going on with the page layout, but the selector is not being recognized, despite its similarity to profile.age (which I was able to fix).

Edit: I looked further and am able to capture the text from the location span. The issue appears to be related to other okcupyd dependencies of profile.location

kfred commented 8 years ago

The new way OKC shows last contacted date on the profile will require some normalization:

If the last message occurred less than 7 days ago, only the day of the week is shown.

If the last message occurred in the current year, the year is omitted.
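
One way the normalization could look, as a rough sketch. The exact strings OKC renders are not given in this thread, so the three input shapes below ("Monday", "Feb 29", "Dec 25, 2015") are assumptions. `normalize_last_contacted` is a hypothetical name, and month abbreviations are parsed with `%b`, which assumes an English locale.

```python
from datetime import date, datetime, timedelta

# Index matches date.weekday(): Monday is 0, Sunday is 6.
WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def normalize_last_contacted(text, today):
    """Turn an abbreviated 'last contacted' label into a full date.

    Assumed (hypothetical) input shapes:
      "Monday"        -> less than 7 days ago, weekday only
      "Feb 29"        -> current year, year omitted
      "Dec 25, 2015"  -> earlier year, full date
    """
    text = text.strip()
    if text.lower() in WEEKDAYS:
        # 0 to 6 days back; today's own weekday name is treated as today.
        days_back = (today.weekday() - WEEKDAYS.index(text.lower())) % 7
        return today - timedelta(days=days_back)
    if "," in text:
        return datetime.strptime(text, "%b %d, %Y").date()
    # Year omitted: supply the current year before parsing so that
    # "Feb 29" parses correctly in a leap year.
    return datetime.strptime(f"{text} {today.year}", "%b %d %Y").date()


today = date(2016, 3, 10)  # a Thursday
for label in ["Monday", "Feb 29", "Dec 25, 2015"]:
    print(label, "->", normalize_last_contacted(label, today))
```

Passing `today` in explicitly, rather than calling `date.today()` inside the function, keeps the weekday arithmetic easy to test.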

kfred commented 8 years ago

I also fixed profile.id and took a stab at fixing essays.py, to no avail.

Something odd is happening with some of the text elements on OKC not being returned through text_content().

Edit: Actually, I believe it may be partially related to other okcupyd dependencies causing conflicts. (See above re: profile.location).

colonelpanic8 commented 8 years ago

thanks! I'll take a look soon.

kfred commented 8 years ago

Cool. So, I did fix profile.id in my fork, but it's in a commit with my (failed) attempt at fixing essays. My fork is synced with my local, but I have not yet submitted a pull request for it.

kfred commented 8 years ago

Some of OKC's UI changes seem to have been done in a deliberate attempt to make scraping more difficult (as you'll notice when looking at some of the updates, particularly essays).

colonelpanic8 commented 8 years ago

@kfred Hah, that's interesting. I think that may be a direct response to okcupyd. They definitely know about it and have blocked my IP several times. I haven't looked at what they have done, but I'm sure there is a way around it.

kfred commented 8 years ago

Haha, they've got you on their watchlist! Oh man, I hope none of OKC's lackeys show up at my door with baseball bats!

kfred commented 8 years ago

My latest commit (in my forked repo) contains a profile.plocation function, which demonstrates how the profile.location fix could be applied (sans the current profile.location dependency conflicts).