clips / pattern

Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.
https://github.com/clips/pattern/wiki
BSD 3-Clause "New" or "Revised" License

Python 3 support #62

Open tom-de-smedt opened 10 years ago

tom-de-smedt commented 10 years ago

Pattern should start supporting Python 3. Looking at the amount of code, it is a non-trivial task and any help is much appreciated.

pemistahl commented 10 years ago

Hi Tom,

I'm a graduate in computational linguistics and would like to contribute to Pattern. Can you be more explicit about how Pattern should support Python 3? That is, do you want to maintain two different branches in parallel, one for Python 2 and one for 3? Or do you want to have a single code base that works both with 2 and 3? In the latter case, a library such as six would be useful.

Let me know what you think.

Cheers, Peter

tom-de-smedt commented 10 years ago

Hi Peter,

My goal would be to have a single code base that works with 2 and 3, but I have little experience with Python 3 so I don't know how feasible it is. In any case, the task is becoming more urgent so I will start looking into it more. I took a look at six which seems very useful. It's MIT-licensed so it could be included in Pattern.

Any help is appreciated! Let me know what you think.

Best, Tom

hayd commented 10 years ago

:+1: on a single codebase.

I think the first stage is to add travis for testing (it looks like you're missing a requirements.txt file, so I'm unsure what deps it's missing (?) ). Travis will really help with conversion (and ensuring it continues to work on multiple platforms).

Happy to help if you can pass a requirements.txt.

waylonflinn commented 10 years ago

Got through the first two steps outlined by @hayd in this fork (repo has a requirements.txt and .travis.yml).

Some of the tests need to be excluded.

from test.py

# pattern.db tests require a valid username and password for MySQL.
# pattern.web tests require a working internet connection 
# and API license keys (see pattern.web.api.py) for Google and Yahoo API's.

Travis is just running python -m unittest discover -s test right now.
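
One way to exclude such tests on Travis without deleting them is to skip them when the required resource is not configured. A minimal sketch, assuming two hypothetical environment variables, `PATTERN_MYSQL_USER` and `PATTERN_ONLINE` (neither is defined by Pattern), and placeholder test classes:

```python
import os
import unittest

# Hypothetical switches, not part of Pattern: set them locally to opt in.
HAS_MYSQL = bool(os.environ.get("PATTERN_MYSQL_USER"))
IS_ONLINE = os.environ.get("PATTERN_ONLINE") == "1"


class TestDatabase(unittest.TestCase):

    @unittest.skipUnless(HAS_MYSQL, "requires MySQL credentials")
    def test_mysql_connection(self):
        # A real test would open a database connection here.
        self.assertTrue(HAS_MYSQL)


class TestWeb(unittest.TestCase):

    @unittest.skipUnless(IS_ONLINE, "requires an internet connection and API keys")
    def test_search(self):
        # A real test would query a web API here.
        self.assertTrue(IS_ONLINE)


def run_suite():
    """Run both test cases and return the unittest result object."""
    loader = unittest.TestLoader()
    suite = unittest.TestSuite()
    suite.addTests(loader.loadTestsFromTestCase(TestDatabase))
    suite.addTests(loader.loadTestsFromTestCase(TestWeb))
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Skipped tests are still reported, so the Travis log shows what was left out rather than silently hiding it.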

waylonflinn commented 10 years ago

Ran futurize on the codebase. Here are some preliminary findings:

  1. Some of the bundled dependencies appear to have already been futurized (they contain from __future__ import and from future import statements). Now that we have pip and virtualenv does it make sense to unbundle these?
  2. Unicode is used extensively throughout the codebase. I used from __future__ import unicode_literals in several places (mostly for raw string literals), but this should probably be handled more carefully in the long term.
  3. Does it make sense to replace web.json.encoder with the standard library module? There was a section starting with the comment ## HACK: hand-optimized bytecode; turn globals into locals that I wasn't sure how to deal with and had to comment out.
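
On point 3, the standard library's `json` module handles Unicode text on both Python 2.7 and Python 3. A minimal sketch follows; the `roundtrip` helper is hypothetical, and whether the standard module can replace `web.json.encoder` everywhere in Pattern is not established here:

```python
import json


def roundtrip(obj):
    """Serialize obj to a JSON string and parse it back."""
    # ensure_ascii=True (the default) escapes non-ASCII characters as \uXXXX,
    # so the encoded string is plain ASCII on both Python 2 and 3.
    encoded = json.dumps(obj, sort_keys=True)
    return encoded, json.loads(encoded)


encoded, decoded = roundtrip({u"word": u"caf\u00e9", u"count": 2})
print(encoded)
```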

I'm a bit new to python, so any feedback is appreciated. This is a beautiful library and I'd love to see it get the unicode love from python 3.

hayd commented 10 years ago

My 2cents:

Not sure what to do about API keys, was wondering what other modules e.g. pandas did for those parts... IIRC there may be keys you can use for testing of clipped results...

Perhaps it makes sense to create a PR for this and comment there, then you can comment on specific bits of code :) ... first pass tests then make pretty

tom-de-smedt commented 10 years ago

There's an "official" fork of Pattern with the specific aim of making it compatible with Python 3: https://github.com/pattern3

The wiki has some more information: https://github.com/pattern3/pattern/wiki

The compatibility update is supported by a grant from the Python Software Foundation. This money is to be divided among contributors. You can read the grant proposal here: http://www.clips.ua.ac.be/media/Pattern-3-grant-proposal.pdf

The fork is initiated by myself, Waylon Flinn and David Branner. Everyone (Peter & hayd?) is welcome to join as admin of the project. As admin, you'll be able to edit anything so feel free to take initiative! (we do encourage pull requests, so we can keep track of who did what)

hayd commented 10 years ago

Happy to help with this, however when I tried (and trying again just now) running the tests I get a load of exceptions (python 2.7). I suspect this is just initial set up on my machine...

What do I need installed / setup to run the test suite (locally)?

Assuming fresh python install (or env) the following is failing:

git clone ...
cd pattern
python setup.py install  # this *ought* to install dependencies, but I don't think it does
nosetests  # this should sniff out and run all the tests, and does.

See the travis run in the above fork: https://travis-ci.org/pinleague/pattern/builds/32799385 (this is the kind of thing that's failing, though that's a couple of months old).

tom-de-smedt commented 10 years ago

Hi Andy,

My knowledge of Travis is zero, but different people including yourself have suggested it as a first step so I will examine it more closely. Looking at the output of the link you provided, these look like typical Python 2 vs 3 errors, e.g., using print stuff instead of print(stuff) and except Exception, e instead of except Exception as e. These are easy to fix, I previously used regular expressions to update them in the source code, but not yet in the unit tests. I'll look at updating the unit tests and push it to pattern3.
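
For reference, a minimal sketch of the two idioms in a form that runs unchanged on Python 2.6+ and Python 3 (`safe_divide` is a hypothetical example function, not Pattern code):

```python
def safe_divide(a, b):
    """Return a / b, or None if b is zero."""
    try:
        return a / b
    # "except ZeroDivisionError, e" is Python 2 only;
    # the "as" form works on Python 2.6+ and Python 3.
    except ZeroDivisionError as e:
        # print(...) with a single argument behaves the same on both versions.
        # For several arguments on Python 2, the module would also need
        # "from __future__ import print_function" as its first statement.
        print("error: %s" % e)
        return None


result = safe_divide(1, 0)
```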

Best, Tom

hayd commented 10 years ago

@tom-de-smedt Lots of stuff needs to be migrated to python 3, but this can really only be done with confidence once tests pass (and at the moment I can't get them passing either locally or on travis on python 2.7!!!).

At the moment they (the python 2.7 tests) fail with errors from the bottom of this page: https://travis-ci.org/pinleague/pattern/jobs/32799386. Any ideas why?

pemistahl commented 10 years ago

Hi Tom,

as I wrote at the beginning of this year, I'm still interested in contributing to pattern. However, I have not started yet because I didn't really know where to start. But now there exists a concrete plan and I would like to be part of it. I haven't written Python code for more than a year now but it should be easy for me to get into it again (I wrote a lot of Python code during my studies and I like the language very much). Last but not least, I have been out of the computational linguistics area since I started my current job a year ago, but it would be great to deal with that stuff again.

Some things are not yet clear to me:

  1. You wrote that the fork should be made compatible with Python 3.3 but 3.4 has been released already. Shouldn't it be compatible with 3.4 then?
  2. In the fork's wiki you wrote that the fork should be made compatible with both Python 2.7 and 3.3. But what's the point of creating a fork explicitly named pattern3 if it should still support Python 2.7? In my opinion, we can provide for a much cleaner and optimized code base if we completely drop 2.7 support. Then, the usage of libraries such as six would become obsolete. Of course, the downside of this approach is to maintain two separate code bases.

I cannot tell you yet which module I would prefer to work on. First, I need to take a look at the code again. I'm not sure though whether it's a good idea to have a lot of admins for the fork. Working with pull requests is much better anyway due to the reasons you mentioned.

hayd commented 10 years ago

This was partially my misunderstanding (!), just running nosetests ran the abstract test methods, which fail (at least that's part of it). cleaning these classes is probably a good thing to do anyways (they are in an "interesting" style... e.g. IMO the suite functions should go), I've cleaned up a little...

I had to capture a few actual test failures and some HTTP403Forbidden and HTTP404NotFounds. There are also a couple of proper errors (in python 2); for now I'm skipping those tests, but they really need looking at. I've labelled them FIXME in my branch (should I PR to pattern3 or here once passing?)...

As I said above, it's worth making sure that these tests pass reliably in python 2 before even attempting to migrate to python 3 (otherwise it's shooting in the dark). That said, I think the issues I've found (and labelled FIXME) are minor (or at least I'm hopeful that's the case if someone can look at them who understands the codebase!).

See https://github.com/hayd/pattern/compare/c5d9c2358...ce1fe8103ccb (and on travis https://travis-ci.org/hayd/pattern/builds/39245044, unfortunately not quite passing python 2.6 and 2.7, I may have to skip/fix a couple more? Some tests seem flaky - especially those that compare e.g. to 0.771!).
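
Tests that compare a score to a literal such as 0.771 can be made less brittle by asserting within a tolerance instead of requiring an exact match. A minimal sketch, where `f1_score` is a hypothetical stand-in for whatever metric the real test computes:

```python
import unittest


def f1_score(precision, recall):
    """Hypothetical stand-in for a metric computed by the code under test."""
    return 2.0 * precision * recall / (precision + recall)


class TestScore(unittest.TestCase):

    def test_f1(self):
        score = f1_score(0.75, 0.80)
        # Compare to 2 decimal places instead of requiring an exact match.
        self.assertAlmostEqual(score, 0.774, places=2)
        # Or state the allowed deviation explicitly.
        self.assertAlmostEqual(score, 0.774, delta=0.005)
```

This only absorbs rounding differences. It does not help with tests whose result depends on randomness or iteration order.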

Note1: This allows the test suite to be run by simply calling nosetests (or py.test).

Note2: I'm skipping the mysql tests atm, but that's no biggie to fix just an install in the yml (our objective is for no tests to be skipped on travis), the others are more important, but I'm afraid I need a patterns expert to look at the FIXMEs!!


Just to clarify the objectives here:

hayd commented 10 years ago

To answer @pemistahl: I don't think going fully py3 (and dropping support for py27) is a good option for a library... for the next decade! I would like to see a shared code base and drop support for python <= 2.5 (nearly every library is dropping python 2.5 support).

I'd really like to see pattern3 (once ready) merge upstream into pattern.

pemistahl commented 10 years ago

@hayd OK, I get your point. I'm okay with that. It just reminds me again of how unhappy I am about the Python 3.* transition in general across the Python community.

Another question @tom-de-smedt : If working with pull requests is the preferred way for contribution, then why did you create the pattern3 fork? Anyone who wants to contribute would create their own fork anyway. Wouldn't it be sufficient to simply create a branch here in the main repo for this purpose?

hayd commented 10 years ago

I've submitted a couple of PRs to the pattern3 branch. I think it makes sense to fix that up then merge back here (it's going to be easier to keep track of things if they are in separate repos, separate issues/PRs etc). I would strongly recommend downing tools for a short while (here on clips/pattern), hopefully only for a few weeks, and concentrating on the pattern3 branch/repo.

I'm "somewhat hopeful" it's not a massive job (famous last words). Once the python3 imports are working it should be clearer where the hit list is going to be (I suspect the toughest are the str/bytes handling).
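
A minimal sketch of the kind of str/bytes helpers such a port tends to need, assuming UTF-8 throughout. The names `to_text` and `to_bytes` are hypothetical and are not Pattern's API:

```python
import sys

PY2 = sys.version_info[0] == 2
# "unicode" only exists on Python 2; the name is never evaluated on Python 3.
text_type = unicode if PY2 else str  # noqa: F821


def to_text(value, encoding="utf-8"):
    """Return value as text (unicode on Python 2, str on Python 3)."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return text_type(value)


def to_bytes(value, encoding="utf-8"):
    """Return value as bytes (str on Python 2, bytes on Python 3)."""
    if isinstance(value, bytes):
        return value
    return text_type(value).encode(encoding)
```

Decoding at the input boundary and encoding at the output boundary keeps the rest of the code working with text only, which is usually easier to reason about than mixing both types.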

hayd commented 10 years ago

Just to update those following at home, last night I got python 3 running all tests without syntax or import errors (of course, half those tests are failing), python 2 is still passing all the tests (except those tests which failed before migration which are skipped).

https://github.com/pattern3/pattern/pull/6

(It did require ripping out the bundled (vendorized) packages and making them dependencies - I think this is a good idea anyway... so, more "home-testing" in python 2 may be a good idea before this update is merged back into clips/pattern? esp. where there is poor coverage.)

This means there is a more obvious hitlist of things to do. For those who want to help I recommend (once this is merged), attempting to make all the tests pass on specific testing files you're interested in (e.g. for database):

$ nosetests test/test_db.py
$ nosetests test/test_db.py:TestClass
$ nosetests test/test_db.py:TestClass.test_method

$ nosetests test/test_db.py --pdb --pdb-fail  # drop in when there's a failure/exception

A more complete todo list issue: https://github.com/pattern3/pattern/issues/5

I haven't really thought about how six fits here, IMO if it makes fixing a test easier then use it ?

hnykda commented 9 years ago

Hello,

I'm looking forward to using Pattern with Python 3, because my work is written in it. I'm kind of confused about the current state of Python 3 support. This package is not installable (at least, not through pip - I'm getting Python 2 errors) and pattern3 doesn't contain the whole code base (at first sight).

By the way, Python 3 is getting more and more focus today and it's a very good idea to follow this trend. You use a lot of embedded packages, which is definitely not a good idea for the future (e.g. BeautifulSoup_v3.2.1 has not been supported for years).

hayd commented 9 years ago

@kotrfa pattern3/pattern isn't on pip yet (so not installable), the tests aren't passing for python 3 either so it's not ready for release yet - though quite a bit of work has been done. I think the plan is for this fork to become the pattern on pip (at least that's my understanding), and it'll support both python 2 and 3.

In pattern3/pattern I've ripped out a load of the vendorised deps (which is perhaps why it looks like the code base is so different), for example beautiful soup. The tests from clips/pattern are still all there and all pass (in python 2), so nothing was removed in this process (I claim).

If you'd like to help out, which would be fantastic, please clone pattern3/pattern and see if you can help with anything in the todo list (maybe pick a test file and get it passing in both python 2 and 3, perhaps the section you need in your work?). I have a few of the areas of the codebase passing already (in both python 2 and 3), IMO it's not a huge amount of work to go :) mostly fiddly unicode stuff, then we can get it out on pip...

hnykda commented 9 years ago

Hello,

yeah - I was speaking about installing this fork, not Pattern3, which is, as you said, not available on pip.

I don't really need any part of pattern currently - my work is almost done and I found Pattern too late, unfortunately. Nevertheless, maybe I could replace some parts of my current code using Pattern and simplify it. In that case, I would definitely like to help. But it doesn't seem likely I'll do it in the following weeks, since the end of the semester is coming.

You have done an amazing job by the way, thank you!

hayd commented 9 years ago

FYI all, I did a little the last couple of days, now test_db and test_web are the only remaining py3 failing tests files (also test_examples, but that's IMO a special case). I don't think they should be too bad to fix... e.g. main things

Surprisingly these are py3 only failures (the py2 still passes)...

That said, there are some hacks - especially the unicode workflow - which could be cleaned up.

Edit: Too hasty in victory, I've nearly got vector working https://travis-ci.org/hayd/pattern/jobs/43751620

hnykda commented 9 years ago

Thanks for the information! It is really promising. :+1:

hayd commented 9 years ago

@tom-de-smedt actually the vector thing is a little weird: it looks like the vector test fails about 50% of the time on python 3 although it passes all the time on python 2 (from running the test 10 times on both). In a way it's good, as I think we're at a point where expertise is needed! :) see https://github.com/pattern3/pattern/pull/17

Zearin commented 9 years ago

+1 for Python 3 support.

I realize the need to support a mature, powerful, and loyal community of legacy Python users, but Python 3 is only going to get more relevant with time, not less.

More importantly, Python 3 is just better. Its standard library organization is much cleaner, its syntax is more readable, and in many common cases it performs significantly better than Python 2 (speed and/or memory footprint).

That said, it’s often trickier to port to Python 3 than it “feels” like it should be. For a while, six has helped make this a little easier, but it only went so far.

To make the transition as painless as possible, I strongly recommend the Python-Future package. It is way more powerful than six; it has tools focused on automating as much of the transition as possible; and it has truly excellent documentation.

I believe it was mentioned earlier in this thread, but I just wanted to reiterate its awesomeness for anyone that might have missed it. Seriously—just browsing its documentation can evoke the inspiration to transition to a 2-3 compatible codebase.


I haven’t used Pattern yet, but it also has excellent documentation (great job!). Unfortunately, my current research is in Python 3. That’s how I found my way to this page. I hope Pattern gets to Python 3 soon!

Keep up the excellent work, and May The Source™ Be With You!

hayd commented 9 years ago

@Zearin I used future to do the majority of the heavy lifting in the python 3 port, see the pattern3 repo. Please do try it out.

MarcosGinel commented 9 years ago

How could you define the "state" of the project for porting Pattern into Python 3?

I used it two years ago with Python 2.7 and it was awesome; now I'm going to work with Python 3 and I would love to use it (Pattern) again!

Thanks!

legel commented 8 years ago

Greetings, we came across this from here, and I just noticed that while a lot of the build looks stable, support for Python 3.3 seems not to be working? At least that is how I would interpret the Travis CI page. Thanks.

legel commented 8 years ago

I just quickly tested it on Python 3.4, by creating a conda virtual environment with python 3.4 (using conda create -n python3 python=3.4 anaconda) and running the following:

    git clone https://github.com/pattern3/pattern.git
    cd pattern
    python setup.py install

However, unfortunately, upon testing, text parsing functions at least for the web module do not seem to work... In the test folder I ran python test_web.py which is what we are using, and the following is a sample of what I got back...

======================================================================
FAIL: test_plaintext (__main__.TestPlaintext)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_web.py", line 455, in test_plaintext
    u"<a href=\"http://www.domain.com\">link</a>\n\n* item1 xxx\n* item2")
AssertionError: 'tags amp; things\n\ntitle1\n\ntitle2\n\nparagr[93 chars]tem2' != 'tags & things\n\ntitle1\n\ntitle2\n\nparagraph[76 chars]tem2'
- tags amp; things

======================================================================
FAIL: test_encode_utf8 (__main__.TestUnicode)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_web.py", line 53, in test_encode_utf8
    self.assertTrue(isinstance(web.encode_utf8(s), str))
AssertionError: False is not true

----------------------------------------------------------------------
Ran 91 tests in 0.500s

FAILED (failures=4, errors=40, skipped=1)

davidhorat commented 8 years ago

I used pattern with Python 2 before and I loved it, but now I switched to Python 3. What is the status of porting Pattern to Python 3?

james-see commented 8 years ago

It is astonishing to me that someone hasn't completed a full update to get Python 3 version of pattern working. I guess I will fork pattern3 and try to finish it myself.

james-see commented 8 years ago

never mind. too many recursion errors, encoding errors, etc. someone who knows the actual codebase should really update it.

hayd commented 8 years ago

@jamesacampbell it's not so far off https://github.com/pattern3/pattern/issues/5

james-see commented 8 years ago

@hayd yeah i was getting bogged down in all of the failures in test_web and test_db but did notice that the others were passing. commenting in that thread now. thanks

fabriciorsf commented 8 years ago

Tips to write Python 2-3 compatible code: http://python-future.org/compatible_idioms.html https://docs.python.org/3/howto/pyporting.html

This library is very good, and it can't stop in time...

tom-de-smedt commented 7 years ago

Support for Python 3 is a long and ongoing discussion.

@hayd has done a lot of work on this -- see also https://github.com/pattern3/pattern

@hayd and I agreed that we should merge his work back into the main branch and take it step by step in a single branch, but I never get round to it due to time constraints. It's frustrating. Students at the university are now taught Python 3 and we can't offer Pattern to tell them about Natural Language Processing.

I will give it another go by submitting the task to Google Summer of Code.

But really Pattern needs more people that manage pull requests, that have push and admin rights, and that can take it into their own hands. Being the sole admin has worked well in the past to keep the source code clean and the focus tight, but we need to rethink this strategy.

Contact me at tom@organisms.be if you feel like stepping up.

pradyunsg commented 7 years ago

@tom-de-smedt I would be interested in taking this on as a part of GSoC.

proycon commented 7 years ago

Possibly relevant in this discussion: http://www.python3statement.org/ , major scientific Python projects are phasing out 2.7 support. You might even want to consider dropping 2.7 entirely and switch to 3 at a next major version release, if that makes the transition easier.

BolajiOlajide commented 7 years ago

Hello everyone, i've gone through the issues and it's very interesting. So many questions asked about pattern's compatibility with Python3. Currently working on a python3 project and i need pattern but i can't install due to compatibility issues.

@hayd i went through your pattern3 repo and noticed the Travis build was failing. Do you need help with that?

james-see commented 7 years ago

@andela-bolajide can't speak for @hayd but please if you have time fork it and get the travis builds passing and then do a merge request or I would use your fork as is in meantime if you get it fully working in Python 3. I don't have time myself to do it.

BolajiOlajide commented 7 years ago

Okay @jamesacampbell, i'll get to work tomorrow and i'll be in touch. Cheers man

achillesliu commented 7 years ago

Hi guys, I found that textblob is using the pattern library and they provide python 3 support. So if anyone is in a hurry, just go there and check the docs.

afsun commented 7 years ago

Hi, I'm using python 3 and I can't install pattern with anaconda. I tried 'pip install pattern', but it doesn't work. This is the result: (image: untitled)

BolajiOlajide commented 7 years ago

Pattern isn't compatible with Python3 yet. @afsun

afsun commented 7 years ago

My friend installed it for python3, but she can't remember how she did that! I edited the file 'setup.py' and inserted parentheses for 'print', but I still don't know how to install pattern. Where should I copy this file?

pradyunsg commented 7 years ago

@tom-de-smedt I would be interested in taking this on as a part of GSoC.

I did not take this project.

markus-beuckelmann commented 7 years ago

I will work on this issue as part of this year's GSoC, so there will definitely be some substantial progress over the summer. We will probably track most of the development in the preliminary pattern3 repository for now, since parts of the code are already ported. We'll see what the status quo is – what works and doesn't work – over the next weeks.

So if you want to be part of all that (which would be great!), bring in your ideas or thoughts on the process and make sure to follow the above mentioned repository.

james-see commented 7 years ago

@markus-beuckelmann very excited about this, thanks

tom-de-smedt commented 7 years ago

Update: As part of Google Summer of Code 2017, Markus Beuckelmann (@markus-beuckelmann) will be working on the future of Pattern (porting it to Python 3 is first on our list). Markus is admin of the repo now and can handle pull requests and invite collaborators. Be sure to reach out to him and include him in discussions about the port. About the pattern3 fork: a lot of work was done here by Andy Hayden (@hayd). Andy & I agreed that a fork, which was my idea, was not the best idea. All work on porting the toolkit should happen here. So we will take what we can use from the pattern3 repo, put it in here, and continue here, eventually discarding the pattern3 fork. It is less confusing for everyone if we work on 1 repo instead of 2 forks. Hopefully we can make some progress over the summer.

markus-beuckelmann commented 7 years ago

Thanks @tom-de-smedt, I hope it's going to be a productive summer! Here is how I plan to proceed...

Finally, and I think I speak in the name of many Pattern users/developers, special thanks to @hayd for all the valuable work done in the pattern3 fork. We will make use of it wherever reasonable.

talaikis commented 7 years ago

Just checked pattern3 and it seems it also uses sgmllib inside pattern/web. It would be good if someone familiar with it could explain how it works or what it does (maybe code for that lib?), because with it in place I can't even start running tests :) I think it can be changed to lxml. Every other test of pattern3 doesn't seem to have such dependencies and can be worked out.
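
For context, sgmllib was removed from the standard library in Python 3. As a rough sketch of what a replacement could look like, here is a tag stripper built on the standard library's `html.parser` rather than lxml (a swapped-in choice to keep the example free of third-party dependencies). `TagStripper` and `strip_tags` are hypothetical names, not Pattern's API, and this does not reproduce everything pattern/web does with sgmllib:

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class TagStripper(HTMLParser):
    """Collect only the text content of an HTML fragment."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; tags themselves are ignored.
        self.chunks.append(data)


def strip_tags(html):
    """Return html with all tags removed."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


print(strip_tags("<p>hello <b>world</b></p>"))
```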

markus-beuckelmann commented 7 years ago

I realize I'm a bit behind on keeping people following this issue up to date with the latest progress! Google Summer of Code has been over for a couple of weeks already, and it has brought substantial progress (see full list of commits). We are now in a position where we have a version on the development branch that supports all modules except for pattern.server on both Python 2.7 and Python 3.5+. For people who want to find out more about the specifics and intermediate steps, go ahead and read my detailed GSoC reports on the Newsaudit blog (#1, #2, #3).

So now the plan is to smooth out the rough edges and release a new major version Pattern 3.0 within the next months. There is really only one known bug at the moment that is solely related to Python 3 and it only affects the information gain tree classifier IGTree in pattern.vector. Then there are a couple of issues like deprecated web APIs in pattern.web that should be addressed before the next release.

In the meantime, everybody feel free to check out the development branch and report any issues that may come along!