deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License
3.89k stars 599 forks

Get a new maintainer #386

Closed traverseda closed 3 years ago

traverseda commented 3 years ago

There are a bunch of pull requests that are highly relevant and have been left open for more than a year.

I think it's time to try and find a new maintainer for this project.

deanmalmgren commented 3 years ago

Totally agree @traverseda . Is this something that you might be interested in?

A while back (#117) @jpweytjens and @DanielSwain offered to help co-maintain textract but I gather that their energy has also been diverted to other things.

traverseda commented 3 years ago

My day job keeps me pretty busy, so I'm not sure that I currently have the time to help maintain this. It looks like you have a good testing strategy in place, which means that merging pull requests without a deep review is a valid strategy, but I'm not sure I've got the time to do more than that.

I'd be willing to give it a shot, if only to get a few breaking bugs dealt with.

jpweytjens commented 3 years ago

I'm afraid I took up co-maintainership underestimating the time investment needed to do this properly. I have replied to and closed quite a few issues and pushed a few small enhancements. The issues and PRs regarding outdated requirements are more difficult, in my mind. Some people are asking for recent versions of dependencies, while others are asking for flexible requirements so that their older environments aren't affected.

I had the motivation to develop a textract v2.0 aiming to fix the most common issues: missing requirements (external parsers), encoding choice and errors, automatic filetype detection based on extension or MIME type, and support for file streams. I have some unfinished code that reimplements textract as a graph. Possible nodes are a detection node, a parser node, or a converter node. The detection node detects the encoding and filetype. When constructing the graph, the parser and converter nodes test whether their (external) dependency is available and are only added to the graph when this is the case. The idea is to maintain the ease of use of the current version of textract, while providing a modular structure that allows users to easily submit a PR with a new parser by simply constructing a new node.
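To make the graph idea a bit more concrete, here is a rough sketch of how it could look. This is not the unfinished v2.0 code: `ParserNode`, `detect`, `build_graph`, and `extract` are hypothetical names chosen for illustration, the encoding detection is deliberately naive, and converter nodes are left out to keep it short.

```python
import os
import shutil
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class ParserNode:
    """Hypothetical parser node: turns one file type into text."""
    filetype: str                        # extension handled, without the dot
    command: Optional[str]               # external executable needed, or None
    parse: Callable[[bytes, str], str]   # (raw bytes, encoding) -> text

    def available(self) -> bool:
        # A node is usable if it needs no external tool or the tool is on PATH.
        return self.command is None or shutil.which(self.command) is not None


def detect(filename: str, data: bytes) -> Tuple[str, str]:
    """Hypothetical detection node: guess filetype and encoding."""
    filetype = os.path.splitext(filename)[1].lstrip(".").lower()
    try:
        data.decode("utf-8")
        encoding = "utf-8"
    except UnicodeDecodeError:
        encoding = "latin-1"  # naive fallback, for illustration only
    return filetype, encoding


def build_graph(candidates: List[ParserNode]) -> Dict[str, ParserNode]:
    # Only nodes whose dependency is present are added to the graph.
    return {node.filetype: node for node in candidates if node.available()}


def extract(filename: str, data: bytes, graph: Dict[str, ParserNode]) -> str:
    filetype, encoding = detect(filename, data)
    if filetype not in graph:
        raise LookupError(f"no available parser for .{filetype}")
    return graph[filetype].parse(data, encoding)


candidates = [
    ParserNode("txt", None, lambda data, enc: data.decode(enc)),
    # The command name below is made up, so this node is skipped.
    ParserNode("pdf", "hypothetical-pdf-tool-not-installed", lambda data, enc: ""),
]
graph = build_graph(candidates)
```

With something like this, contributing a new parser would amount to appending one more node to `candidates`, and a missing external tool would surface as a clear "no available parser" error rather than a crash partway through extraction.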

I'm still interested in finishing this idea, but can't give any indication of when it would be done. In the meantime, I would welcome and help with co-maintainership.

traverseda commented 3 years ago

Alright, well I can take a shot at merging in a few of the outstanding pull requests this weekend if I'm given permissions. I think that #285 #323 #326 #348 and #336 should all be pretty easy to merge in, and pretty uncontroversial. Right now the Travis CI builds are failing for some of them and I'll need to figure out why first, but hopefully that isn't too difficult.

deanmalmgren commented 3 years ago

Hired! I added you as a collaborator on the repo, @traverseda. If you don't mind merging the PRs that make sense, that would be great. If there's anything I can do to help, let me know!

Dean


traverseda commented 3 years ago

Yay.

Definitely not going to have time to work on any of this until the weekend.

violuke commented 3 years ago

Thanks guys for helping this project move forward. I personally would really benefit from the package updates being merged in, as I keep needing to use the latest pyup-bot branch (which changes every 2 weeks) to keep Poetry happy with this alongside our other dependencies https://github.com/deanmalmgren/textract/pulls/pyup-bot

Cheers

traverseda commented 3 years ago

Looks like some of those use semver, so we can loosen our requirements a fair bit. Instead of depending on six==1.12.0 we can use six==1.* safely. Hopefully I'll get some time to work on this this weekend; last weekend I got busy helping re-wire a car. Feel free to continue to bug me about it.
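For anyone unsure what the difference between the two pins means in practice, here is a small sketch. `matches_pin` is a hypothetical helper written only for illustration; it is a simplified stand-in for the prefix matching that real installers do, and it ignores version normalization, pre-releases, and every operator other than `==`.

```python
def matches_pin(version: str, spec: str) -> bool:
    """Return True if `version` satisfies an `==X.Y.Z` or `==X.*` style pin.

    Simplified illustration only: real tools apply fuller version-matching
    rules than this plain string comparison.
    """
    if not spec.startswith("=="):
        raise ValueError("only == specifiers are handled in this sketch")
    pattern = spec[2:]
    if pattern.endswith(".*"):
        # Wildcard pin: compare only the leading release segments.
        prefix = pattern[:-2].split(".")
        return version.split(".")[:len(prefix)] == prefix
    # Exact pin: the whole version string must match.
    return version == pattern


# The exact pin rejects a newer 1.x release; the wildcard pin accepts it.
exact_ok = matches_pin("1.16.0", "==1.12.0")
wildcard_ok = matches_pin("1.16.0", "==1.*")
next_major_ok = matches_pin("2.0.0", "==1.*")
```

The wildcard form only protects us if the dependency really does follow semver, so it is worth checking each package before loosening its pin.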

deanmalmgren commented 3 years ago

Sounds good to me. I can't remember if/why I set up pyup bot to have specific versions, but I'm sure there's a better way to configure it.

Dean
