camelot-dev / camelot

A Python library to extract tabular data from PDFs
https://camelot-py.readthedocs.io
MIT License
3.01k stars 472 forks

We need more maintainers #343

Open MartinThoma opened 1 year ago

MartinThoma commented 1 year ago

It seems like camelot is dead:

Besides the owner there are only 35 other contributors.

https://opencollective.com/camelot might be another way to check if it's dead.

Does anybody know more? Should we try to transfer the project to https://github.com/jazzband ?

MartinThoma commented 1 year ago

Or is somebody there who would like to become a maintainer?

vinayak-mehta commented 1 year ago

Hi @MartinThoma, sorry for not being responsive here. I've been busy with some life stuff for some time now and haven't had the mindspace to look into the issues here. I've been wanting to get back into it, I'll look into them over this weekend.

I also want to stop being a single point of failure here and would love to get help maintaining camelot going forward.

foarsitter commented 1 year ago

Is there any movement on this topic?

ramSeraph commented 1 year ago

@vinayak-mehta Would you mind if I post this on the Indian FOSS and opendata channels? This tool has been extremely helpful in dealing with the PDF crap Indian Government puts out.

foarsitter commented 1 year ago

Sorry for my impatience, but the show must go on. In order to take camelot to production I created a fork and released it to PyPI under camelot-fork==0.20.0. My intentions are limited and I hope this project finds new maintainers soon.

When the request came to extract tables from PDF files I thought it would be a very tricky job, but camelot does it all. Therefore I want to express my gratitude to all of you who made that possible.

If people are in need of a fix, I'm willing to accept pull requests as long as they have test coverage.

vinayak-mehta commented 1 year ago

Sorry for being a bit unresponsive since I created this issue. I've pushed a release based on @MartinThoma's last PR: https://pypi.org/project/camelot-py/

@MartinThoma Thank you for the PR, would you like to be added to the github org so that you have push access to the repo?

@foarsitter Are you interested in maintaining the project here instead of the fork? I can add you to the github org too.

MartinThoma commented 1 year ago

Thank you for making a new release :pray:

> would you like to be added to the github org so that you have push access to the repo?

I would probably not be super active as I spend most of my time with pypdf. If that is ok for you, then yes, please add me :-) I could probably go over a couple of the PRs / ~introduce~ update CI so that maintaining the library becomes easier :-)

vinayak-mehta commented 1 year ago

@MartinThoma That would be awesome! Just sent you an invite ✉️

foarsitter commented 1 year ago

@vinayak-mehta should be awesome

MartinThoma commented 1 year ago

Are there any rules I should follow, e.g.

  1. Reviews: When I make a PR, should I ask you (or somebody else) for a review before merging it? (Here is the first one, btw: https://github.com/camelot-dev/camelot/pull/356 :smile: )
  2. Commit Messages: e.g. something like https://docs.scipy.org/doc/scipy/dev/contributor/development_workflow.html#writing-the-commit-message ?
  3. Merges: Merge / Squash+Merge / Rebase+Merge: I prefer squash+merge, but you seem to do normal merges only. Is squash+merge ok?
  4. Tests: Do you have any hard rules regarding unit tests, e.g. that every new feature needs to have full test coverage?

foarsitter commented 1 year ago

As I see it:

  1. A review is always good, unless it is something really trivial. Be patient. Reverting releases because we are too eager is something we should want to avoid.
  2. A commit message should be clear about its contents; which style is applied is less important to me. As I see it we can generate a changelog based on the titles of the merged pull requests (see my fork: https://github.com/foarsitter/camelot/releases).
  3. If more commit messages are needed because there are changes in various parts of the codebase, then a squash does not seem a good fit to me, so it depends on the PR.
  4. Adding tests afterwards is really hard, even harder when you are not the author of the code. So full coverage is recommended here from my point of view. If the addition is trivial, the test should be trivial too, right?
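
The changelog idea in point 2 could be sketched roughly like this. It is a minimal illustration, assuming the merged pull requests are already available as (number, title) pairs; the function name and the sample data are made up, and fetching the titles from GitHub is left out entirely.

```python
def format_changelog(version, merged_prs):
    """Build a plain-text changelog from (pr_number, title) pairs."""
    lines = [f"## {version}", ""]
    for number, title in merged_prs:
        # One bullet per merged pull request, using its title as the entry.
        lines.append(f"- {title.strip()} (#{number})")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [(1, "Fix typo in docs"), (2, "Update CI")]
    print(format_changelog("0.0.0", sample))
```

This only works well if pull request titles are descriptive, which is one more argument for small, focused PRs.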

vinayak-mehta commented 1 year ago

@foarsitter I just invited you, sorry it took so long

I'm a big fan of the scikit-learn contributing guidelines.

  1. We should go for at least 1 review.
  2. I agree with @foarsitter's point.
  3. I'm gonna lean towards squash + merge because it's easier to revert, just in case we need to do that. It should also lead to small PRs which would be easier to review. If we have PRs with major enhancements that touch various parts of the codebase, then squashing might not make sense.
  4. I agree with @foarsitter's point.

nmstoker commented 1 year ago

I'd be interested in contributing, particularly to the docs initially.

It seems to me there's a lot of value in this repo, but things seem to have got into a fairly confusing state. Devoting time to it, I think I've figured out most of the misunderstandings I had and it seems like it's worth sharing / updating docs, so that others don't fall into the exact same traps I (and others) did.

Correct me if I'm wrong but I get the sense that what has made things harder overall is that the migration to pdftopng/poppler backend was in progress yet not completed when the maintenance fell away (quite reasonably given world events!)

@foarsitter's idea of a fork that cuts pdftopng out is interesting, although I would feel more comfortable if it was directly part of the main repo.

How feasible is it to make the "base" install equivalent to the fork (i.e. such that it doesn't install pdftopng as a requirement)? And with that, introduce a "pdftopng" extra requires option so people can optionally try it and then, only once it's deemed to work sufficiently well, it is switched to be what gets delivered with "base" at some later point. Presumably for that to happen there needs to be a bit of maintenance upstream in pdftopng too. If this last paragraph is best discussed in a separate issue, that's fine by me, just say 🙂
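
The optional-extra idea above could be sketched roughly as follows. This is only an illustration: the dependency lists are placeholders rather than camelot's real requirements, and `build_setup_kwargs` is a hypothetical helper. It just shows how setuptools' `install_requires` and `extras_require` arguments would keep pdftopng out of the base install while leaving it available on request.

```python
# Hypothetical sketch: keep pdftopng out of the base install and expose it
# as an optional extra. Package lists are placeholders, not camelot's real ones.

BASE_REQUIRES = ["numpy", "pandas"]  # placeholder base dependencies
PDFTOPNG_REQUIRES = ["pdftopng"]     # only installed when the extra is requested


def build_setup_kwargs():
    """Return the dependency-related keyword arguments for setuptools.setup()."""
    return {
        "install_requires": list(BASE_REQUIRES),
        "extras_require": {"pdftopng": list(PDFTOPNG_REQUIRES)},
    }


if __name__ == "__main__":
    kwargs = build_setup_kwargs()
    print(kwargs["install_requires"])
    print(kwargs["extras_require"])
```

In a real `setup.py` these would be passed along as `setup(..., **build_setup_kwargs())`, and users would opt in with something like `pip install "camelot-py[pdftopng]"` (the extra name here is an assumption).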

MartinThoma commented 1 year ago

@nmstoker I cannot answer that question, but at least I could review/merge PRs with documentation updates :-) so if there are specific learnings you want to share, I would support you :-)

nmstoker commented 1 year ago

Sounds a good start, thanks @MartinThoma !

foarsitter commented 1 year ago

@nmstoker Looking forward to your learnings!

kshitiz305 commented 1 year ago

After using the product for a long time in my developer career, I just started contributing to camelot with my initial pull request (#364). I would love to contribute to other projects as well. Thanks

bosd commented 1 year ago

How about Excalibur?? That might need some :heart: as well. There is still an open refresh issue on Windows which makes it unusable.

P.S. I'm happy to contribute/maintain a bit on both projects.

foarsitter commented 1 year ago

Looking forward to your contributions @bosd

MartinThoma commented 1 year ago

@vinayak-mehta Have you seen my e-mail?

  1. PyPI permissions: Can you please give me Owner permissions (instead of just Maintainer) via https://pypi.org/manage/project/camelot-py/collaboration/ so that I can take care of https://github.com/camelot-dev/camelot/issues/389 ?
  2. Github permissions: Can you please give me Admin permissions via https://github.com/camelot-dev/camelot-py/settings/access so that I can allow merge-commits for https://github.com/camelot-dev/camelot/pull/353 ?
  3. Project Governance: Would you be OK with the Github organization camelot-dev merging into py-pdf?

bosd commented 1 year ago

@MartinThoma How about Owner / Admin permissions for Excalibur?

bahoo commented 1 year ago

Just wandering in but happy to contribute. 👋🏻

ZupoLlask commented 11 months ago

> @vinayak-mehta Have you seen my e-mail?
>
>   1. PyPI permissions: Can you please give me Owner permissions (instead of just Maintainer) via https://pypi.org/manage/project/camelot-py/collaboration/ so that I can take care of Release to PyPI via Github Action #389 ?
>   2. Github permissions: Can you please give me Admin permissions via https://github.com/camelot-dev/camelot-py/settings/access so that I can allow merge-commits for Release camelot-fork 0.20.1 #353 ?
>   3. Project Governance: Would you be OK with the Github organization camelot-dev merging into py-pdf?

Are these permission issues solved already, @MartinThoma?

Can you please take care of these blockers, @vinayak-mehta?

MartinThoma commented 11 months ago

No. I still don't have sufficient permissions to bring the project back to life. Camelot is dead.

johnthagen commented 10 months ago

In case this helps others (since we didn't know until we tried camelot and ran into various issues how much its maintenance is suffering), here are a few active PDF processing alternatives in the Python ecosystem:

nmstoker commented 10 months ago

Not sure if people saw it, but in #479 I show some ideas I had with the docs.

With care I think it should be feasible to guide most people around the current difficulties with installation (I've managed setup in Windows and various Linux environments, no access to Mac but guess it's not that different to the Linux steps for the most part)

MartinThoma commented 8 months ago

We need to fork camelot if we want to continue developing it.

I've already talked with the people of py-pdf (website) and they are fine moving it there. But we need two people who would take care of it so that it's not another dead version.

@bosd @foarsitter Would you still be fine with becoming the new maintainers?

Discussion is here: https://github.com/py-pdf/pypdf/discussions/2466

foarsitter commented 8 months ago

@MartinThoma I'm willing to help where I can!

python3-dev commented 8 months ago

@MartinThoma : Please pull me in. I would like to contribute to the code.

ammadakram commented 7 months ago

I can fix the PdfFileReader deprecation error, please pull me in.

bosd commented 7 months ago

> I can fix the PdfFileReader deprecation error, please pull me in.

@ammadakram Can you please open a PR here: https://github.com/py-pdf/pypdf_table_extraction

jatinchhabriya commented 2 months ago

@MartinThoma @vinayak-mehta @bosd I am facing the same error as Kushal. Expected output: a list of tables. Standard output since this week: "Attribute Error: File Format not supported". Could you please let me know if a fix has been deployed on the forked branch? This was working a week ago, and for my particular use case the lattice boundary provided exclusively in camelot-py[cv] is required.

bosd commented 2 months ago

Please try the code from the new repo. If the problem persists, please open an issue there.