faust-streaming / cChardet

universal character encoding detector

Take over the original PyPI project? #32

Open mgorny opened 1 year ago

mgorny commented 1 year ago

Since the original cchardet project is clearly no longer maintained, have you tried contacting the original author to ask for permission to take over the PyPI project? And if that fails, have you considered applying for PEP 541 name reuse?

Creating a fork has the problem that some packages will now require cchardet and some will require faust-cchardet, and the two can't be installed simultaneously, which causes major problems for distributions.

wbarnha commented 1 year ago

I reached out to the original author a long time ago but got no response. I forgot about PEP 541, thank you for bringing this to my attention. I will submit an application.

Edit: After reading the PEP 541 requirements for reachability:

> Reachability
>
> The user of the Package Index is solely responsible for being reachable by the Package Index maintainers for matters concerning projects that the user owns. In every case where contacting the user is necessary, the maintainers will try to do so at least three times, using the following means of contact:
>
> - the e-mail address on file in the user’s profile on the Package Index;
> - the e-mail address listed in the Author field for a given project uploaded to the Index; and
> - any e-mail addresses found in the given project’s documentation on the Index or on the listed Home Page.
>
> The maintainers stop trying to reach the user after six weeks.

It seems I need to reach out to PyYoshi a few more times before the owner is considered "unreachable".

mgorny commented 1 year ago

Thanks.

I'm not sure if you are actually supposed to do that, and not the person handling your request. After all, how can PyPI admins know that you've actually contacted them?

I think filing a bug on their GitHub would also be a good step, as that is publicly visible.

wbarnha commented 1 year ago

> Thanks.
>
> I'm not sure if you are actually supposed to do that, and not the person handling your request. After all, how can PyPI admins know that you've actually contacted them?

I would forward emails to PyPI admins as evidence.

> I think filing a bug on their GitHub would also be a good step, as that is publicly visible.

Agreed, I don't like the idea of invoking PEP 541, but it seems that this project is in need of it. Opening up an issue in advance would be morally right.

Edit: Sorry, I'm tired. I misread "maintainers" as referring to me rather than the index maintainers. I'm still going to reach out again to show good faith.

Mr0grog commented 1 year ago

Howdy! Are there any updates on this?

Barring that, is there a future where the top-level name of this package is changed to alleviate collisions? (Granted it is useful that you can install this in place of cchardet and magically make other packages that know nothing about it work, but it does make a lot of situations messy, as the OP noted.)
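If the top-level name were ever changed, downstream code could stay compatible with both names by trying them in order. The following is only a sketch of that pattern: `import_first_available` is an illustrative helper, and `faust_cchardet` is a hypothetical module name that does not exist today.

```python
import importlib


def import_first_available(*module_names):
    """Import and return the first module in `module_names` that exists."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ModuleNotFoundError:
            continue
    raise ModuleNotFoundError(
        "none of these modules are installed: " + ", ".join(module_names)
    )


# Hypothetical usage: prefer a renamed package, fall back to the shared name.
# chardet_impl = import_first_available("faust_cchardet", "cchardet")
```

This keeps the drop-in behaviour for packages that only know about `cchardet` while letting newer code opt in to a collision-free name.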

wbarnha commented 1 year ago

Sorry, there are no updates on this at the moment. I have not been able to allocate the time to work on this. :sweat:

wbarnha commented 1 year ago

Reached out to the original developer, haven't heard back.

mike-clark-8192 commented 8 months ago

Could we use GitHub Actions to automate the release of this package to PyPI under a second, separate namespace? That way people who are experiencing conflicts over `import cchardet` have the option to depend on / `pip install faust-faust_cchardet` and use it as `import faust_cchardet`. The primary namespace for this fork could still be `cchardet`, but people could access it via the auto-sync'd, auto-published second package name to avoid the namespace overlap if they need/want that.

Incomplete GitHub Actions idea

```yaml
name: Publish to PyPI with Renamed Namespace
on:
  push:
    tags:
      - 'v*'
jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with: { python-version: '3.x' }
      - name: Rename directory
        run: mv src/cchardet src/faust_cchardet
      - name: Update imports (if necessary)
        run: >-
          find . -type f -name '*.py' -exec sed -i
          's/import cchardet/import faust_cchardet/g' {} +
      - name: Build the package
        run: python setup.py sdist bdist_wheel
      - name: Publish the package to PyPI
        uses: pypa/gh-action-pypi-publish@v1.4.2
        with:
          user: __token__
          password: ${{ secrets.PYPI_API_TOKEN }}
```
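The `sed` expression in the workflow only rewrites the literal text `import cchardet`, so lines such as `from cchardet import _cchardet` would be left alone. A rough sketch of a slightly broader rewrite is below; `rename_imports` is an illustrative helper, `faust_cchardet` is the hypothetical target name from this comment, and a real rename would likely also need changes to the build configuration that are not covered here.

```python
import re

# Matches the module name only where it starts an import statement, e.g.
# "import cchardet", "from cchardet import x", "from cchardet.sub import y".
_IMPORT_RE = re.compile(r"^(\s*(?:from|import)\s+)cchardet\b", re.MULTILINE)


def rename_imports(source, new_name="faust_cchardet"):
    """Rewrite imports of `cchardet` in `source` to `new_name`."""
    return _IMPORT_RE.sub(lambda m: m.group(1) + new_name, source)
```

Like the `sed` step, this only touches import statements: call sites that use the bare name `cchardet` after a plain `import cchardet`, and forms like `import os, cchardet`, would still need attention.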
wbarnha commented 8 months ago

> Could we use GitHub Actions to automate the release of this package to PyPI under a second, separate namespace? That way people who are experiencing conflicts over `import cchardet` have the option to depend on / `pip install faust-faust_cchardet` and use it as `import faust_cchardet`. The primary namespace for this fork could still be `cchardet`, but people could access it via the auto-sync'd, auto-published second package name to avoid the namespace overlap if they need/want that.
>
> Incomplete GitHub Actions idea

Hi, sorry I've been away! I've bitten off more than I could chew; I didn't expect this revival to become so important as a dependency. I'll file a PEP 541 request for cchardet and kafka-python, since I've been meaning to hand off these projects to people who have a more pertinent interest in them.

milahu commented 1 week ago

> Since the original cchardet project is clearly no longer maintained

This is no longer true. PyYoshi/cChardet had no PyPI releases between 2020-10-27 and 2024-06-06, but releases have since resumed.

diff

cd $(mktemp -d)
git clone --depth=1 https://github.com/PyYoshi/cChardet
cd cChardet/
git remote add faust-cchardet https://github.com/faust-streaming/cChardet
git fetch faust-cchardet master
git rev-parse master
# fa74a8e43a2685767296f4cc5bc4594d28713ab1
git rev-parse faust-cchardet/master
# 3af7068fc6f04dc777531da021057bfbe75313b2
git diff --stat master faust-cchardet/master -- src/cchardet/
git diff master faust-cchardet/master -- src/cchardet/

git diff --stat

 src/cchardet/__init__.py        | 10 ++--------
 src/cchardet/__main__.py        |  4 ----
 src/cchardet/_cchardet.pyx      | 43 ++++++++-----------------------------------
 src/cchardet/cli/__init__.py    |  0
 src/cchardet/cli/cchardetect.py | 40 ----------------------------------------
 src/cchardet/version.py         |  1 +
 6 files changed, 11 insertions(+), 87 deletions(-)
git diff

```diff
diff --git a/src/cchardet/__init__.py b/src/cchardet/__init__.py
index f616d7f..c6db442 100644
--- a/src/cchardet/__init__.py
+++ b/src/cchardet/__init__.py
@@ -1,7 +1,5 @@
-from . import _cchardet
-
-version = (2, 2, 0, "alpha", 3)
-__version__ = "2.2.0a3"
+from cchardet import _cchardet
+from .version import __version__
 
 
 def detect(msg):
@@ -17,10 +15,6 @@ def detect(msg):
     encoding, confidence = _cchardet.detect_with_confidence(msg)
     if isinstance(encoding, bytes):
         encoding = encoding.decode()
-
-    if encoding == "MAC-CENTRALEUROPE":
-        encoding = "maccentraleurope"
-
     return {"encoding": encoding, "confidence": confidence}
 
 
diff --git a/src/cchardet/__main__.py b/src/cchardet/__main__.py
deleted file mode 100644
index a3e0fd8..0000000
--- a/src/cchardet/__main__.py
+++ /dev/null
@@ -1,4 +0,0 @@
-from .cli.cchardetect import main
-
-if __name__ == "__main__":
-    main()
diff --git a/src/cchardet/_cchardet.pyx b/src/cchardet/_cchardet.pyx
index 27d9f55..75af096 100644
--- a/src/cchardet/_cchardet.pyx
+++ b/src/cchardet/_cchardet.pyx
@@ -1,26 +1,19 @@
-# coding: utf-8
-#cython: embedsignature=True, c_string_encoding=ascii, language_level=3
-
 cdef extern from *:
     ctypedef char* const_char_ptr "const char*"
-    ctypedef unsigned long size_t
 
-# uchardet v0.0.8
 cdef extern from "uchardet.h":
     ctypedef void* uchardet_t
     cdef uchardet_t uchardet_new()
     cdef void uchardet_delete(uchardet_t ud)
-    cdef int uchardet_handle_data(uchardet_t ud, const_char_ptr data, size_t length)
+    cdef int uchardet_handle_data(uchardet_t ud, const_char_ptr data, int length)
     cdef void uchardet_data_end(uchardet_t ud)
     cdef void uchardet_reset(uchardet_t ud)
     cdef const_char_ptr uchardet_get_charset(uchardet_t ud)
-    cdef float uchardet_get_confidence(uchardet_t ud, size_t i)
-    # cdef const_char_ptr uchardet_get_encoding(uchardet_t ud, size_t i)
-    # cdef const_char_ptr uchardet_get_language(uchardet_t ud, size_t i)
+    cdef float uchardet_get_confidence(uchardet_t ud)
 
 def detect_with_confidence(bytes msg):
-    cdef size_t length = len(msg)
-
+    cdef int length = len(msg)
+
     cdef uchardet_t ud = uchardet_new()
 
     cdef int result = uchardet_handle_data(ud, msg, length)
@@ -30,17 +23,8 @@ def detect_with_confidence(bytes msg):
 
     uchardet_data_end(ud)
 
-    cdef bytes detected_charset
-    # cdef bytes detected_encoding
-    # cdef const_char_ptr detected_language
-    cdef float detected_confidence
-
-    detected_charset = uchardet_get_charset(ud)
-    # detected_encoding = uchardet_get_encoding(ud, 0)
-    # detected_language = uchardet_get_language(ud, 0)
-    detected_confidence = uchardet_get_confidence(ud, 0)
-
-    uchardet_reset(ud)
+    cdef bytes detected_charset = uchardet_get_charset(ud)
+    cdef float detected_confidence = uchardet_get_confidence(ud)
     uchardet_delete(ud)
 
     if detected_charset:
@@ -53,8 +37,6 @@ cdef class UniversalDetector:
     cdef int _done
     cdef int _closed
     cdef bytes _detected_charset
-    # cdef bytes _detected_encoding
-    # cdef const_char_ptr _detected_language
     cdef float _detected_confidence
 
     def __init__(self):
@@ -62,8 +44,6 @@ cdef class UniversalDetector:
         self._done = 0
         self._closed = 0
         self._detected_charset = b""
-        # self._detected_encoding = b""
-        # self._detected_language = b""
         self._detected_confidence = 0.0
 
     def reset(self):
@@ -71,8 +51,6 @@ cdef class UniversalDetector:
             self._done = 0
             self._closed = 0
             self._detected_charset = b""
-            # self._detected_encoding = b""
-            # self._detected_language = b""
             self._detected_confidence = 0.0
             uchardet_reset(self._ud)
 
@@ -95,18 +73,13 @@ cdef class UniversalDetector:
                 self._done = 1
 
             self._detected_charset = uchardet_get_charset(self._ud)
-            # self._detected_encoding = uchardet_get_encoding(self._ud, 0)
-            # self._detected_language = uchardet_get_language(self._ud, 0)
-            self._detected_confidence = uchardet_get_confidence(self._ud, 0)
+            self._detected_confidence = uchardet_get_confidence(self._ud)
 
     def close(self):
         if not self._closed:
             uchardet_data_end(self._ud)
-
             self._detected_charset = uchardet_get_charset(self._ud)
-            # self._detected_encoding = uchardet_get_encoding(self._ud, 0)
-            # self._detected_language = uchardet_get_language(self._ud, 0)
-            self._detected_confidence = uchardet_get_confidence(self._ud, 0)
+            self._detected_confidence = uchardet_get_confidence(self._ud)
 
             uchardet_delete(self._ud)
             self._closed = 1
diff --git a/src/cchardet/cli/__init__.py b/src/cchardet/cli/__init__.py
deleted file mode 100644
index e69de29..0000000
diff --git a/src/cchardet/cli/cchardetect.py b/src/cchardet/cli/cchardetect.py
deleted file mode 100755
index 485174c..0000000
--- a/src/cchardet/cli/cchardetect.py
+++ /dev/null
@@ -1,40 +0,0 @@
-import argparse
-import sys
-
-from .. import UniversalDetector, __version__
-
-
-def read_chunks(f, chunk_size):
-    chunk = f.read(chunk_size)
-    while chunk:
-        yield chunk
-        chunk = f.read(chunk_size)
-
-
-def main():
-    parser = argparse.ArgumentParser()
-    parser.add_argument(
-        "files",
-        nargs="*",
-        help="Files to detect encoding of",
-        type=argparse.FileType("rb"),
-        default=[sys.stdin.buffer],
-    )
-    parser.add_argument("--chunk-size", type=int, default=(256 * 1024))
-    parser.add_argument("--version", action="version", version="%(prog)s {0}".format(__version__))
-    args = parser.parse_args()
-
-    for f in args.files:
-        detector = UniversalDetector()
-        for chunk in read_chunks(f, args.chunk_size):
-            detector.feed(chunk)
-        detector.close()
-        print(
-            "{file.name}: {result[encoding]} with confidence {result[confidence]}".format(
-                file=f, result=detector.result
-            )
-        )
-
-
-if __name__ == "__main__":
-    main()
diff --git a/src/cchardet/version.py b/src/cchardet/version.py
new file mode 100644
index 0000000..f43fee1
--- /dev/null
+++ b/src/cchardet/version.py
@@ -0,0 +1 @@
+__version__ = '2.1.19'
```
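One behavioural difference visible in the diff: upstream's `detect()` rewrites the name `MAC-CENTRALEUROPE` to `maccentraleurope`, while the fork's `detect()` returns the name unchanged. Code that has to behave the same with either package could apply that mapping itself. A minimal sketch follows; `normalize_result` and `_ALIASES` are illustrative names and not part of either package.

```python
# Mapping copied from the lines that exist upstream but not in the fork.
_ALIASES = {"MAC-CENTRALEUROPE": "maccentraleurope"}


def normalize_result(result):
    """Return a copy of a detect()-style dict with the encoding name mapped."""
    encoding = result.get("encoding")
    return {**result, "encoding": _ALIASES.get(encoding, encoding)}
```

Assuming the usual `{"encoding": ..., "confidence": ...}` result shape shown in the diff, a call would look like `normalize_result(cchardet.detect(data))`.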