PabloCastellano / bormeparser

A Python library for parsing BORME files (Boletín Oficial del Registro Mercantil in Spain).
GNU General Public License v3.0

Update pypdf2 to 2.1.0 #78

Closed. pyup-bot closed this 2 years ago

pyup-bot commented 2 years ago

This PR updates PyPDF2 from 1.26.0 to 2.1.0.

Changelog

### 2.1.0
```
-------------------------

The highlight of the 2.1.0 release is the most massive improvement to the
text extraction capabilities of PyPDF2 since 2016 🥳🎊 A very big thank you goes
to [pubpub-zz](https://github.com/pubpub-zz) who took a lot of time and
knowledge about the PDF format to finally get those improvements into PyPDF2.
Thank you 🤗💚

In case the new function causes any issues, you can use `_extract_text_old`
for the old functionality. Please also open a bug ticket in that case.

There were several people who have attempted to bring similar improvements to
PyPDF2. All of those were valuable. The main reason why they didn't get merged
is the large number of open PRs / issues. pubpub-zz was the most comprehensive PR
which also incorporated the latest changes of PyPDF2 2.0.0.

Thank you to [VictorCarlquist](https://github.com/VictorCarlquist) for #858 and
[asabramo](https://github.com/asabramo) for #464 🤗

New Features (ENH):
- Massive text extraction improvement (924). Closed many open issues:
  - Exceptions / missing spaces in extract_text() method (17) 🕺
  - Whitespace issues in extract_text() (42) 💃
  - pypdf2 reads the hyphenated words in a new line (246)
  - PyPDF2 failing to read unicode character (37)
  - Unable to read bullets (230)
  - ExtractText yields nothing for apparently good PDF (168) 🎉
  - Encoding issue in extract_text() (235)
  - extractText() doesn't work on Chinese PDF (252)
  - encoding error (260)
  - Trouble with apostrophes in names in text "O'Doul" (384)
  - extract_text works for some PDF files, but not the others (437)
  - Euro sign not being recognized by extractText (443)
  - Failed extracting text from French texts (524)
  - extract_text doesn't extract ligatures correctly (598)
  - reading spanish text - mark convert issue (635)
  - Read PDF changed from text to random symbols (654)
  - .extractText() reads / as 1. (789)
- Update glyphlist (947) - inspired by 464
- Allow adding PageRange objects (948)

Bug Fixes (BUG):
- Delete .python-version file (944)
- Compare StreamObject.decoded_self with None (931)

Robustness (ROB):
- Fix some conversion errors on non conform PDF (932)

Documentation (DOC):
- Elaborate on PDF text extraction difficulties (939)
- Add logo (942)
- rotate vs Transformation().rotate (937)
- Example how to use PyPDF2 with AWS S3 (938)
- How to deprecate (930)
- Fix typos on robustness page (935)
- Remove scripts (pdfcat) from docs (934)

Developer Experience (DEV):
- Ignore .python-version file
- Mark deprecated code with no-cover (943)
- Automatically create Github releases from tags (870)

Testing (TST):
- Text extraction for non-latin alphabets (954)
- Ignore PdfReadWarning in benchmark (949)
- writer.remove_text (946)
- Add test for Tree and _security (945)

Code Style (STY):
- black, isort, Flake8, splitting buildCharMap (950)

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.0.0...2.1.0
```

### 2.0.0
```
and variable-names as well as using properties instead of getter-methods.

Maintenance (MAINT):
- Remove IronPython Fallback for zlib (868)

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/1.27.12...1.27.13

Deprecations (DEP)

* Make the `PyPDF2.utils` module private
* Rename of core classes:
  * PdfFileReader ➔ PdfReader
  * PdfFileWriter ➔ PdfWriter
  * PdfFileMerger ➔ PdfMerger
* Use PEP8 conventions for function names and parameters
* If a property and a getter-method are both present, use the property

Details

In many places:
- getObject ➔ get_object
- writeToStream ➔ write_to_stream
- readFromStream ➔ read_from_stream

PyPDF2.generic
- readObject ➔ read_object
- convertToInt ➔ convert_to_int
- DocumentInformation.getText ➔ DocumentInformation._get_text : This method should typically not be used; please let me know if you need it.

PdfReader class:
- `reader.getPage(pageNumber)` ➔ `reader.pages[page_number]`
- `reader.getNumPages()` / `reader.numPages` ➔ `len(reader.pages)`
- getDocumentInfo ➔ metadata
- flattenedPages attribute ➔ flattened_pages
- resolvedObjects attribute ➔ resolved_objects
- xrefIndex attribute ➔ xref_index
- getNamedDestinations / namedDestinations attribute ➔ named_destinations
- getPageLayout / pageLayout ➔ page_layout attribute
- getPageMode / pageMode ➔ page_mode attribute
- getIsEncrypted / isEncrypted ➔ is_encrypted attribute
- getOutlines ➔ get_outlines
- readObjectHeader ➔ read_object_header
- cacheGetIndirectObject ➔ cache_get_indirect_object
- cacheIndirectObject ➔ cache_indirect_object
- getDestinationPageNumber ➔ get_destination_page_number
- readNextEndLine ➔ read_next_end_line
- _zeroXref ➔ _zero_xref
- _authenticateUserPassword ➔ _authenticate_user_password
- _pageId2Num attribute ➔ _page_id2num
- _buildDestination ➔ _build_destination
- _buildOutline ➔ _build_outline
- _getPageNumberByIndirect(indirectRef) ➔ _get_page_number_by_indirect(indirect_ref)
- _getObjectFromStream ➔ _get_object_from_stream
- _decryptObject ➔ _decrypt_object
- _flatten(..., indirectRef) ➔ _flatten(..., indirect_ref)
- _buildField ➔ _build_field
- _checkKids ➔ _check_kids
- _writeField ➔ _write_field
- _write_field(..., fieldAttributes) ➔ _write_field(..., field_attributes)
- _read_xref_subsections(..., getEntry, ...) ➔ _read_xref_subsections(..., get_entry, ...)

PdfWriter class:
- `writer.getPage(pageNumber)` ➔ `writer.pages[page_number]`
- `writer.getNumPages()` ➔ `len(writer.pages)`
- addMetadata ➔ add_metadata
- addPage ➔ add_page
- addBlankPage ➔ add_blank_page
- addAttachment(fname, fdata) ➔ add_attachment(filename, data)
- insertPage ➔ insert_page
- insertBlankPage ➔ insert_blank_page
- appendPagesFromReader ➔ append_pages_from_reader
- updatePageFormFieldValues ➔ update_page_form_field_values
- cloneReaderDocumentRoot ➔ clone_reader_document_root
- cloneDocumentFromReader ➔ clone_document_from_reader
- getReference ➔ get_reference
- getOutlineRoot ➔ get_outline_root
- getNamedDestRoot ➔ get_named_dest_root
- addBookmarkDestination ➔ add_bookmark_destination
- addBookmarkDict ➔ add_bookmark_dict
- addBookmark ➔ add_bookmark
- addNamedDestinationObject ➔ add_named_destination_object
- addNamedDestination ➔ add_named_destination
- removeLinks ➔ remove_links
- removeImages(ignoreByteStringObject) ➔ remove_images(ignore_byte_string_object)
- removeText(ignoreByteStringObject) ➔ remove_text(ignore_byte_string_object)
- addURI ➔ add_uri
- addLink ➔ add_link
- getPage(pageNumber) ➔ get_page(page_number)
- getPageLayout / setPageLayout / pageLayout ➔ page_layout attribute
- getPageMode / setPageMode / pageMode ➔ page_mode attribute
- _addObject ➔ _add_object
- _addPage ➔ _add_page
- _sweepIndirectReferences ➔ _sweep_indirect_references

PdfMerger class
- `__init__` parameter: strict=True ➔ strict=False (the PdfFileMerger still has the old default)
- addMetadata ➔ add_metadata
- addNamedDestination ➔ add_named_destination
- setPageLayout ➔ set_page_layout
- setPageMode ➔ set_page_mode

Page class:
- artBox / bleedBox / cropBox / mediaBox / trimBox ➔ artbox / bleedbox / cropbox / mediabox / trimbox
- getWidth, getHeight ➔ width / height
- getLowerLeft_x / getUpperLeft_x ➔ left
- getUpperRight_x / getLowerRight_x ➔ right
- getLowerLeft_y / getLowerRight_y ➔ bottom
- getUpperRight_y / getUpperLeft_y ➔ top
- getLowerLeft / setLowerLeft ➔ lower_left property
- upperRight ➔ upper_right
- mergePage ➔ merge_page
- rotateClockwise / rotateCounterClockwise ➔ rotate_clockwise
- _mergeResources ➔ _merge_resources
- _contentStreamRename ➔ _content_stream_rename
- _pushPopGS ➔ _push_pop_gs
- _addTransformationMatrix ➔ _add_transformation_matrix
- _mergePage ➔ _merge_page

XmpInformation class:
- getElement(..., aboutUri, ...) ➔ get_element(..., about_uri, ...)
- getNodesInNamespace(..., aboutUri, ...) ➔ get_nodes_in_namespace(..., aboutUri, ...)
- _getText ➔ _get_text

utils.py:
- matrixMultiply ➔ matrix_multiply
- RC4_encrypt is moved to the security module
```

### 1.28.4
```
--------------------------

Bug Fixes (BUG):
- XmpInformation._converter_date was unusable (921)

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/1.28.3...1.28.4
```

### 1.28.3
```
--------------------------

Deprecations (DEP):
- PEP8 renaming (905)

Bug Fixes (BUG):
- XmpInformation missing method _getText (917)
- Fix PendingDeprecationWarning on _merge_page (904)

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/1.28.2...1.28.3
```

### 1.28.2
```
--------------------------

Bug Fixes (BUG):
- PendingDeprecationWarning for getContents (893)
- PendingDeprecationWarning on using PdfMerger (891)

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/1.28.1...1.28.2
```

### 1.28.1
```
--------------------------

Bug Fixes (BUG):
- Incorrectly show deprecation warnings on internal usage (887)

Maintenance (MAINT):
- Add stacklevel=2 to deprecation warnings (889)
- Remove duplicate warnings imports (888)

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/1.28.0...1.28.1
```

### 1.28.0
```
--------------------------

This release adds a lot of deprecation warnings in preparation of the
```

### 1.27.12
```
---------------------------

Bug Fixes (BUG):
- _rebuild_xref_table expects trailer to be a dict (857)

Documentation (DOC):
- Security Policy

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/1.27.11...1.27.12
```

### 1.27.11
```
---------------------------

Bug Fixes (BUG):
- Incorrectly issued xref warning/exception (855)

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/1.27.10...1.27.11
```

### 1.27.10
```
---------------------------

Robustness (ROB):
- Handle missing destinations in reader (840)
- warn-only in readStringFromStream (837)
- Fix corruption in startxref or xref table (788 and 830)

Documentation (DOC):
- Project Governance (799)
- History of PyPDF2
- PDF feature/version support (816)
- More details on text parsing issues (815)

Developer Experience (DEV):
- Add benchmark command to Makefile
- Ignore IronPython parts for code coverage (826)

Maintenance (MAINT):
- Split pdf module (836)
- Separated CCITTFax param parsing/decoding (841)
- Update requirements files

Testing (TST):
- Use external repository for larger/more PDFs for testing (820)
- Swap incorrect test names (838)
- Add test for PdfFileReader and page properties (835)
- Add tests for PyPDF2.generic (831)
- Add tests for utils, form fields, PageRange (827)
- Add test for ASCII85Decode (825)
- Add test for FlateDecode (823)
- Add test for filters.ASCIIHexDecode (822)

Code Style (STY):
- Apply pre-commit (black, isort) + use snake_case variables (832)
- Remove debug code (828)
- Documentation, Variable names (839)

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/1.27.9...1.27.10
```

### 1.27.9
```
--------------------------

A change I would like to highlight is the performance improvement for large
PDF files (808) 🎉

New Features (ENH):
- Add papersizes (800)
- Allow setting permission flags when encrypting (803)
- Allow setting form field flags (802)

Bug Fixes (BUG):
- TypeError in xmp._converter_date (813)
- Improve spacing for text extraction (806)
- Fix PDFDocEncoding Character Set (809)

Robustness (ROB):
- Use null ID when encrypted but no ID given (812)
- Handle recursion error (804)

Documentation (DOC):
- CMaps (811)
- The PDF Format + commit prefixes (810)
- Add compression example (792)

Developer Experience (DEV):
- Add Benchmark for Performance Testing (781)

Maintenance (MAINT):
- Validate PDF magic byte in strict mode (814)
- Make PdfFileMerger.addBookmark() behave like PdfFileWriter's (339)
- Quadratic runtime while parsing reduced to linear (808)

Testing (TST):
- Newlines in text extraction (807)

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/1.27.8...1.27.9
```

### 1.27.8
```
--------------------------

Bug Fixes (BUG):
- Use 1MB as offset for readNextEndLine (321)
- 'PdfFileWriter' object has no attribute 'stream' (787)

Robustness (ROB):
- Invalid float object; use 0 as fallback (782)

Documentation (DOC):
- Robustness (785)

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/1.27.7...1.27.8
```

### 1.27.7
```
--------------------------

Bug Fixes (BUG):
- Import exceptions from PyPDF2.errors in PyPDF2.utils (780)

Code Style (STY):
- Naming in 'make_changelog.py'
```

### 1.27.6
```
--------------------------

Deprecations (DEP):
- Remove support for Python 2.6 and older (776)

New Features (ENH):
- Extract document permissions (320)

Bug Fixes (BUG):
- Clip by trimBox when merging pages, which would otherwise be ignored (240)
- Add overwriteWarnings parameter PdfFileMerger (243)
- IndexError for getPage() of decrypted file (359)
- Handle cases where decodeParms is an ArrayObject (405)
- Updated PDF fields don't show up when page is written (412)
- Set Linked Form Value (414)
- Fix zlib -5 error for corrupt files (603)
- Fix reading more than last1K for EOF (642)
- Accidental import

Robustness (ROB):
- Allow extra whitespace before "obj" in readObjectHeader (567)

Documentation (DOC):
- Link to pdftoc in Sample_Code (628)
- Working with annotations (764)
- Structure history

Developer Experience (DEV):
- Add issue templates (765)
- Add tool to generate changelog

Maintenance (MAINT):
- Use grouped constants instead of string literals (745)
- Add error module (768)
- Use decorators for staticmethod (775)
- Split long functions (777)

Testing (TST):
- Run tests in CI once with -OO Flags (770)
- Filling out forms (771)
- Add tests for Writer (772)
- Error cases (773)
- Check Error messages (769)
- Regression test for issue 88
- Regression test for issue 327

Code Style (STY):
- Make variable naming more consistent in tests

All changes: https://github.com/py-pdf/PyPDF2/compare/1.27.5...1.27.6
```

### 1.27.5
```
--------------------------

Security (SEC):
- ContentStream_readInlineImage had potential infinite loop (740)

Bug fixes (BUG):
- Fix merging encrypted files (757)
- CCITTFaxDecode decodeParms can be an ArrayObject (756)

Robustness improvements (ROBUST):
- title sometimes None (744)

Documentation (DOC):
- Adjust short description of the package

Tests and Test setup (TST):
- Rewrite JS tests from unittest to pytest (746)
- Increase Test coverage, mainly with filters (756)
- Add test for inline images (758)

Developer Experience Improvements (DEV):
- Remove unused Travis-CI configuration (747)
- Show code coverage (754, 755)
- Add mutmut (760)

Miscellaneous:
- STY: Closing file handles, explicit exports, ... (743)

All changes: https://github.com/py-pdf/PyPDF2/compare/1.27.4...1.27.5
```

### 1.27.4
```
--------------------------

Bug fixes (BUG):
- Guard formatting of __init__.__doc__ string (738)

Packaging (PKG):
- Add more precise license field to setup (733)

Testing (TST):
- Add test for issue 297

Miscellaneous:
- DOC: Miscallenious ➔ Miscellaneous (Typo)
- TST: Fix CI triggering (master ➔ main) (739)
- STY: Fix various style issues (742)

All changes: https://github.com/py-pdf/PyPDF2/compare/1.27.3...1.27.4
```

### 1.27.3
```
--------------------------

- PKG: Make Tests not a subpackage (728)
- BUG: Fix ASCII85Decode.decode assertion (729)
- BUG: Error in Chinese character encoding (463)
- BUG: Code duplication in Scripts/2-up.py
- ROBUST: Guard 'obj.writeToStream' with 'if obj is not None'
- ROBUST: Ignore a /Prev entry with the value 0 in the trailer
- MAINT: Remove Sample_Code (726)
- TST: Close file handle in test_writer (722)
- TST: Fix test_get_images (730)
- DEV: Make tox use pytest and add more Python versions (721)
- DOC: Many (720, 723-725, 469)

All changes: https://github.com/py-pdf/PyPDF2/compare/1.27.2...1.27.3
```

### 1.27.2
```
--------------------------

- Add Scripts (including `pdfcat`), Resources, Tests, and Sample_Code back to
  PyPDF2. They were removed by accident in 1.27.0, but might get removed with
  2.0.0. See https://github.com/py-pdf/PyPDF2/discussions/718 for discussion.

All changes: https://github.com/py-pdf/PyPDF2/compare/1.27.1...1.27.2
```

### 1.27.1
```
--------------------------

- Fixed project links on PyPI page after migration from mstamy2 to MartinThoma
  to the py-pdf organization on GitHub
- Documentation is now at https://pypdf2.readthedocs.io/en/latest/

All changes: https://github.com/py-pdf/PyPDF2/compare/1.27.0...1.27.1
```

### 1.27.0
```
--------------------------

Features:
- Add alpha channel support for png files in Script (614)

Bug fixes (BUG):
- Fix formatWarning for filename without slash (612)
- Add whitespace between words for extractText() (569, 334)
- "invalid escape sequence" SyntaxError (522)
- Avoid error when printing warning in pythonw (486)
- Stream operations can be List or Dict (665)

Documentation (DOC):
- Added Scripts/pdf-image-extractor.py
- Documentation improvements (550, 538, 324, 426, 394)

Tests and Test setup (TST):
- Add Github Action which automatically run unit tests via pytest and static
  code analysis with Flake8 (660)
- Add several unit tests (661, 663)
- Add .coveragerc to create coverage reports

Developer Experience Improvements (DEV):
- Pre commit: Developers can now `pre-commit install` to avoid tiny issues like
  trailing whitespaces

Miscellaneous:
- Add the LICENSE file to the distributed packages (288)
- Use setuptools instead of distutils (599)
- Improvements for the PyPI page (644)
- Python 3 changes (504, 366)

You can see the full changelog at: https://github.com/py-pdf/PyPDF2/compare/1.26.0...1.27.0
```
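
Because this update crosses the 2.0.0 renames listed in the changelog above, a project pinned to 1.26.0 may still call the old camelCase names. The following is a minimal, standard-library-only sketch of how one might scan source text for a few of those names before upgrading. The helper `find_deprecated_names` and the `RENAMES` table are hypothetical illustrations, not part of PyPDF2 or pyup; the table holds only a small subset of the renames from the changelog and would need extending for real use.

```python
import re
from typing import List, Tuple

# Hypothetical lookup table: a small subset of the 2.0.0 renames
# quoted in the changelog above (old name -> new name).
RENAMES = {
    "PdfFileReader": "PdfReader",
    "PdfFileWriter": "PdfWriter",
    "PdfFileMerger": "PdfMerger",
    "getDocumentInfo": "metadata",
    "addPage": "add_page",
    "addBlankPage": "add_blank_page",
    "mergePage": "merge_page",
}

# Match any old name as a whole identifier. Because "_" counts as a word
# character, private names such as "_addPage" are not matched by "addPage".
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, RENAMES)) + r")\b")


def find_deprecated_names(source: str) -> List[Tuple[int, str, str]]:
    """Return (line number, old name, suggested new name) for each hit."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            old = match.group(1)
            hits.append((lineno, old, RENAMES[old]))
    return hits
```

Running the helper over each `.py` file in a repository gives a quick list of call sites to review. It is a plain text search, so it will also flag matches inside comments and strings, and it only suggests a name: some renames change the call shape as well (for example, `getDocumentInfo` becomes the `metadata` property according to the changelog), so each hit still needs a manual look.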
Links
- PyPI: https://pypi.org/project/pypdf2
- Changelog: https://pyup.io/changelogs/pypdf2/
- Docs: https://pypdf2.readthedocs.io/en/latest/

pyup-bot commented 2 years ago

Closing this in favor of #79