adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

build(deps): bump the dependencies group with 5 updates #737

Status: Closed (dependabot[bot] closed this 3 weeks ago)

dependabot[bot] commented 3 weeks ago

Bumps the dependencies group with 5 updates:

Package       From     To
trafilatura   1.10.0   1.12.2
pandas        2.2.2    2.2.3
tqdm          4.66.4   4.66.6
news-please   1.5.44   1.6.13
resiliparse   0.14.7   0.14.9
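
To make the size of each bump easier to judge, the table above can be classified by which version component changed. This is a minimal sketch, not part of the pull request: the `Bump` type and `bump_kind` helper are hypothetical names, and the code assumes every version is a plain dotted triple of integers, which holds for the five packages listed here but not for version strings in general.

```python
from typing import NamedTuple


class Bump(NamedTuple):
    """One row of the dependency table (hypothetical helper type)."""
    package: str
    old: str
    new: str


def bump_kind(old: str, new: str) -> str:
    """Return the first version component that differs.

    Assumes both versions are dotted integer triples (MAJOR.MINOR.PATCH).
    """
    old_parts = [int(part) for part in old.split(".")]
    new_parts = [int(part) for part in new.split(".")]
    for name, a, b in zip(("major", "minor", "patch"), old_parts, new_parts):
        if a != b:
            return name
    return "none"


# Rows copied from the table in this pull request.
BUMPS = [
    Bump("trafilatura", "1.10.0", "1.12.2"),
    Bump("pandas", "2.2.2", "2.2.3"),
    Bump("tqdm", "4.66.4", "4.66.6"),
    Bump("news-please", "1.5.44", "1.6.13"),
    Bump("resiliparse", "0.14.7", "0.14.9"),
]

for bump in BUMPS:
    print(f"{bump.package}: {bump.old} -> {bump.new} ({bump_kind(bump.old, bump.new)})")
```

Under that assumption, trafilatura and news-please are minor bumps and the other three are patch bumps. A minor bump is not necessarily safe: the trafilatura release notes quoted below list breaking changes in both 1.11.0 and 1.12.0.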

Updates trafilatura from 1.10.0 to 1.12.2

Release notes

Sourced from trafilatura's releases.

trafilatura-1.12.2

  • downloads: add support for SOCKS proxies with @gremid (#682)
  • extraction fix: ValueError in table spans (#685)
  • spider: prune_xpath parameter added by @felipehertzer (#684)
  • spider: relax strict parameter for link extraction (#687)
  • sitemaps: max_sitemaps parameter added by @felipehertzer (#690)
  • maintenance: make compression libraries optional (#691)
  • metadata: review and lint code (#694)

trafilatura-1.12.1

Navigation:

  • spider: restrict search to sections containing URL path (#673)
  • crawler: add parameter class and types, breaking change for undocumented functions (#675)
  • maintenance: simplify link discovery and extend tests (#674)
  • CLI: review code, add types and tests (#677)

Bugfixes:

  • fix AttributeError in element deletion (#668)
  • fix MemoryError in table header columns (#665)

Docs:

  • docs: fix variable name for extract_metadata in quickstart by @jpigla in #678

trafilatura-1.12.0

Breaking change:

  • enforce fixed list of output formats, deprecate -out on the CLI (#647)

Faster, more accurate extraction:

  • review link and structure checks (#653)
  • improve justext fallback (#652)
  • baseline: prevent LXML error in JSON-LD (#643), do not use as backup extraction (#646)
  • review XPaths for undesirable content (#645)

Bugfixes and maintenance:

  • CLI fix: markdown format should trigger include_formatting (#649)
  • images fix: use a length threshold on src attribute (#654)
  • XML-TEI: replace RelaxNG by DTD, remove pickle, and update (#655)
  • formatting & markdown fix: add newlines (#656)
  • table fix: prevent MemoryError & ValueError during conversion to text (#658)

Documentation:

trafilatura-1.11.0

Breaking change:

  • metadata now skipped by default (#613), to trigger inclusion in all output formats:
    • with_metadata=True (Python)
    • --with-metadata (CLI)

... (truncated)

Commits


Updates pandas from 2.2.2 to 2.2.3

Release notes

Sourced from pandas's releases.

Pandas 2.2.3

We are pleased to announce the release of pandas 2.2.3. This release includes some new features, bug fixes, and performance improvements. We recommend that all users upgrade to this version.

See the full whatsnew for a list of all the changes. Pandas 2.2.3 supports Python 3.9 and higher.

The release will be available on the defaults and conda-forge channels:

conda install pandas

Or via PyPI:

python3 -m pip install --upgrade pandas

Please report any issues with the release on the pandas issue tracker.

Thanks to all the contributors who made this release possible.

Commits


Updates tqdm from 4.66.4 to 4.66.6

Release notes

Sourced from tqdm's releases.

tqdm v4.66.6 stable

  • cli: zip-safe --manpath, --comppath (#1627)
  • misc framework updates (#1627)
    • fix pytest DeprecationWarning
    • fix snapcraft build
    • fix nbval DeprecationWarning
    • update & tidy workflows
    • bump pre-commit
    • docs: update URLs

tqdm v4.66.5 stable

Commits


Updates news-please from 1.5.44 to 1.6.13

Commits


Updates resiliparse from 0.14.7 to 0.14.9

Commits
  • 3367e55 Bump version number
  • ea7dceb Include py.typed file & fix and/or add missing/wrong type hints in stub files...
  • 066e5a1 Remove unnecessary lvalue allocation
  • d9f785a Set PR number and SHA hash in Codecov upload
  • 774e33f Add explicit all to modules
  • 360a19c Fix warnings
  • feb6cce Use PR label as branch name
  • f91eb4c Don't trigger Docker rebuild on tags
  • 3f018eb Bump version number
  • 06e3592 Update action versions
  • Additional commits viewable in compare view


Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR:

  • `@dependabot rebase` will rebase this PR
  • `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
  • `@dependabot merge` will merge this PR after your CI passes on it
  • `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
  • `@dependabot cancel merge` will cancel a previously requested merge and block automerging
  • `@dependabot reopen` will reopen this PR if it is closed
  • `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
  • `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency
  • `@dependabot ignore <dependency name> major version` will close this group update PR and stop Dependabot creating any more for the specific dependency's major version (unless you unignore this specific dependency's major version or upgrade to it yourself)
  • `@dependabot ignore <dependency name> minor version` will close this group update PR and stop Dependabot creating any more for the specific dependency's minor version (unless you unignore this specific dependency's minor version or upgrade to it yourself)
  • `@dependabot ignore <dependency name>` will close this group update PR and stop Dependabot creating any more for the specific dependency (unless you unignore this specific dependency or upgrade to it yourself)
  • `@dependabot unignore <dependency name>` will remove all of the ignore conditions of the specified dependency
  • `@dependabot unignore <dependency name> <ignore condition>` will remove the ignore condition of the specified dependency and ignore conditions

dependabot[bot] commented 3 weeks ago

This pull request was built based on a group rule. Closing it will not ignore any of these versions in future pull requests.

To ignore these dependencies, configure ignore rules in dependabot.yml
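
As an illustration of such ignore rules, a dependabot.yml along the following lines would keep a grouped update while skipping major-version bumps for one package. This is a sketch under assumptions, not this repository's actual configuration: the `pip` ecosystem, the `/` directory, the weekly schedule, the catch-all group pattern, and the choice of pandas as the ignored dependency are all illustrative.

```yaml
# Hypothetical dependabot.yml; values are illustrative, not taken from this repository.
version: 2
updates:
  - package-ecosystem: "pip"      # assumed ecosystem
    directory: "/"                # assumed location of the requirements file
    schedule:
      interval: "weekly"          # assumed schedule
    groups:
      dependencies:               # group name matching the PR title
        patterns:
          - "*"                   # assumed: group every dependency together
    ignore:
      # Example rule: never propose major-version updates for pandas.
      - dependency-name: "pandas"
        update-types: ["version-update:semver-major"]
```

Unlike closing a grouped pull request, rules under `ignore` persist across future Dependabot runs until they are removed from the file.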