adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

build(deps): bump the dependencies group with 2 updates #669

Status: Closed (dependabot[bot] closed this pull request 3 months ago)

dependabot[bot] commented 3 months ago

Bumps the dependencies group with 2 updates: trafilatura and news-please.

Updates trafilatura from 1.10.0 to 1.12.0

Release notes

Sourced from trafilatura's releases.

trafilatura-1.12.0

Breaking change:

  • enforce fixed list of output formats, deprecate -out on the CLI (#647)
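Because unrecognized format names are no longer accepted, callers may want to validate a user-supplied format before passing it on. The following is a minimal sketch only: `ASSUMED_FORMATS` and `check_output_format` are hypothetical names, and the allow-list is an assumption derived from the formats named in the project description (CSV, JSON, HTML, MD, TXT, XML), not the library's authoritative list.

```python
# Hypothetical allow-list based on the formats named in the project description.
# The identifiers that trafilatura itself accepts may differ.
ASSUMED_FORMATS = frozenset({"csv", "json", "html", "markdown", "txt", "xml"})


def check_output_format(name: str) -> str:
    """Normalize a format name and reject anything outside the assumed list."""
    normalized = name.strip().lower()
    if normalized not in ASSUMED_FORMATS:
        raise ValueError(
            f"unsupported output format {name!r}; "
            f"expected one of {sorted(ASSUMED_FORMATS)}"
        )
    return normalized
```

Failing early in the calling code keeps the error message under the caller's control instead of relying on how the library reports an invalid format.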

Faster, more accurate extraction:

  • review link and structure checks (#653)
  • improve justext fallback (#652)
  • baseline: prevent LXML error in JSON-LD (#643), do not use as backup extraction (#646)
  • review XPaths for undesirable content (#645)

Bugfixes and maintenance:

  • CLI fix: markdown format should trigger include_formatting (#649)
  • images fix: use a length threshold on src attribute (#654)
  • XML-TEI: replace RelaxNG by DTD, remove pickle, and update (#655)
  • formatting & markdown fix: add newlines (#656)
  • table fix: prevent MemoryError & ValueError during conversion to text (#658)

Documentation:

trafilatura-1.11.0

Breaking change:

  • metadata now skipped by default (#613), to trigger inclusion in all output formats:
    • with_metadata=True (Python)
    • --with-metadata (CLI)
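Code that relied on metadata appearing in the output therefore has to request it explicitly after upgrading. A minimal sketch, assuming JSON output is wanted; `build_extract_kwargs` and `extract_json` are hypothetical helper names, while `with_metadata` is the parameter named in the release note above:

```python
from typing import Any, Dict, Optional


def build_extract_kwargs(include_metadata: bool = True) -> Dict[str, Any]:
    """Keyword arguments for trafilatura.extract(); metadata is opt-in since 1.11.0."""
    return {"output_format": "json", "with_metadata": include_metadata}


def extract_json(html: str, include_metadata: bool = True) -> Optional[str]:
    """Return the extraction result as a JSON string, or None if nothing was extracted."""
    # Imported lazily so this module still loads where trafilatura is not installed.
    from trafilatura import extract

    return extract(html, **build_extract_kwargs(include_metadata))
```

On the command line, the corresponding switch is the `--with-metadata` flag listed above.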

Extraction:

  • add HTML as output format (#614)
  • better and faster baseline extraction (#619)
  • better handling of HTML/XML elements (#628)
  • XPath rules added with @felipehertzer (#540)
  • fix: avoid faulty readability_lxml content (#635)

Evaluation:

Maintenance:

  • docs extended and updated, added page on deduplication (#618)
  • review code, add tests and types in part of the submodules (#620, #623, #624, #625)

Commits
  • c60395c prepare v1.12.0 (#664)
  • 9338dff main extraction: refactor link and structure filters (#653)
  • 856f4b2 table fix: MemoryError & ValueError during conversion to text (#658)
  • c50f18b formatting & markdown fix: add newlines (#656)
  • e9921b3 XML-TEI: replace pickled RelaxNG by up-to-date DTD file (#655)
  • 0c44b71 extraction: improve and simplify justext fallback (#652)
  • 555639c images fix: use length threshold on src attribute (#654)
  • 7e51a4e extraction: review XPaths for undesirable content (#645)
  • d980024 CLI fix: markdown formats should trigger include_formatting (#649)
  • 30c34a5 output formats: enforce fixed list, deprecate -out on the CLI (#647)
  • Additional commits viewable in compare view


Updates news-please from 1.5.44 to 1.6.13

Commits


Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR:

  • `@dependabot rebase` will rebase this PR
  • `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
  • `@dependabot merge` will merge this PR after your CI passes on it
  • `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
  • `@dependabot cancel merge` will cancel a previously requested merge and block automerging
  • `@dependabot reopen` will reopen this PR if it is closed
  • `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
  • `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency
  • `@dependabot ignore <dependency name> major version` will close this group update PR and stop Dependabot creating any more for the specific dependency's major version (unless you unignore this specific dependency's major version or upgrade to it yourself)
  • `@dependabot ignore <dependency name> minor version` will close this group update PR and stop Dependabot creating any more for the specific dependency's minor version (unless you unignore this specific dependency's minor version or upgrade to it yourself)
  • `@dependabot ignore <dependency name>` will close this group update PR and stop Dependabot creating any more for the specific dependency (unless you unignore this specific dependency or upgrade to it yourself)
  • `@dependabot unignore <dependency name>` will remove all of the ignore conditions of the specified dependency
  • `@dependabot unignore <dependency name> <ignore condition>` will remove the ignore condition of the specified dependency and ignore conditions

dependabot[bot] commented 3 months ago

This pull request was built based on a group rule. Closing it will not ignore any of these versions in future pull requests.

To ignore these dependencies, configure ignore rules in dependabot.yml
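
As an illustration, ignore rules for the two packages in this PR might look like the fragment below. This is a sketch under assumptions: the `pip` ecosystem, the root directory, the weekly schedule, and the specific ignore conditions are all hypothetical; only the group name `dependencies` and the package names come from this pull request.

```yaml
# dependabot.yml (sketch; ecosystem, directory, schedule, and ignore
# conditions are assumptions, not taken from this repository)
version: 2
updates:
  - package-ecosystem: "pip"
    directory: "/"
    schedule:
      interval: "weekly"
    groups:
      dependencies:
        patterns:
          - "*"
    ignore:
      # Skip minor-version bumps of trafilatura
      - dependency-name: "trafilatura"
        update-types: ["version-update:semver-minor"]
      # Skip a specific version range of news-please
      - dependency-name: "news-please"
        versions: ["1.6.x"]
```

Rules placed in the configuration file persist across pull requests, whereas closing a grouped PR, as noted above, does not record any ignore condition.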