adbar trafilatura issues

adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

https://trafilatura.readthedocs.io

Apache License 2.0

3.66k stars, 262 forks

issues

Performance bottleneck in `prune_unwanted_nodes` causing 200ms per call

#750 thsunkid opened 1 day ago
1
Review input type for `is_probably_readerable()` function

#749 adbar opened 2 days ago
0
type hinting: add remaining types and integrate into CI

#748 adbar closed 2 days ago
1
CLI: add 126 exit code for high error ratio

#747 adbar closed 3 days ago
1
Documentation about settings could use examples

#746 georgedorn opened 1 week ago
1
docs: general update

#745 adbar closed 6 days ago
1
cli: also stream URL list gathered from feeds

#744 gremid closed 1 week ago
1
docs: remove from published packages

#743 adbar closed 1 week ago
1
extraction: move max_tree_size parameter to settings.cfg

#742 adbar closed 1 week ago
1
Extraction: move `max_tree_size` to config file

#741 adbar closed 1 week ago
0
setup: explicit exports through `__all__`

#740 adbar closed 2 weeks ago
1
Extracting full text from an URL returns None

#739 vrnch opened 2 weeks ago
2
Explicitly and fully support type hinting

#738 adbar closed 2 days ago
0
build(deps): bump the dependencies group with 5 updates

#737 dependabot[bot] closed 3 weeks ago
1
downloads: cleaner urllib3 code

#736 adbar closed 3 weeks ago
1
downloads: better urllib3 setup

#735 adbar closed 3 weeks ago
0
CLI downloads: use all information in settings file

#734 adbar closed 3 weeks ago
1
Downloads: fully use information from both `config` and `options` variables

#733 adbar closed 3 weeks ago
0
CLI downloads: make sure all user-specified options are used

#732 andyskipper closed 3 weeks ago
4
evaluation: review data, update packages, add magic_html

#731 adbar closed 2 weeks ago
1
extraction: deprecate no_fallback and as_dict parameters

#730 adbar closed 3 weeks ago
1
`bare_extraction()`: deprecate `as_dict` parameter

#729 adbar closed 3 weeks ago
0
typing: fix mypy errors

#728 adbar closed 1 month ago
1
simplify trim() function

#727 adbar closed 1 month ago
1
Focused crawler returns 404 response for robots.txt and stops crawling

#726 Guthman closed 1 month ago
1
`extract()`: replace `no_fallback` argument by `fast`

#725 adbar closed 3 weeks ago
0
downloads: remove `decode` argument in `fetch_url()`

#724 adbar closed 1 month ago
1
refactoring: add type hints

#723 adbar closed 1 month ago
1
Deprecate `fetch_url(decode=False)`

#722 adbar closed 1 month ago
0
fix: more robust mapping for conversion to HTML

#721 adbar closed 1 month ago
1
Review HTML element list and conversion

#720 adbar opened 1 month ago
0
setup: set `__all__` in `__init__.py`

#718 adbar closed 2 weeks ago
0
fix: robust encoding in options.source

#717 adbar closed 1 month ago
1
breaking: remove deprecated functions and args

#716 adbar closed 1 month ago
1
setup: use pyproject.toml file

#715 adbar closed 1 month ago
1
logging: better debug messages in main_extractor

#714 adbar closed 1 month ago
1
setup: deprecate current GUI

#713 adbar closed 1 month ago
1
setup: use `pyproject.toml` file

#712 adbar closed 1 month ago
0
Use rst link instead of markdown link in `docs/index.html`

#711 nzw0301 closed 1 month ago
1
metadata: more robust URL extraction

#710 adbar closed 1 month ago
1
maintenance: deprecate 3.6 & 3.7 and simplify code base

#709 adbar closed 1 month ago
1
maintenance: remove superfluous RuntimeError catch

#708 adbar closed 1 month ago
1
fix: set options.source before raising error on empty doc tree

#707 rgeronimi closed 1 month ago
2
build(deps): bump the dependencies group with 5 updates

#706 dependabot[bot] closed 1 month ago
1
Trafilatura crashing due to `options` variable not backfilled yet

#705 rgeronimi closed 1 month ago
1
extract function runs indefinitely on large HTML body content

#704 hitesh1997 closed 1 month ago
1
Download multiple urls with download timeout

#703 vodkaslime closed 1 week ago
2
I can't extract main content from this HTML, could anyone help me?

#702 CNXDZS closed 1 month ago
1
HTML_TAG_MAPPING error during scrape

#701 beefyandbeef closed 1 month ago
2
prepare v1.12.2

#700 adbar closed 2 months ago
1