adbar/trafilatura
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0
3.66k stars, 262 forks
issues
Performance bottleneck in `prune_unwanted_nodes` causing 200ms per call
#750
thsunkid
opened
1 day ago
1
Review input type for `is_probably_readerable()` function
#749
adbar
opened
2 days ago
0
type hinting: add remaining types and integrate into CI
#748
adbar
closed
2 days ago
1
CLI: add 126 exit code for high error ratio
#747
adbar
closed
3 days ago
1
Documentation about settings could use examples
#746
georgedorn
opened
1 week ago
1
docs: general update
#745
adbar
closed
6 days ago
1
cli: also stream URL list gathered from feeds
#744
gremid
closed
1 week ago
1
docs: remove from published packages
#743
adbar
closed
1 week ago
1
extraction: move max_tree_size parameter to settings.cfg
#742
adbar
closed
1 week ago
1
Extraction: move `max_tree_size` to config file
#741
adbar
closed
1 week ago
0
setup: explicit exports through `__all__`
#740
adbar
closed
2 weeks ago
1
Extracting full text from an URL returns None
#739
vrnch
opened
2 weeks ago
2
Explicitly and fully support type hinting
#738
adbar
closed
2 days ago
0
build(deps): bump the dependencies group with 5 updates
#737
dependabot[bot]
closed
3 weeks ago
1
downloads: cleaner urllib3 code
#736
adbar
closed
3 weeks ago
1
downloads: better urllib3 setup
#735
adbar
closed
3 weeks ago
0
CLI downloads: use all information in settings file
#734
adbar
closed
3 weeks ago
1
Downloads: fully use information from both `config` and `options` variables
#733
adbar
closed
3 weeks ago
0
CLI downloads: make sure all user-specified options are used
#732
andyskipper
closed
3 weeks ago
4
evaluation: review data, update packages, add magic_html
#731
adbar
closed
2 weeks ago
1
extraction: deprecate no_fallback and as_dict parameters
#730
adbar
closed
3 weeks ago
1
`bare_extraction()`: deprecate `as_dict` parameter
#729
adbar
closed
3 weeks ago
0
typing: fix mypy errors
#728
adbar
closed
1 month ago
1
simplify trim() function
#727
adbar
closed
1 month ago
1
Focused crawler returns 404 response for robots.txt and stops crawling
#726
Guthman
closed
1 month ago
1
`extract()`: replace `no_fallback` argument by `fast`
#725
adbar
closed
3 weeks ago
0
downloads: remove `decode` argument in `fetch_url()`
#724
adbar
closed
1 month ago
1
refactoring: add type hints
#723
adbar
closed
1 month ago
1
Deprecate `fetch_url(decode=False)`
#722
adbar
closed
1 month ago
0
fix: more robust mapping for conversion to HTML
#721
adbar
closed
1 month ago
1
Review HTML element list and conversion
#720
adbar
opened
1 month ago
0
setup: set `__all__` in `__init__.py`
#718
adbar
closed
2 weeks ago
0
fix: robust encoding in options.source
#717
adbar
closed
1 month ago
1
breaking: remove deprecated functions and args
#716
adbar
closed
1 month ago
1
setup: use pyproject.toml file
#715
adbar
closed
1 month ago
1
logging: better debug messages in main_extractor
#714
adbar
closed
1 month ago
1
setup: deprecate current GUI
#713
adbar
closed
1 month ago
1
setup: use `pyproject.toml` file
#712
adbar
closed
1 month ago
0
Use rst link instead of markdown link in `docs/index.html`
#711
nzw0301
closed
1 month ago
1
metadata: more robust URL extraction
#710
adbar
closed
1 month ago
1
maintenance: deprecate 3.6 & 3.7 and simplify code base
#709
adbar
closed
1 month ago
1
maintenance: remove superfluous RuntimeError catch
#708
adbar
closed
1 month ago
1
fix: set options.source before raising error on empty doc tree
#707
rgeronimi
closed
1 month ago
2
build(deps): bump the dependencies group with 5 updates
#706
dependabot[bot]
closed
1 month ago
1
Trafilatura crashing due to `options` variable not backfilled yet
#705
rgeronimi
closed
1 month ago
1
extract function runs indefinitely on large HTML body content
#704
hitesh1997
closed
1 month ago
1
Download multiple urls with download timeout
#703
vodkaslime
closed
1 week ago
2
I can't extract main content from this html,could anyone help me?
#702
CNXDZS
closed
1 month ago
1
HTML_TAG_MAPPING error during scrape
#701
beefyandbeef
closed
1 month ago
2
prepare v1.12.2
#700
adbar
closed
2 months ago
1