adbar/trafilatura
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0
3.67k stars, 263 forks
issues
HTML_TAG_MAPPING error during scrape
#701
beefyandbeef
closed
1 month ago
2
prepare v1.12.2
#700
adbar
closed
2 months ago
1
update docs
#699
adbar
closed
2 months ago
1
Docs: add page explaining how to run tests
#698
adbar
opened
2 months ago
0
Downloads: add support to switch between proxies
#697
adbar
opened
2 months ago
0
Empty Results When Using Spider Function with Category URL
#696
felipehertzer
opened
2 months ago
5
Link on the quickstart page to the overview notebook is broken
#695
cdfuller
closed
2 months ago
1
metadata: review and lint code
#694
adbar
closed
2 months ago
1
ImportError: lxml.html.clean module is now a separate project
#693
regstuff
closed
2 months ago
2
Javascript port of all 35 files
#692
vtempest
closed
2 months ago
1
maintenance: make compression libraries optional
#691
adbar
closed
2 months ago
1
Add max_sitemaps parameter to sitemap_search
#690
felipehertzer
closed
2 months ago
2
build(deps): bump the dependencies group with 4 updates
#689
dependabot[bot]
closed
2 months ago
1
Javascript Version has landed. 🚀
#688
vtempest
closed
1 month ago
3
spider: relax strict parameter for link extraction
#687
adbar
closed
3 months ago
1
extraction fix: ValueError in table spans
#685
adbar
closed
3 months ago
1
Added prune xpath to spider
#684
felipehertzer
closed
3 months ago
9
Add SOCKS Proxy support
#682
gremid
closed
3 months ago
8
ValueError in xml
#681
Honesty-of-the-Cavernous-Tissue
closed
3 months ago
3
Crawler doesn't extract any links from Google Cloud documentation website
#680
Guthman
closed
3 months ago
6
prepare version 1.12.1
#679
adbar
closed
3 months ago
1
Fixed incorrect variable passed to extract_metadata
#678
jpigla
closed
3 months ago
2
CLI: review code, add types and tests
#677
adbar
closed
3 months ago
1
Remove deprecations (mostly CLI)
#676
adbar
closed
1 month ago
0
crawler: add params class
#675
adbar
closed
3 months ago
1
maintenance: simplify link discovery
#674
adbar
closed
3 months ago
1
spider: restrict search to site section targeted by input URL
#673
adbar
closed
3 months ago
1
spider: restrict search to given URL pattern
#672
adbar
closed
3 months ago
0
trafilatura version > 1.10.0 doesnt fetch images
#670
rkiacnhg
closed
3 months ago
3
build(deps): bump the dependencies group with 2 updates
#669
dependabot[bot]
closed
3 months ago
1
robust element deletion: fix AttributeError
#668
adbar
closed
3 months ago
1
AttributeError in prune_unwanted_sections
#667
Honesty-of-the-Cavernous-Tissue
closed
3 months ago
3
How can I set the proxy IP port and userAgent to avoid the web anti-crawler mechanism?
#666
coderwpf
closed
3 months ago
2
table fix: maximum number of header columns
#665
adbar
closed
4 months ago
1
prepare v1.12.0
#664
adbar
closed
4 months ago
1
feat(cli/lib): Add tqdm based progress bar as an option
#663
chitralverma
opened
4 months ago
1
Bug or feature, I'm not sure!
#662
szj2ys
closed
4 months ago
1
Investigate spacing in element tails
#661
adbar
opened
4 months ago
3
Faulty extraction for very short documents
#660
Psynbiotik
opened
4 months ago
4
Duplicating sections, removing spaces between words, simple example
#659
nthomas-whistic
closed
4 months ago
0
table fix: MemoryError & ValueError during conversion to text
#658
adbar
closed
4 months ago
3
MemoryError in table conversion
#657
Honesty-of-the-Cavernous-Tissue
closed
4 months ago
2
formatting & markdown fix: add newlines
#656
adbar
closed
4 months ago
1
XML-TEI: replace RelaxNG by DTD, remove pickle, and update
#655
adbar
closed
4 months ago
0
images fix: use a length threshold on src attribute
#654
adbar
closed
4 months ago
1
extraction: review link and structure checks
#653
adbar
closed
4 months ago
1
extraction: improve justext fallback
#652
adbar
closed
4 months ago
1
Extraction with `include_images=True` takes too much time
#651
Honesty-of-the-Cavernous-Tissue
closed
4 months ago
3
Add magic_html to benchmarks
#650
dantetemplar
closed
3 weeks ago
2
CLI fix: markdown format should trigger include_formatting
#649
adbar
closed
4 months ago
1