adbar trafilatura issues

adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

https://trafilatura.readthedocs.io

Apache License 2.0

3.67k stars, 263 forks

issues

HTML_TAG_MAPPING error during scrape

#701 beefyandbeef closed 1 month ago
2
prepare v1.12.2

#700 adbar closed 2 months ago
1
update docs

#699 adbar closed 2 months ago
1
Docs: add page explaining how to run tests

#698 adbar opened 2 months ago
0
Downloads: add support to switch between proxies

#697 adbar opened 2 months ago
0
Empty Results When Using Spider Function with Category URL

#696 felipehertzer opened 2 months ago
5
Link on the quickstart page to the overview notebook is broken

#695 cdfuller closed 2 months ago
1
metadata: review and lint code

#694 adbar closed 2 months ago
1
ImportError: lxml.html.clean module is now a separate project

#693 regstuff closed 2 months ago
2
Javascript port of all 35 files

#692 vtempest closed 2 months ago
1
maintenance: make compression libraries optional

#691 adbar closed 2 months ago
1
Add max_sitemaps parameter to sitemap_search

#690 felipehertzer closed 2 months ago
2
build(deps): bump the dependencies group with 4 updates

#689 dependabot[bot] closed 2 months ago
1
Javascript Version has landed. 🚀

#688 vtempest closed 1 month ago
3
spider: relax strict parameter for link extraction

#687 adbar closed 3 months ago
1
extraction fix: ValueError in table spans

#685 adbar closed 3 months ago
1
Added prune xpath to spider

#684 felipehertzer closed 3 months ago
9
Add SOCKS Proxy support

#682 gremid closed 3 months ago
8
ValueError in xml

#681 Honesty-of-the-Cavernous-Tissue closed 3 months ago
3
Crawler doesn't extract any links from Google Cloud documentation website

#680 Guthman closed 3 months ago
6
prepare version 1.12.1

#679 adbar closed 3 months ago
1
Fixed incorrect variable passed to extract_metadata

#678 jpigla closed 3 months ago
2
CLI: review code, add types and tests

#677 adbar closed 3 months ago
1
Remove deprecations (mostly CLI)

#676 adbar closed 1 month ago
0
crawler: add params class

#675 adbar closed 3 months ago
1
maintenance: simplify link discovery

#674 adbar closed 3 months ago
1
spider: restrict search to site section targeted by input URL

#673 adbar closed 3 months ago
1
spider: restrict search to given URL pattern

#672 adbar closed 3 months ago
0
trafilatura version > 1.10.0 doesn't fetch images

#670 rkiacnhg closed 3 months ago
3
build(deps): bump the dependencies group with 2 updates

#669 dependabot[bot] closed 3 months ago
1
robust element deletion: fix AttributeError

#668 adbar closed 3 months ago
1
AttributeError in prune_unwanted_sections

#667 Honesty-of-the-Cavernous-Tissue closed 3 months ago
3
How can I set the proxy IP port and userAgent to avoid the web anti-crawler mechanism?

#666 coderwpf closed 3 months ago
2
table fix: maximum number of header columns

#665 adbar closed 4 months ago
1
prepare v1.12.0

#664 adbar closed 4 months ago
1
feat(cli/lib): Add tqdm based progress bar as an option

#663 chitralverma opened 4 months ago
1
Bug or feature, I'm not sure!

#662 szj2ys closed 4 months ago
1
Investigate spacing in element tails

#661 adbar opened 4 months ago
3
Faulty extraction for very short documents

#660 Psynbiotik opened 4 months ago
4
Duplicating sections, removing spaces between words, simple example

#659 nthomas-whistic closed 4 months ago
0
table fix: MemoryError & ValueError during conversion to text

#658 adbar closed 4 months ago
3
MemoryError in table conversion

#657 Honesty-of-the-Cavernous-Tissue closed 4 months ago
2
formatting & markdown fix: add newlines

#656 adbar closed 4 months ago
1
XML-TEI: replace RelaxNG by DTD, remove pickle, and update

#655 adbar closed 4 months ago
0
images fix: use a length threshold on src attribute

#654 adbar closed 4 months ago
1
extraction: review link and structure checks

#653 adbar closed 4 months ago
1
extraction: improve justext fallback

#652 adbar closed 4 months ago
1
Extraction with `include_images=True` takes too much time

#651 Honesty-of-the-Cavernous-Tissue closed 4 months ago
3
Add magic_html to benchmarks

#650 dantetemplar closed 3 weeks ago
2
CLI fix: markdown format should trigger include_formatting

#649 adbar closed 4 months ago
1
