adbar / trafilatura

Python & command-line tool to gather text on the Web: Crawling & scraping, content extraction, metadata. TXT, Markdown, CSV & XML output.
https://trafilatura.readthedocs.io
Apache License 2.0

Sometimes HTML tags remain in the output string #627

Open masylum opened 3 weeks ago

masylum commented 3 weeks ago

I'm using:

from trafilatura import extract

# `input` holds the HTML string shown below (the name shadows Python's built-in input())
output = extract(
    input,
    include_comments=False,
    include_tables=False,
    no_fallback=True,
)

INPUT:

<p>It&rsquo;s 2AM. You&rsquo;re paged to respond to a failing set of components that you are the Subject Matter Expert (SME) for. Sleepy, you load up the playbook for when the <code>SplineReticulatorBlocked</code> alert has gone off, and start executing. The Incident Commander (IC) is vaguely aware of what you are doing, and checks in now and then.</p>

OUTPUT:

<p>It&rsquo;s 2AM. You&rsquo;re paged to respond to a failing set of components that you are the Subject Matter Expert (SME) for. Sleepy, you load up the playbook for when the <code>SplineReticulatorBlocked</code> alert has gone off, and start executing. The Incident Commander (IC) is vaguely aware of what you are doing, and checks in now and then.</p>

For some other strings, the HTML is removed correctly:

INPUT:

<p>Much of my current job is maintaining and enhancing control planes for <a href="https://www.heroku.com/managed-data-services">Heroku&rsquo;s managed data services</a>. This post explores three patterns used to reduce operational burden and increase system safety and resiliency: <strong>state machines</strong> (and associated state-transition tables), <strong>transducers</strong> and <strong>re-entrant and idempotent</strong> operations.</p>

OUTPUT:

Much of my current job is maintaining and enhancing control planes for Heroku’s managed data services. This post explores three patterns used to reduce operational burden and increase system safety and resiliency: state machines (and associated state-transition tables), transducers and re-entrant and idempotent operations.

adbar commented 3 weeks ago

Thanks for the detailed description; it does seem to be a bug.

adbar commented 1 hour ago

@masylum I need more context to reproduce the bug. The following HTML is not enough: with it, there is no HTML code in the output.

Could you please try to make the bug reproducible?

<html>
<body>
<article>
<p>It&rsquo;s 2AM. You&rsquo;re paged to respond to a failing set of components that you are the Subject Matter Expert (SME) for. Sleepy, you load up the playbook for when the <code>SplineReticulatorBlocked</code> alert has gone off, and start executing. The Incident Commander (IC) is vaguely aware of what you are doing, and checks in now and then.</p>
</article>
</body>
</html>
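
One way to make a report like this reproducible is a short script that runs the extraction and lists any markup left in the result. The following is only a sketch: `leftover_tags` and `check_extraction` are hypothetical helper names, not part of trafilatura, and the `extract` arguments are copied from the original report (they may be named differently in other trafilatura versions).

```python
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Collects the names of all start tags seen while parsing."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def leftover_tags(text):
    """Return the names of HTML start tags still present in `text`."""
    collector = _TagCollector()
    collector.feed(text)
    collector.close()
    return collector.tags


def check_extraction(html):
    """Run trafilatura on `html` and report leftover tags (hypothetical helper).

    Assumes trafilatura is installed and accepts the arguments used in the
    report; the import is local so the tag check works without it.
    """
    from trafilatura import extract

    output = extract(
        html,
        include_comments=False,
        include_tables=False,
        no_fallback=True,
    )
    # extract() may return None when nothing is extracted
    return output, leftover_tags(output or "")


# Shortened excerpts of the two outputs quoted in the report
REPORTED_BAD = "<p>It&rsquo;s 2AM. <code>SplineReticulatorBlocked</code> alert</p>"
REPORTED_GOOD = "Much of my current job is maintaining control planes."

if __name__ == "__main__":
    print(leftover_tags(REPORTED_BAD))
    print(leftover_tags(REPORTED_GOOD))
```

Calling `check_extraction` on the full page source, rather than on a single paragraph, and posting both the input and the reported tags should show whether the problem depends on the surrounding document.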