adbar opened this issue 4 years ago
Extraction bugs in text and metadata can be listed here, as in https://github.com/adbar/htmldate/issues/8, where issues specifically related to dates should be reported.
For details see below.
Words are getting smashed together on this page:
I looked into the extraction code here a bit. The date here is inside a span, which gets stripped, and then the date becomes the tail of the header. All of the whitespace (which includes a newline) gets lost, and the tail is then appended directly to the header. I'm not sure whether the best fix would be to include a space between the header node's text and tail when they get extracted, or to look for newlines in the text and somehow respect them. It also looks like some of the lxml functions just strip whitespace automatically when you access the "text" and "tail" attributes.
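A minimal sketch of that mechanism with hypothetical markup (the explicit `strip_tags` call and the manual join stand in for what the extraction code does):

```python
from lxml import etree

# hypothetical markup: the date follows the heading, wrapped in a span
fragment = etree.fromstring("<div><h2>Headline</h2>\n<span>2021-01-01</span></div>")

# strip_tags removes the span markup; its content becomes the heading's tail
etree.strip_tags(fragment, "span")
h2 = fragment[0]
print(repr(h2.tail))  # '\n2021-01-01'

# joining text and stripped tail without a separator smashes the words together
print((h2.text or "") + (h2.tail or "").strip())  # Headline2021-01-01
```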
I didn't dig into this one, but I'm guessing it's something similar as the first case. The webpage relies on whitespace that gets stripped by the extraction algorithm.
Yes, I think the issues in the document you mention are related to deleted `<span>` sections.
Hey, this is a great library. I was ready to subscribe to a service just to get what this does for me.
For the 30th page I extracted, https://thehill.com/homenews/senate/594044-sen-lujan-to-return-to-senate-in-time-to-vote-for-supreme-court-nominee, Trafilatura 1.0.0 returned only 150 chars of text:
© Greg Nash Luján planning return to Senate in time to vote for Supreme Court nominee By Olafimihan Oshin - 02/13/22 12:54 PM EST Skip to main content
I downloaded the HTML source (lujan.txt) and confirmed it does have the article text in it (starting with "Sen. Ben Ray Luján").
I decided to try the external fallback "Readability". I started Python in my trafilatura container and ran this code:
```python
import lxml.etree
import trafilatura.external
from lxml.html import parse

with open('lujan.html') as f:
    doc = parse(f).getroot()
x = trafilatura.external.try_readability(doc, "file:///lujan.html")
print(lxml.etree.tostring(x, pretty_print=True, encoding="unicode"))
```
But that just gave me a bunch of XML/JavaScript that didn't even have the main text in it.
Perhaps a fallback could be added: when the extracted text is small and large contiguous blocks of text remain unextracted, grab those instead?
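Something along these lines, as a rough sketch (`html2txt` dumps all text content from the page; the 0.5 ratio is an arbitrary threshold):

```python
import trafilatura

def extract_with_fallback(html, min_ratio=0.5):
    extracted = trafilatura.extract(html) or ""
    # html2txt returns all the text on the page, extracted or not
    full_text = trafilatura.html2txt(html)
    # if the main extraction misses most of the page text, return the raw dump
    if full_text and len(extracted) < min_ratio * len(full_text):
        return full_text
    return extracted
```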
Hi @karlkovaciny, the cutting-edge version from the repository is slightly better: it outputs the article but still includes garbled JavaScript. That's definitely a case to watch for.
EDIT: for the archived version of the page I now get the same problem as you.
Suggested in #208:
Hey @adbar
I'm having a problem with a few publications, like HuffPost, where the metadata is not extracted correctly.
But if I change the line below to `tree = fromstring(htmlobject.encode('utf8'), parser=HTML_PARSER)`, it starts to work.
What do you think?
Example: https://bit.ly/3PuvL26 Other example: https://bit.ly/3ai8zEf
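For context, lxml only honours the charset declared inside the document when it parses bytes; a str is assumed to be already decoded. A minimal illustration (hypothetical markup; `fromstring` is lxml's, and `HTML_PARSER` in the line above stands for trafilatura's parser object):

```python
from lxml.html import fromstring

htmlobject = '<html><head><meta charset="utf-8"/></head><body>Olá</body></html>'

# parsing the already-decoded str: the meta charset declaration has no effect
tree = fromstring(htmlobject)

# parsing bytes: the parser reads and applies the declared charset itself
tree = fromstring(htmlobject.encode('utf8'))
```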
Hi @felipehertzer, I don't think I can reproduce the bug; which metadata fields do you mean exactly?
Hello,
URL tested: https://orientxxi.info/fa
Trafilatura version: 1.6.2

```python
import trafilatura

downloaded = trafilatura.fetch_url("https://orientxxi.info/fa")
trafilatura.extract(downloaded, output_format="json")
```
I am wondering why the title is not the one provided in the HTML `<title>` element. Trafilatura returns a long sentence:
{"title": "به زبانهای دیگر Yémen. Une paix qui se fait attendre Laurent Bonnefoy · 21 septembre أوسلو، نموذج للفشل دانيال ليفي · 21 أيلول (سبتمبر) موقع “أوريان 21” يدعوكم للاحتفال بعيد ميلاده العاشر! · 20 أيلول (سبتمبر) Petroleum. Turkey vs. Iraq, but the Kurds are Collateral Victims Benoît Drevet · 20 September El doble estándar de Egipto para acoger a sus “huéspedes” sudaneses Séverine Evanno · 1ro de septiembre Khaled El Qaisi, colpevole di Palestina Cecilia Dalla Negra · 18 settembre", "author": null,....
Thanks!
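A possible workaround, as a sketch, is to read the `<title>` element directly with lxml (reusing `downloaded` from the snippet above):

```python
from lxml import html

tree = html.fromstring(downloaded)
# bypass the title heuristics and take the document title verbatim
title = tree.findtext('.//title')
```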
Hi @kinoute, I think it could be because of a tag mismatch (malformed HTML) just before the text segments: `<h2 class="indication">به زبانهای دیگر</h3>`
The mismatch implies that everything that follows is parsed as part of the title.
Please note that the extraction doesn't work as well on homepages in general.
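For what it's worth, a quick way to see what the recovering parser makes of such a mismatch (hypothetical minimal markup; the exact recovery behaviour depends on libxml2):

```python
from lxml import html

snippet = '<body><h2 class="indication">Title</h3><p>Following text</p></body>'
tree = html.fromstring(snippet)
# inspect where the parser puts the following content: if it ends up
# nested under the heading, it is treated as part of the title
print(html.tostring(tree, pretty_print=True, encoding='unicode'))
```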
Hi @adbar,
I ran into extraction issues.
URL: https://microsoft.github.io/autogen/docs/Use-Cases/enhanced_inference/
Output: ! d o c t y p e h t m l >
I also tested the html2txt feature and it didn't work any better. It gave me this output:
h t m l c l a s s = " d o c s - v e r s i o n - c u r r e n t " l a n g = " e n " d i r = " l t r " >
I run the HTML scraping outside of trafilatura. I confirmed that we are getting all of the HTML, but it seems something in the HTML trips up the extraction.
I used trafilatura.extract() and passed the HTML code as a string into the function. I tested different settings for the favor_recall and favor_precision arguments; they didn't change the output in any significant way. I also tested the trafilatura.baseline() function and it yielded similar results.
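For reference, the call reduces to something like this (a minimal repro sketch; `html` stands for the page source string):

```python
import trafilatura

# on the affected setup this prints "! d o c t y p e h t m l >"
text = trafilatura.extract(html, favor_recall=True)
print(text)
```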
@sepsi77 There are LXML-related issues on MacOS M1, M2 etc. (see also https://github.com/adbar/trafilatura/issues/166). Is it the platform you're using or can you provide more details?
@adbar yes, I'm on M1 MacBook
Did you try building LXML from source?
I can't seem to get it to work. I'm new to this level of tweaking the system. The installation fails because of missing precompiled Cython files, and running it with the --without-cython flag doesn't work either.
RuntimeError: ERROR: Trying to build without Cython, but pre-generated 'src/lxml/etree.c' is not available (to ignore this error, pass --without-cython or set environment variable WITHOUT_CYTHON=true).
I think I'll just move the script into a Docker container and see if that helps.
@adbar Thanks for your answer on my previous case. I have another one! Doing something like:
```python
trafi_extraction = trafilatura.extract(
    response.decode(errors='ignore'),
    output_format='json',
    include_images=False,
    date_extraction_params={
        'extensive_search': True,
        'original_date': True,
        'min_date': EARLIEST_VALID_DATE,
    },
    include_comments=False,
)
trafilatura_data = trafi_extraction and json.loads(trafi_extraction)
```
Returns
json.decoder.JSONDecodeError: Invalid \escape: line 1 column 2947 (char 2946)
For this given URL : http://sport.kurganobl.ru/8980.html
trafi_extraction contains:
{"title": null, "author": null, "hostname": null, "date": "2016-12-12", "categories": "", "tags": "", "fingerprint": "6920faf8766bf202", "id": null, "license": null, "comments": null, "raw_text": ", 8 . 350 35 . 1000 1000 , . 21 8 . 7 300 , 01:08:25, . ̀ . 1000 1000 div> \n \n \n \n - \n -2016 \n \n \n \n - \n \n \n \n - \n \n \n \n \n , , ! - \n \n \r\n8\r\n\r\n\r\n1000\r\n1000\r\n \n \n \n \n \n , - \n \n \n II - \n \n - \n \n \n ! \n \n \n ++ = \n \n \n \n \n \r\n8\r\n \r\n\r\n1000\r\n1000\r\n ZauraLife \n 150 \n \n 76- \n 3 : \n 2015 \n \n \n \n \n \n \n \n \n \n \n \n \n \n - \n \n \r\n8\r\n \r\n\r\n1000\r\n1000\r\n - 2016 \n \n \n \n \n \n \n \n ! \n \n \n \n \n - 2016 \n \n \n \n ! \n \n \n \n \n \n \r\n8\r\n \r\n\r\n1000\r\n1000\r\n \n \n \n \n \n - \n \n \n \n \n - \n - 2015 ? \n \n \n \n ( ) \n \n \n \n \n \r\n8\r\n \r\n\r\n1000\r\n1000\r\n \n \n \n \n \n - \n \n \n \n \n II - \n - \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n , , ! \n \n \n \n , \n 2015 \n \n \n 2016 \n \n , , ! \n \n \n \n \n - \n \n \n \n ! \n \n \n \n \n \n . - . \n \n \n <\r\n\r\n1000\r\n1000\r\na href=\"8444.html\" title=\" \"> \n \n \n \n - - 2015 \n \n \n \n \n ZauraLife \n \n \n \n \n \n \n \n \n \n \n XXVII \n \n \n \n ! \n \n \n \n \n XXVII \n \n \n \n - - \n \n \n - \n \n 26 \n ? \n \n ! 2016 \n 38", "text": ", 8 . 350 35 .\n1000 1000 , . 21 8 .\n7 300 , 01:08:25, .\ǹ .\n1000 1000 div>\n-\n-2016\n-\n-\n, , ! -\n8\n1000\n1000\n, -\nII -\n-\n!\n++ =\n8\n1000\n1000\nZauraLife\n150\n76-\n3 :\n2015\n-\n8\n1000\n1000\n- 2016\n!\n- 2016\n!\n8\n1000\n1000\n-\n-\n- 2015 ?\n( )\n8\n1000\n1000\n-\nII -\n-\n, , !\n,\n2015\n2016\n, , !\n-\n!\n. - .\n<\n1000\n1000\na href=\"8444.html\" title=\" \">\n- - 2015\nZauraLife\nXXVII\n!\nXXVII\n- -\n-\n26\n?\n! 2016\n38", "language": null, "image": null, "pagetype": null, "source": null, "source-hostname": null, "excerpt": null}
Edit: Right now I am handling this with this method:
```python
import re

def fix_invalid_escapes(self, s):
    # this regex matches a backslash not followed by a valid JSON escape
    return re.sub(r'\\(?![/bfnrt"\\u])', r'\\\\', s)
```
But I think maybe Trafilatura could handle this natively? (I'm not even sure my fix is enough/good)
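A quick check that the workaround unblocks `json.loads` (hypothetical minimal input, not the real page output):

```python
import json
import re

broken = '{"raw_text": "a \\x stray escape"}'  # contains the invalid sequence \x
fixed = re.sub(r'\\(?![/bfnrt"\\u])', r'\\\\', broken)
print(json.loads(fixed))  # {'raw_text': 'a \\x stray escape'}
```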
Hi @kinoute, there must be something wrong in the way you encode or decode the HTML response; I cannot reproduce the bug:
`trafilatura -u "http://sport.kurganobl.ru/8980.html" --json`
works on my computer.
@sepsi77 Please note that brew can now be used to install Trafilatura on MacOS in a seamless way: https://formulae.brew.sh/formula/trafilatura
Thanks @adbar, using brew to install trafilatura fixed the problem.
Hi there, I'm not sure this is the right thread, but here's the problem I'm having. Some sites have more than one `<article>` node for a single article: https://conselhos-desportivos.decathlon.pt/guia-de-treino-para-gluteos
The XPath that extracts the text is `(.//article)[1]`, so it only extracts the first paragraph. Do you have a solution in mind? Do you think modifying the XPath to retrieve all `<article>` elements and iterating over them to concatenate them is a good solution?
Hi @hugoobauer, this problem is also mentioned in #432. The problem with taking all article elements is that sometimes they are related content and not main content (e.g. a list of teasers at the end of a page).
IMHO this is an improper use of the `<article>` tag, but I'm not sure what to do about it: the XPath would have to be changed or a new heuristic on content length added.
Hi @adbar, I completely agree that this is a misuse of `<article>`. I'm looking for a way to extract all the "relevant" content from a page, even if I take a bit too much. In this case, retrieving info at the bottom of the page that's more or less related to the article bothers me less than missing the majority of an article's content.
So I made a little POC to test a solution: I changed `(.//article)[1]` to `(.//article)` and adapted the selection loop:
```python
for expr in BODY_XPATH:
    # select tree if the expression has been found
    try:
        subtrees = tree.xpath(expr)
        if len(subtrees) > 1:  # and favor_recall=True ?
            new_subtree = Element(subtrees[0].tag)
            for _subtree in subtrees:
                for child in _subtree:
                    # if len(' '.join(child.itertext()).strip()) > MIN_EXTRACTED_SIZE ?
                    new_subtree.append(child)
            subtree = new_subtree
        else:
            subtree = subtrees[0]
    except IndexError:
        continue
```
If there's only one item, it's the same as before. Otherwise, I create a new node with the same tag (article in this case) and insert into it each child of each of the matched nodes. In addition, we could check whether the favor_recall option is enabled, so that this isn't done by default, and use the MIN_EXTRACTED_SIZE value to keep only those elements that are long enough. What do you think? I've only been studying the repository for a short time, so I may have missed something.
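For illustration, the two commented checks above could be wired in roughly like this (a sketch; treating `favor_recall` as a boolean flag and borrowing `MIN_EXTRACTED_SIZE` from trafilatura's settings are my assumptions about the integration):

```python
subtrees = tree.xpath(expr)
if favor_recall and len(subtrees) > 1:
    new_subtree = Element(subtrees[0].tag)
    for _subtree in subtrees:
        for child in _subtree:
            # only keep children that carry enough text of their own
            if len(' '.join(child.itertext()).strip()) > MIN_EXTRACTED_SIZE:
                new_subtree.append(child)
    subtree = new_subtree
```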
@hugoobauer Your idea looks good. The length heuristic would have to run on whole `<article>` elements and I'm not sure how.
In any case, feel free to draft a pull request for this or for another issue. You can add a test case somewhere in tests/unit_tests.py, and the tests have to pass (realworld_tests.py is also relevant here). You can also check the benchmark in the tests/ folder to see if performance improves.
Okay great, I will work on a PR soon
Hi @adbar
I am having an issue with this URL - https://www.energyvault.com/about#leaders. I am not able to extract the text from it. Here's the code I am using:
```python
def get_text(url=None, html_text=None):
    from trafilatura import bare_extraction, fetch_url
    if not url and not html_text:
        raise ValueError("Either 'url' or 'html_text' must be provided")
    if html_text:
        html_string = html_text
    else:
        url_response = fetch_url(url)
        html_string = url_response
    extracted_data = bare_extraction(html_string,
                                     include_links=True,
                                     include_formatting=True,
                                     include_images=True,
                                     include_tables=True)
    doc_text = extracted_data["text"] if extracted_data else None
    return doc_text


if __name__ == "__main__":
    url = "https://www.energyvault.com/about#leaders"
    text = get_text(url=url)
    print(text)
```
When I debugged it a little, I found it throws an exception with the following traceback:
```
Traceback (most recent call last):
  File "/lib/python3.11/site-packages/trafilatura/core.py", line 921, in bare_extraction
    document = extract_metadata(tree, url, date_extraction_params, no_fallback, author_blacklist)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/trafilatura/metadata.py", line 535, in extract_metadata
    metadata.date = find_date(tree, **date_config)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/htmldate/core.py", line 986, in find_date
    return converted or search_page(htmlstring, options)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/htmldate/core.py", line 724, in search_page
    dateobject = datetime(int(bestmatch[1]), int(bestmatch[2]), 1)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: month must be in 1..12
```
Let me know if you need any more info.
Hi @Sang12-2017-18, I cannot reproduce the bug as such but something is odd with this webpage. Do you use the latest version of the trafilatura and htmldate packages? If so, please file an issue on the htmldate repository.
Hi @adbar Thank you for the quick response. I have the latest versions of trafilatura (v1.8.0), and htmldate (v1.8.0). I'll surely file an issue in the htmldate repository. Before that, I wanted to know one thing - for my requirement, extracting the date published from the web page is not necessary. I'm quite okay if the date comes as None, but I want other fields like text, author etc. Is there any configuration option available such that we can exclude dates while extracting, but keep other metadata?
@Sang12-2017-18 So far there is no such option. I still cannot reproduce the error, how did you get the traceback?
@Sang12-2017-18 the bug is now fixed in Htmldate version 1.8.1. As for the option to bypass metadata extraction I'm going to add it to the to do list.
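In the meantime, a possible stopgap (my suggestion, not an official option) is to disable htmldate's brute-force text search, which is the code path that raised the ValueError above; `extensive_search` is a regular htmldate parameter that can be passed through `date_extraction_params`:

```python
extracted_data = bare_extraction(
    html_string,
    include_links=True,
    include_formatting=True,
    include_images=True,
    include_tables=True,
    date_extraction_params={"extensive_search": False},
)
```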
The web page I am trying to parse is not a modern one and uses the Russian language. I am not sure it is worth the effort to support parsing such pages, but I report it here anyway.
Screenshot of the absent content (including content on other tabs) I am interested in: [screenshot not reproduced here]
I tried with the CLI:

```
trafilatura -u "https://engelsky--sar.sudrf.ru/modules.php?name=sud_delo&srv_num=1&name_op=case&case_id=256803033&case_uid=6faab311-51a0-4d06-aa03-6293266f991f&result=0&delo_id=1540005&new=" --no-comments --recall --xmltei
```
I have mostly tested trafilatura on a set of English, German and French web pages I had run into by surfing or during web crawls. There are definitely further web pages and cases in other languages for which the extraction doesn't work so far. Corresponding bug reports can either be filed as a list in an issue like this one or in the code as XPath expressions in xpaths.py (see the BODY_XPATH and COMMENTS_XPATH lists). Thanks!