adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

List of smaller extraction bugs (text & metadata) #4

Open adbar opened 4 years ago

adbar commented 4 years ago

I have mostly tested trafilatura on a set of English, German and French web pages I had run into by surfing or during web crawls. There are definitely further web pages and cases in other languages for which the extraction doesn't work so far.

Corresponding bug reports can either be filed as a list in an issue like this one or in the code as XPath expressions in xpaths.py (see BODY_XPATH and COMMENTS_XPATH lists).

Thanks!

adbar commented 2 years ago

Extraction bugs affecting text and metadata can be listed here; issues specifically related to dates should be reported at https://github.com/adbar/htmldate/issues/8.

For details see below.

cheezman34 commented 2 years ago

Words are getting smashed together on this page:

https://research.checkpoint.com/2021/a-deep-dive-into-doublefeature-equation-groups-post-exploitation-dashboard/

[Screenshots: in the extracted output, the date is fused directly onto the headline with no separating whitespace]

I looked into the extraction code here a bit. The date here is inside a span, which gets stripped, and then the date becomes the tail of the header. All of the whitespace (which includes a newline) gets lost, and then the tail is just directly appended to the header. I'm not sure if the best strategy to fix would be to include a space between the tail and text of the header node when they get extracted, or maybe to look for newlines in the text and somehow respect them. It looks like some of the lxml stuff just strips whitespace automatically when you access "text" and "tail" attributes.
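A minimal reproduction of the mechanism described above (the markup and strings are made up, simplified from the reported page):

```python
from lxml import html

# Hypothetical reduction of the reported page: the date lives in the tail
# of a <span> that gets stripped during extraction.
node = html.fromstring("<h1>DoubleFeature<span> </span>\nFebruary 11, 2022</h1>")
span = node.find("span")

# Joining the header text and the span's tail after whitespace
# normalization fuses the words together:
fused = (node.text or "").strip() + (span.tail or "").strip()
print(fused)  # DoubleFeatureFebruary 11, 2022

# Inserting a single space between the two parts keeps them apart:
separated = " ".join(filter(None, [(node.text or "").strip(),
                                   (span.tail or "").strip()]))
print(separated)  # DoubleFeature February 11, 2022
```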

[Screenshots: a second page whose extracted text shows the same run-together words]

I didn't dig into this one, but I'm guessing it's something similar as the first case. The webpage relies on whitespace that gets stripped by the extraction algorithm.

adbar commented 2 years ago

Yes, I think the issues in the document you mention are related to deleted <span> sections.

karlkovaciny commented 2 years ago

Hey, this is a great library. I was ready to subscribe to a service just to get what this does for me.

For the 30th page I extracted, https://thehill.com/homenews/senate/594044-sen-lujan-to-return-to-senate-in-time-to-vote-for-supreme-court-nominee, Trafilatura 1.0.0 returned only 150 chars of text:

© Greg Nash Luján planning return to Senate in time to vote for Supreme Court nominee By Olafimihan Oshin - 02/13/22 12:54 PM EST Skip to main content

I downloaded the HTML source (lujan.txt) and confirmed it does have the article text in it (starting with "Sen. Ben Ray Luján").

I decided to try the external fallback "Readability". I started Python in my trafilatura container and ran this code:

import trafilatura.external
from lxml import etree
from lxml.html import parse

with open('lujan.html') as f:
    doc = parse(f).getroot()

x = trafilatura.external.try_readability(doc, "file:///lujan.html")
print(etree.tostring(x, pretty_print=True, encoding="unicode"))

But that just gave me a bunch of XML/JavaScript that didn't even have the main text in it.

Perhaps a fallback could be added that when extracted text is small and there are large continuous blocks of unextracted text, to grab those instead?
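The suggested fallback could be sketched roughly like this (a pure-stdlib illustration, not trafilatura's actual API; the tag sets and class names are made up):

```python
from html.parser import HTMLParser

# Tags that terminate a continuous text block; inline tags like <em> don't.
BLOCK_TAGS = {"p", "div", "section", "article", "li", "br",
              "h1", "h2", "h3", "h4", "h5", "h6", "table"}
SKIP_TAGS = {"script", "style", "noscript"}

class LargestBlockFinder(HTMLParser):
    """Tracks contiguous runs of text and remembers the longest one."""
    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.current = []
        self.best = ""

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.current.append(data.strip())

    def flush(self):
        run = " ".join(self.current)
        if len(run) > len(self.best):
            self.best = run
        self.current = []

def largest_text_block(html_doc):
    finder = LargestBlockFinder()
    finder.feed(html_doc)
    finder.flush()
    return finder.best
```

A wrapper could then compare the length of the extracted text against `len(largest_text_block(html_doc))` and fall back to the big block when the ratio is suspiciously low.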

adbar commented 2 years ago

Hi @karlkovaciny, the cutting-edge version from the repository is slightly better: it outputs the article but still includes garbled JavaScript. That's definitely a case to watch.

EDIT: for the archived version of the page I now get the same problem as you.

adbar commented 2 years ago

Suggested in #208:

felipehertzer commented 2 years ago

Hey @adbar

I'm having a problem with a few publications, such as HuffPost, where the metadata is not extracted correctly. But if I change the line below to tree = fromstring(htmlobject.encode('utf8'), parser=HTML_PARSER), it starts to work. What do you think?

https://github.com/adbar/trafilatura/blob/168e660514a2ced3f7e902cd50476010f33d2337/trafilatura/utils.py#L177

Example: https://bit.ly/3PuvL26 Other example: https://bit.ly/3ai8zEf

adbar commented 2 years ago

Hi @felipehertzer, I don't think I can reproduce the bug, which metadata fields do you mean exactly?

kinoute commented 1 year ago

Hello,

Test URL: https://orientxxi.info/fa
Trafilatura version: 1.6.2

import trafilatura
downloaded = trafilatura.fetch_url("https://orientxxi.info/fa")
trafilatura.extract(downloaded, output_format="json")

I am wondering why the title is not the one provided in the HTML <title> element. Trafilatura returns a long sentence:

{"title": "به زبانهای دیگر Yémen. Une paix qui se fait attendre Laurent Bonnefoy · 21 septembre أوسلو، نموذج للفشل دانيال ليفي · 21 أيلول (سبتمبر) موقع “أوريان 21” يدعوكم للاحتفال بعيد ميلاده العاشر! · 20 أيلول (سبتمبر) Petroleum. Turkey vs. Iraq, but the Kurds are Collateral Victims Benoît Drevet · 20 September El doble estándar de Egipto para acoger a sus “huéspedes” sudaneses Séverine Evanno · 1ro de septiembre Khaled El Qaisi, colpevole di Palestina Cecilia Dalla Negra · 18 settembre", "author": null,....

Thanks!

adbar commented 1 year ago

Hi @kinoute, I think it could be because of a tag mismatch (malformed HTML) just before the text segments: <h2 class="indication">به زبانهای دیگر</h3> It implies that all that follows is a title.
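A small illustration of the parser's dilemma (a hypothetical reduction of the markup; the English strings are made up):

```python
from lxml import html

# A heading opened as <h2> but closed with the mismatched </h3>.
broken = ('<div><h2 class="indication">In other languages</h3>'
          '<p>Teaser one</p><p>Teaser two</p></div>')
doc = html.fromstring(broken)

# The stray </h3> never matches an opening tag, so the parser has to guess
# where the heading ends; depending on its recovery strategy, the teasers
# that follow can be swallowed into the <h2>, which a title heuristic then
# reads as one very long title.
print(html.tostring(doc, pretty_print=True, encoding="unicode"))
```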

Please note that the extraction doesn't work as well on homepages in general.

sepsi77 commented 1 year ago

Hi @adbar,

I ran into extraction issues.

URL: https://microsoft.github.io/autogen/docs/Use-Cases/enhanced_inference/

Output: ! d o c t y p e h t m l >

I also tested the html2txt function and it didn't work any better. It gave me this output:

h t m l c l a s s = " d o c s - v e r s i o n - c u r r e n t " l a n g = " e n " d i r = " l t r " >

I run the HTML scraping outside of trafilatura. I confirmed that we receive the full HTML, but something in it seems to trip up the extraction.

I used trafilatura.extract() and passed the HTML code as a string into the function. I tested different settings for the favor_recall and favor_precision arguments; they didn't change the output in any significant way. I also tested the trafilatura.baseline() function and it yielded similar results.

adbar commented 1 year ago

@sepsi77 There are LXML-related issues on macOS with M1, M2, etc. chips (see also https://github.com/adbar/trafilatura/issues/166). Is that the platform you're using, or can you provide more details?

sepsi77 commented 1 year ago

@adbar yes, I'm on M1 MacBook

adbar commented 1 year ago

Did you try building LXML from source?

sepsi77 commented 1 year ago

I can't seem to get it to work. I'm new to this level of system tweaking. The installation fails because of missing precompiled Cython files, and running it with the --without-cython flag doesn't work either.

RuntimeError: ERROR: Trying to build without Cython, but pre-generated 'src/lxml/etree.c' is not available (to ignore this error, pass --without-cython or set environment variable WITHOUT_CYTHON=true).

I think I'll just move the script into a Docker container and see if that helps.

kinoute commented 1 year ago

@adbar Thanks for your answer on my previous case. I have another one! Doing something like:

trafi_extraction = trafilatura.extract(
    response.decode(errors='ignore'),
    output_format='json',
    include_images=False,
    date_extraction_params={
        'extensive_search': True,
        'original_date': True,
        'min_date': EARLIEST_VALID_DATE,
    },
    include_comments=False,
)

trafilatura_data = trafi_extraction and json.loads(trafi_extraction)

        trafilatura_data = trafi_extraction and json.loads(trafi_extraction)

Returns

json.decoder.JSONDecodeError: Invalid \escape: line 1 column 2947 (char 2946)

For this URL: http://sport.kurganobl.ru/8980.html

trafi_extraction contains:

{"title": null, "author": null, "hostname": null, "date": "2016-12-12", "categories": "", "tags": "", "fingerprint": "6920faf8766bf202", "id": null, "license": null, "comments": null, "raw_text": ", 8 . 350 35 . 1000 1000 , . 21 8 .  7 300 , 01:08:25, .  ̀ .  1000 1000 div>    \n      \n         \n      \n -    \n         -2016   \n        \n     \n    \n        - \n    \n          \n        \n   -      \n        \n           \n     \n        \n , , ! -      \n       \n     \r\n8\r\n\r\n\r\n1000\r\n1000\r\n      \n     \n       \n          \n        \n ,    -     \n    \n        \n     II  -    \n        \n   - \n     \n       \n    ! \n          \n           \n ++ =     \n         \n         \n       \n      \n  \r\n8\r\n \r\n\r\n1000\r\n1000\r\n     ZauraLife \n      150     \n       \n     76-        \n 3     :   \n           2015  \n   \n             \n    \n       \n      \n        \n     \n          \n       \n       \n     \n     \n   \n     -    \n          \n  \r\n8\r\n \r\n\r\n1000\r\n1000\r\n - 2016  \n           \n          \n     \n       \n             \n       \n    \n     ! \n         \n    \n     \n       \n    - 2016     \n       \n     \n     \n     ! \n     \n      \n    \n    \n     \n   \r\n8\r\n \r\n\r\n1000\r\n1000\r\n \n     \n    \n      \n    \n   - \n       \n           \n        \n   \n          - \n     -   2015 ? 
\n          \n        \n             \n        ( ) \n             \n        \n            \n         \n   \r\n8\r\n \r\n\r\n1000\r\n1000\r\n      \n       \n       \n    \n        \n -      \n          \n     \n         \n          \n    II  -   \n         -  \n       \n      \n         \n          \n    \n      \n           \n    \n          \n         \n                \n       \n           \n       \n      \n            \n       \n            \n          \n      \n        \n         \n            \n       \n          \n        \n      \n       \n      \n         \n           \n          \n           \n         \n            \n      \n     \n     \n       \n     \n           \n        \n , ,   !  \n        \n        \n     \n  ,     \n   2015       \n             \n             \n        2016      \n       \n       , ,   ! \n        \n          \n      \n          \n -     \n           \n    \n     \n     ! \n     \n      \n           \n       \n       \n    .   - . \n           \n         \n <\r\n\r\n1000\r\n1000\r\na href=\"8444.html\" title=\"  \">   \n      \n          \n         \n  -    -    2015  \n         \n         \n    \n          \n         ZauraLife \n    \n       \n      \n      \n   \n    \n    \n      \n     \n         \n XXVII      \n        \n          \n          \n !      \n    \n         \n         \n     \n XXVII       \n     \n        \n        \n      -   -    \n            \n          \n   -  \n       \n      26  \n        ? \n          \n !      2016 \n   38", "text": ", 8 . 350 35 .\n1000 1000 , . 21 8 .\n7 300 , 01:08:25, .\ǹ .\n1000 1000 div>\n-\n-2016\n-\n-\n, , ! -\n8\n1000\n1000\n, -\nII -\n-\n!\n++ =\n8\n1000\n1000\nZauraLife\n150\n76-\n3 :\n2015\n-\n8\n1000\n1000\n- 2016\n!\n- 2016\n!\n8\n1000\n1000\n-\n-\n- 2015 ?\n( )\n8\n1000\n1000\n-\nII -\n-\n, , !\n,\n2015\n2016\n, , !\n-\n!\n. - .\n<\n1000\n1000\na href=\"8444.html\" title=\" \">\n- - 2015\nZauraLife\nXXVII\n!\nXXVII\n- -\n-\n26\n?\n! 
2016\n38", "language": null, "image": null, "pagetype": null, "source": null, "source-hostname": null, "excerpt": null}

Edit: Right now I am handling this with this method:

    def fix_invalid_escapes(self, s):
        # This regex matches a backslash not followed by a valid JSON escape
        return re.sub(r'\\(?![/bfnrt"\\u])', r'\\\\', s)

But I think maybe Trafilatura could handle this natively? (I'm not even sure my fix is enough/good)
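A standalone version of the workaround above, for illustration (note that the regex treats any backslash not opening a recognized JSON escape as literal, and it does not validate the four hex digits after \u):

```python
import json
import re

def fix_invalid_escapes(s):
    # Double any backslash that is not followed by a valid JSON escape
    # character; this makes the backslash literal for the JSON parser.
    return re.sub(r'\\(?![/bfnrt"\\u])', r'\\\\', s)

broken = '{"text": "path\\qfile"}'  # \q is not a valid JSON escape
fixed = fix_invalid_escapes(broken)
print(json.loads(fixed)["text"])    # path\qfile
```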

adbar commented 1 year ago

Hi @kinoute, there must be something wrong in the way you encode or decode the HTML response, I cannot reproduce the bug: trafilatura -u "http://sport.kurganobl.ru/8980.html" --json works on my computer.

adbar commented 1 year ago

@sepsi77 Please note that brew can now be used to install Trafilatura on MacOS in a seamless way: https://formulae.brew.sh/formula/trafilatura

sepsi77 commented 1 year ago

Thanks @adbar, using brew to install trafilatura fixed the problem.

hugoobauer commented 10 months ago

Hi there, I'm not sure this is the right thread, but here's the problem I'm having. Some sites have more than one <article> node for a single article: https://conselhos-desportivos.decathlon.pt/guia-de-treino-para-gluteos

The XPath that extracts the text is (.//article)[1], so it only extracts the first paragraph. Do you have a solution in mind? Do you think modifying the XPath to retrieve all <article> nodes and iterating over them to concatenate their content is a good solution?

adbar commented 10 months ago

Hi @hugoobauer, this problem is also mentioned in #432. The problem with taking all article elements is that sometimes they are related content and not main content (e.g. a list of teasers at the end of a page). IMHO this is an improper use of the <article> tag but I'm not sure what to do about it: the XPath would have to be changed or a new heuristic on content length added.

hugoobauer commented 10 months ago

Hi @adbar, I completely agree that this is a misuse of <article>. I'm looking for a way to extract all the "relevant" content from a page, even if I take a bit too much. In this case, retrieving info at the bottom of the page that's more or less related to the article bothers me less than missing the majority of an article's content.

So I made a little POC to test a solution:

If there's only one item, it's the same as before. Otherwise, I create a new node with the same tag (article in this case) and insert each child of every matched node into it. In addition, we could check whether the favor_recall option is enabled so that this isn't done by default, and use the MIN_EXTRACTED_SIZE value to keep only those elements that are long enough. What do you think? I've only been studying the repository for a short time, so I may have missed something.
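The POC could look roughly like this (a sketch under assumptions: merge_articles and min_length are made-up names, trafilatura's real candidate selection is more involved, and the sketch moves child elements while ignoring each article's leading .text):

```python
from lxml import etree, html

def merge_articles(tree, min_length=0):
    """Merge all <article> elements into a single synthetic one so that
    downstream extraction sees one body candidate."""
    articles = tree.findall(".//article")
    if len(articles) <= 1:
        return articles[0] if articles else None
    merged = etree.Element("article")
    for article in articles:
        # Optionally skip short articles (e.g. teaser lists) via min_length.
        if len(" ".join(article.itertext())) < min_length:
            continue
        for child in list(article):
            merged.append(child)  # append() moves the node into merged
    return merged

doc = html.fromstring("<body><article><p>part one</p></article>"
                      "<article><p>part two</p></article></body>")
merged = merge_articles(doc)
print([p.text for p in merged.findall("p")])  # ['part one', 'part two']
```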

adbar commented 10 months ago

@hugoobauer Your idea looks good. The length heuristic would have to run on whole <article> elements and I'm not sure how.

In any case, feel free to draft a pull request for this or for another issue. You can add a test case somewhere in tests/unit_tests.py and the tests have to pass (realworld_tests.py are also relevant here). You can also check the benchmark in the tests/ folder to see if performance improves.

hugoobauer commented 10 months ago

Okay great, I will work on a PR soon

Sang12-2017-18 commented 7 months ago

Hi @adbar

I am having an issue with this URL - https://www.energyvault.com/about#leaders. I am not able to extract the text from it. Here's the code I am using:

from trafilatura import bare_extraction, fetch_url

def get_text(url=None, html_text=None):
    if not url and not html_text:
        raise ValueError("Either 'url' or 'html_text' must be provided")
    html_string = html_text if html_text else fetch_url(url)
    extracted_data = bare_extraction(html_string,
                                     include_links=True,
                                     include_formatting=True,
                                     include_images=True,
                                     include_tables=True)
    return extracted_data["text"] if extracted_data else None

if __name__ == "__main__":
    url = "https://www.energyvault.com/about#leaders"
    text = get_text(url=url)
    print(text)

When I debugged it a little, I found it throws an exception with the following traceback:

Traceback (most recent call last):
  File "/lib/python3.11/site-packages/trafilatura/core.py", line 921, in bare_extraction
    document = extract_metadata(tree, url, date_extraction_params, no_fallback, author_blacklist)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/trafilatura/metadata.py", line 535, in extract_metadata
    metadata.date = find_date(tree, **date_config)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/htmldate/core.py", line 986, in find_date
    return converted or search_page(htmlstring, options)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/htmldate/core.py", line 724, in search_page
    dateobject = datetime(int(bestmatch[1]), int(bestmatch[2]), 1)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: month must be in 1..12

Let me know if you need any more info.

adbar commented 7 months ago

Hi @Sang12-2017-18, I cannot reproduce the bug as such but something is odd with this webpage. Do you use the latest version of the trafilatura and htmldate packages? If so, please file an issue on the htmldate repository.

Sang12-2017-18 commented 7 months ago

Hi @adbar, thank you for the quick response. I have the latest versions of trafilatura (v1.8.0) and htmldate (v1.8.0). I'll surely file an issue in the htmldate repository. Before that, I wanted to ask one thing: for my use case, extracting the publication date is not necessary. I'm fine if the date comes back as None, but I want the other fields like text, author, etc. Is there a configuration option to exclude dates during extraction while keeping the other metadata?

adbar commented 7 months ago

@Sang12-2017-18 So far there is no such option. I still cannot reproduce the error, how did you get the traceback?

adbar commented 7 months ago

@Sang12-2017-18 The bug is now fixed in htmldate version 1.8.1. As for an option to bypass metadata extraction, I'm going to add it to the to-do list.

tysonite commented 1 month ago

The web page I am trying to parse is not a modern one and is in Russian. I'm not sure it's worth the effort to support parsing such pages, but I'm reporting it here anyway.

URL: https://engelsky--sar.sudrf.ru/modules.php?name=sud_delo&srv_num=1&name_op=case&case_id=256803033&case_uid=6faab311-51a0-4d06-aa03-6293266f991f&result=0&delo_id=1540005&new=

Screenshot of absent content (including content on other tabs) I am interested in:


I tried with CLI:

trafilatura -u "https://engelsky--sar.sudrf.ru/modules.php?name=sud_delo&srv_num=1&name_op=case&case_id=256803033&case_uid=6faab311-51a0-4d06-aa03-6293266f991f&result=0&delo_id=1540005&new=" --no-comments --recall --xmltei