adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

No metadata extraction #27

Closed phongtnit closed 3 years ago

phongtnit commented 3 years ago

Hello,

Thanks for your beautiful and powerful project. I am testing some websites with trafilatura 0.6.0 on Python 3.8.

My test:

import trafilatura
from trafilatura.core import bare_extraction

downloaded = trafilatura.fetch_url('https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/')

result = bare_extraction(downloaded, include_formatting=False, with_metadata=True)

print(result)

The results: ({'title': None, 'author': None, 'url': None, 'hostname': None, 'description': None, 'sitename': None, 'date': None, 'categories': None, 'tags': None, 'fingerprint': None, 'id': None}, 'Leader spotlight: Erin Spiceland Every March we recognize the women who have shaped history—and now, we’re taking a look forward. From driving software development in large companies to maintaining thriving open source communities, we’re spending Women’s History Month with women leaders who are making history every day in the tech community. Erin Spiceland is a Software Engineer for SpaceX. Born and raised in rural south Georgia, she is a Choctaw and Chickasaw mother of two now living in downtown Los Angeles. Erin didn’t finish college—she’s a predominantly self-taught software engineer. In her spare time, she makes handmade Native American beadwork and regalia and attends powwows. How would you summarize your career (so far) in a single sentence? My career has been a winding road through periods of stimulation and health as well as periods of personal misery. During it all, I’ve learned a variety of programming languages and technologies while working on a diverse array of products and services. I’m a domestic abuse survivor and a Choctaw bisexual polyamorous woman. I’m so proud of myself that I made it this far considering where I came from. What was your first job in tech like? In 2007, I had a three-year-old daughter and I was trying to finish my computer science degree one class at a time, all while keeping my house and family running smoothly. I found the math classes exciting and quickly finished my math minor, leaving only computer science classes. I was looking at about five years before I would graduate. Then, my husband at the time recommended me for an entry software developer position at a telecom and digital communications company. 
When faced with the choice between an expensive computer science degree and getting paid to do what I loved, I dropped out of college and accepted the job. I was hired to work on internal tooling, and eventually, products. I did a lot of development on product front-ends, embedded network devices, and a distributed platform-as-a-service. I learned Java/JSP, Python, JavaScript/CSS, Node.js, as well as MySQL, PostgreSQL, and distributed systems architecture. It was an intense experience that required a lot of self-teaching, asking others for help, and daycare, but it set me up for my later successes. What does leadership mean to you in your current role? “Leadership is about enabling those below, above, and around you to be at their healthiest and most effective so that all of you can accurately understand your surroundings, make effective plans and goals for the future, and achieve those goals.” I appreciate and admire technical, effective leaders who care for their reports as humans, not as lines on a burndown chart, and forego heavy-handed direction in favor of communication and mutual dialogue. I think it’s as important for a leader to concern herself with her coworkers’ personal well-being as it is for her to direct their performance. What’s the biggest career risk you’ve ever taken? What did you learn from that experience? Last year I took a pay cut to move from a safe, easy job where I had security to work in a language I hadn’t seen in years and with systems more complicated than anything I’d worked with before. I moved from a place where I had a huge four bedroom house to a studio apartment that was twice the price. I moved away from my children, of who I share custody with my ex-husband. We fly across the U.S. to see each other now. I miss my children every day. However, I get to be a wonderful role model for them. 
“I get to show my children that a Native woman who grew up in poverty, lost her mother and her culture, and who didn’t finish college can learn, grow, and build whatever career and life she wants.” What are you looking forward to next? I can’t wait to wake up every day with my partner who loves me so much. I’m looking forward to showing my children exactly how far they can go. I’m excited to keep exploring Los Angeles. “I expect to learn so much more about software and about life, and I want to experience everything.” Want to know more about Erin Spiceland? Follow them on GitHub or Twitter. Want to learn more about featured leaders for Women’s History Month? Read about: Laura Frank Tacho, Director of Engineering at CloudBees Rachel White, Developer Experience Lead at American Express Kathy Pham, Computer Scientist and Product Leader at Mozilla and Harvard Heidy Khlaaf, Research Consultant at Adelard LLP Check back in soon—we’ll be adding new interviews weekly throughout March.', <Element body at 0x10680a280>, <Element body at 0x1067af080>)

So no metadata is returned.

Also, I added an XPath expression to metaxpaths.py and rebuilt your code. I'm sure that //div[contains(@class, "post__categories")]//li//a matches a category on the page https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/, but no category is returned.

categories_xpaths = [
    """//div[starts-with(@class, 'post-info') or starts-with(@class, 'postinfo') or
    starts-with(@class, 'post-meta') or starts-with(@class, 'postmeta') or
    starts-with(@class, 'meta') or starts-with(@class, 'entry-meta') or starts-with(@class, 'entry-info') or
    starts-with(@class, 'entry-utility') or starts-with(@id, 'postpath')]//a""",
    "//p[starts-with(@class, 'postmeta') or starts-with(@class, 'entry-categories') or @class='postinfo' or @id='filedunder']//a",
    "//footer[starts-with(@class, 'entry-meta') or starts-with(@class, 'entry-footer') or starts-with(@class, 'post-info')]//a",
    '//*[(self::li or self::span)][@class="post-category" or starts-with(@class, "post__categories") or @class="postcategory" or @class="entry-category"]//a',
    '//header[@class="entry-header"]//a',
    '//div[@class="row" or @class="tags"]//a',
    '//div[contains(@class, "post__categories")]//li//a',
]

Another question: could I get the content of the article with its HTML formatting (without the tags being cleaned from the content)?

Please help me, thanks for your support!

adbar commented 3 years ago

Hi, thanks for testing the software package. The first problem you're dealing with comes from the fact that you use the internal bare_extraction() instead of the expected extract(). Please try it that way:

from trafilatura import extract
result = extract(downloaded, output_format='json')

The XPath expression you added for the metadata looks interesting, you could consider making a pull request that I could review and eventually include.

There is currently no way to output straight HTML, but the XML output format features tags that preserve structure and formatting.
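
As a rough illustration of consuming the JSON output mentioned above, the sketch below parses a JSON string and separates the body text from the metadata. It does not call trafilatura: `sample_output` is a made-up stand-in for the string that extract(downloaded, output_format='json') would return, and the key names (including "text") are assumptions modelled on the metadata fields printed earlier in this thread. `split_metadata` is a hypothetical helper, not part of the library.

```python
import json

# Hypothetical stand-in for the string returned by
# extract(downloaded, output_format='json'). The keys are assumptions
# based on the metadata fields shown earlier in this thread.
sample_output = json.dumps({
    "title": "Leader spotlight: Erin Spiceland",
    "hostname": "github.blog",
    "categories": None,
    "text": "Every March we recognize the women who have shaped history...",
})


def split_metadata(json_string, text_key="text"):
    """Parse a JSON string and return (metadata dict, body text).

    Hypothetical helper: the body text is assumed to live under `text_key`.
    """
    record = json.loads(json_string)
    text = record.pop(text_key, None)  # remove the body, keep the rest
    return record, text


metadata, text = split_metadata(sample_output)
print(metadata["title"])
print(metadata["hostname"])
```

With real output, it is worth checking which keys are actually present before relying on them, since the field names may differ between versions.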

dmoklaf commented 3 years ago

Maybe the real point is that it's not possible to get a Python data structure out of the Trafilatura API. Currently we have to request XML from Trafilatura and re-parse it to get the metadata fields, thus incurring XML formatting and parsing overhead. An extract_fields method returning a Python namedtuple would make the whole process simpler and faster.
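
To make the workaround described above concrete, here is a minimal sketch of re-parsing an XML string into a namedtuple using only the standard library. The XML sample is invented for illustration: the `doc` and `main` element names and the attribute names are assumptions, not a statement about trafilatura's actual XML schema. `fields_from_xml` and `Document` are hypothetical names.

```python
import xml.etree.ElementTree as ET
from collections import namedtuple

# Invented sample; element and attribute names are assumptions.
sample_xml = (
    '<doc title="Leader spotlight: Erin Spiceland" hostname="github.blog">'
    "<main>"
    "<p>Every March we recognize the women who have shaped history.</p>"
    "<p>Erin Spiceland is a Software Engineer for SpaceX.</p>"
    "</main>"
    "</doc>"
)

Document = namedtuple("Document", ["title", "hostname", "text"])


def fields_from_xml(xml_string):
    """Re-parse an XML string and return its fields as a namedtuple."""
    root = ET.fromstring(xml_string)
    main = root.find("main")
    paragraphs = []
    if main is not None:
        # Join the text of each paragraph, including nested inline elements.
        paragraphs = ["".join(p.itertext()).strip() for p in main.iter("p")]
    return Document(
        title=root.get("title"),
        hostname=root.get("hostname"),
        text="\n".join(paragraphs),
    )


doc = fields_from_xml(sample_xml)
print(doc.title)
print(doc.text)
```

The serialize-then-parse round trip shown here is exactly the overhead the comment objects to; a function that returns the Python structure directly would skip both steps.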

adbar commented 3 years ago

Yes, that's a fact, I'll see if I can implement such a function.

dmoklaf commented 3 years ago

Don't hesitate if you need any help! It's an outstanding library

adbar commented 3 years ago

Just wrote a quick fix (9893e575baf0659be28aecf85cb9843ddeff4c5a). You can try it by installing the latest version from the repository: pip install -U git+https://github.com/adbar/trafilatura.git

>>> import trafilatura
>>> from trafilatura.core import bare_extraction
>>> downloaded = trafilatura.fetch_url('https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/')
>>> bare_extraction(downloaded)

@phongtnit @rgeronimi Does it answer your concerns?

dmoklaf commented 3 years ago

Yes!

I just tried it, it triggers a new exception (that does not occur with regular extract(...) calls):

  File "/usr/local/Caskroom/miniconda/base/envs/myenv/lib/python3.7/site-packages/trafilatura/core.py", line 636, in bare_extraction
    docmeta['comments'] = xmltotxt(commentsbody)
  File "/usr/local/Caskroom/miniconda/base/envs/myenv/lib/python3.7/site-packages/trafilatura/xml.py", line 197, in xmltotxt
    for element in xmloutput.xpath('//hi|//link'):
AttributeError: 'NoneType' object has no attribute 'xpath'

adbar commented 3 years ago

Thanks, I guess it's because you turned the comment output off, it should be fixed now.
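
The traceback above shows a text conversion being called on a comments tree that is None when comment output is turned off. The sketch below shows one generic way to guard against that case. It is not the fix that was committed to the repository: `tree_to_text` is a hypothetical helper written with the standard library, standing in for the library's own conversion function.

```python
import xml.etree.ElementTree as ET


def tree_to_text(tree):
    """Return the text content of an element tree, or '' if there is no tree.

    Hypothetical helper: the None check avoids calling methods on a
    missing tree, which is what raised the AttributeError above.
    """
    if tree is None:
        return ""
    chunks = (chunk.strip() for chunk in tree.itertext())
    return " ".join(chunk for chunk in chunks if chunk)


body = ET.fromstring("<body><p>Main text.</p></body>")
comments = None  # e.g. comment extraction was disabled

record = {
    "text": tree_to_text(body),
    "comments": tree_to_text(comments),
}
print(record)
```

Returning an empty string keeps the output dictionary uniform, so callers do not need to special-case a missing comments field.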

dmoklaf commented 3 years ago

Yes I did turn it off.

It works! And very well: CPU load is reduced, as I save the extra XML format/parse stage.

Thanks!

phongtnit commented 3 years ago

Perfect, thanks @adbar and sorry for my late response.

Trafilatura is outstanding and extracts text very fast in comparison with news-please and newspaper. I'm using these libraries because they include images in the extracted article.

I have another request about including images in an article. It could be an option like bare_extraction(downloaded, include_images=False,...) for the metadata and the structured XML object. By default, images would not be included in an article. My research projects need to have the images inside an article. I can implement the feature and submit a PR. After that, I could use trafilatura in my projects and maintain the feature in the future.

Another point: I think you could consider taking the text of HTML tags inside an article to create tags. It would be useful in addition to using XPath expressions for extracting tags.

adbar commented 3 years ago

Thanks for your feedback. Images are out of scope from my perspective, so please make a PR as you mentioned to include an "include_images" option if you're interested.

I'm not sure I understand what you mean by "get text in a tag inside an article for creating tags", could you please give me an example?

phongtnit commented 3 years ago

I will work on adding the image feature and make a pull request as soon as possible.

About getting the text of an HTML tag inside an article: for example, in this article, you could take the text of HTML tags inside the body, such as "machine learning techniques", "logistic regression" and more (Imgur). From that text, we can understand the main topics of the article.

adbar commented 3 years ago

You mean that <p>You should be familiar with basic machine learning techniques like ... could become <p>You should be familiar with basic <link>machine learning techniques</link> like ... ?

phongtnit commented 3 years ago

Similarly, I mean that <p>You should be familiar with basic machine learning techniques like ... would become <p>You should be familiar with basic machine learning techniques like ... and the tags metadata (or maybe other metadata) would be merged with ['machine learning techniques', ...], especially for internal links in the article body. This is an idea for helping machine learning understand the key topics of an article.
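
The idea above can be sketched as follows: collect the text of links in the article body and treat the phrases as candidate tags. This uses only the standard library's html.parser and is not an existing trafilatura feature. `LinkTextCollector` is a hypothetical class, and the sample HTML (including its href values) is invented around the sentence quoted in this thread.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect the text found inside <a> elements as candidate tags."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # greater than 0 while inside an <a> element
        self._buffer = []    # text chunks of the link being read
        self.phrases = []    # unique link texts, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Normalize whitespace and skip empty or repeated phrases.
                phrase = " ".join("".join(self._buffer).split())
                if phrase and phrase not in self.phrases:
                    self.phrases.append(phrase)
                self._buffer = []

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


# Invented sample; the href values are placeholders.
sample_html = (
    "<p>You should be familiar with basic "
    '<a href="/ml">machine learning techniques</a> like '
    '<a href="/logreg">logistic regression</a>.</p>'
)

collector = LinkTextCollector()
collector.feed(sample_html)
collector.close()
candidate_tags = collector.phrases
print(candidate_tags)
```

In practice the candidates would probably need filtering (for example, keeping only internal links or dropping navigation text) before being merged into the tags metadata.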

adbar commented 3 years ago

Ok I understand but I'm afraid it isn't part of my priorities right now, feel free to write a PR if you wish to implement such a functionality.

phongtnit commented 3 years ago

Thanks, I will make a PR for this after you accept my PR about 'adding extract image feature', because I need extracted images in my research.

adbar commented 3 years ago

Hi @phongtnit, I modified and accepted the PR in 80bdd710724d9e1ae7bbf01993347f4f07192ae9.

Do you have something else to add?