We can refer to f8c7f6c. The goal is to transform an HTML article into a Document object.
Hi, I am working on the new feature so that Wikipedia pages can be used for tests. Since HTML is essentially a superset of Markdown, it's natural that dozens of projects that convert HTML into Markdown already exist. I found a good one, html2text. Hopefully, the issue can be resolved soon with this existing solution.
Question: why don't we just use the XML format as the standard Document format?
XML and JSON are more or less equivalent. But HTML has an awkward feature: a section is not represented by a surrounding tag, only by the tag of its heading, like <h1>, <h2>, <h3>. LaTeX is the same, using \section, \subsection, \subsubsection, etc. Therefore, HTML and LaTeX do not fully give the tree structure. The Document class is mainly for converting them into a tree structure. Once the content is fully a tree, JSON and XML are equivalent, so a Python object suffices.
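To illustrate the point (hypothetical markup, not from the repo): the two headings below are siblings in the HTML, even though logically the <h2> section belongs inside the <h1> section, so the tree has to be reconstructed.

```python
# Flat HTML: headings mark where sections start but do not enclose their content.
flat_html = """
<h1>Introduction</h1>
<p>Intro text.</p>
<h2>Background</h2>
<p>Background text.</p>
"""

# The kind of tree the Document class is meant to recover
# (the dict shape here is just an illustration, not the real class):
tree = {
    "title": "Introduction",
    "content": "Intro text.",
    "children": [
        {"title": "Background", "content": "Background text.", "children": []},
    ],
}
```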
I think you can develop something to extract articles from certain websites, including Wikipedia, the Stanford Encyclopedia of Philosophy, etc. You may use Beautiful Soup to achieve this. Then you can get the section structure, like what I did in latex.py. After that, you can turn the content of each section into Markdown using python-markdownify.
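A rough sketch of how this flow could fit together, assuming a flat HTML body where headings and paragraphs are siblings (real pages, including Wikipedia's, wrap things in extra containers, so this is not drop-in code). The dict shape is a hypothetical stand-in for the Document class.

```python
# Sketch: treat <h1>/<h2>/<h3> as section boundaries and convert each
# section's body to Markdown with markdownify. Not the actual implementation.
import requests
from bs4 import BeautifulSoup
from markdownify import markdownify as md

HEADING_LEVELS = {"h1": 1, "h2": 2, "h3": 3}

def extract_sections(url: str) -> dict:
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    root = {"title": url, "level": 0, "content": "", "children": []}
    stack = [root]   # currently open sections, outermost first
    buffer = []      # raw HTML collected for the deepest open section

    def flush():
        stack[-1]["content"] = md("".join(buffer)).strip()
        buffer.clear()

    body = soup.body or soup
    for tag in body.find_all(recursive=False):
        level = HEADING_LEVELS.get(tag.name)
        if level is None:
            buffer.append(str(tag))      # ordinary content, kept as HTML for now
            continue
        flush()                          # close the section collected so far
        while stack[-1]["level"] >= level:
            stack.pop()                  # headings don't nest, so pop back up
        section = {"title": tag.get_text(strip=True),
                   "level": level, "content": "", "children": []}
        stack[-1]["children"].append(section)
        stack.append(section)
    flush()
    return root
```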
Hi Zijian. I think your suggestion has already been implemented by various NLP packages, including LangChain (https://python.langchain.com/docs/integrations/document_transformers). I am currently using it for text extraction from arbitrary URLs. (They also provide various other functions, for example extraction from Google Docs.) There is no need to reinvent the wheel; we can just use it as a dependency. What do you think?
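For reference, this is roughly the kind of extraction I mean (a minimal sketch; import paths have moved between langchain releases, and the URL is just an example):

```python
# Fetch a page and strip it down to text with LangChain's html2text transformer.
from langchain_community.document_loaders import AsyncHtmlLoader
from langchain_community.document_transformers import Html2TextTransformer

urls = ["https://en.wikipedia.org/wiki/Philosophy"]      # example URL
docs = AsyncHtmlLoader(urls).load()                      # fetch raw HTML pages
docs = Html2TextTransformer().transform_documents(docs)  # HTML -> Markdown-ish text
print(docs[0].page_content[:500])
```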
Sounds great. Thanks to langchain.
HTML parsing is now done; the implementation only used html2text. We can add new features with beautifulsoup or langchain in a later update. The issue can be closed for now.
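For context, the html2text step looks roughly like this (a minimal sketch, not the repo's actual code; the options shown are just ones we would likely tweak):

```python
# Convert an HTML fragment to Markdown with html2text.
import html2text

converter = html2text.HTML2Text()
converter.ignore_links = False   # keep hyperlinks as Markdown links
converter.body_width = 0         # don't hard-wrap output lines

markdown = converter.handle("<h1>Title</h1><p>Some <b>HTML</b> body.</p>")
print(markdown)
```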
We need to make the HTML data cleaner so that we can extract articles from web pages reliably.