HTML data cleaner - Githubissues

doomspec commented 1 year ago

We need to build an HTML data cleaner so that we can extract articles from web pages.

Important notice:

HTML may include elements such as images that language models cannot process. We need to remove them.
The output format can follow that of the existing LaTeX data cleaner.
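
As a rough illustration of the first point, here is a minimal sketch that keeps only the text of an HTML fragment, using the standard-library `html.parser`. The names `TextOnlyParser`, `strip_non_text`, and the `SKIPPED_CONTAINERS` set are hypothetical and not part of the project; the issue only names images explicitly, so the other skipped tags are an assumption.

```python
from html.parser import HTMLParser

# Assumption: besides images, these containers carry no article text.
SKIPPED_CONTAINERS = {"script", "style", "svg"}


class TextOnlyParser(HTMLParser):
    """Collects text nodes and ignores tags such as <img>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a skipped container

    def handle_starttag(self, tag, attrs):
        # <img> has no text content, so it disappears simply by being ignored.
        if tag in SKIPPED_CONTAINERS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_CONTAINERS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def strip_non_text(html):
    """Return the visible text of `html`, joined by single spaces."""
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


cleaned = strip_non_text('<p>Hello <img src="a.png"> world</p><script>var x = 1;</script>')
```

This sketch discards all markup, so it loses formatting such as links and emphasis; a real cleaner would likely keep more structure.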

doomspec commented 1 year ago

We can refer to f8c7f6c. The goal is to transform an HTML article into a Document object.

EigenSolver commented 1 year ago

Hi, I am working on the new feature so that Wikipedia pages can be used for testing. As HTML is essentially a superset of Markdown, it's natural that dozens of projects that convert HTML into Markdown already exist. I found a good one, html2text. Hopefully, the issue can be resolved soon with this existing solution.

EigenSolver commented 1 year ago

Question: why don't we just use XML as the standard Document format?

doomspec commented 1 year ago

XML and JSON are more or less equivalent. But HTML has a drawback: a section is not represented by a surrounding tag, only by the tag of its title, such as <h1>, <h2>, <h3>, and so on. LaTeX works the same way, using \section, \subsection, \subsubsection, etc. Therefore, HTML and LaTeX do not directly give the tree structure. The Document class mainly exists to convert them into a tree structure. Once the content is fully a tree, JSON and XML are equivalent, so a Python object suffices.
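
To make the point about flat headings concrete, here is a minimal sketch of turning a flat run of heading tags into a tree, using only the standard library. The `Section` dataclass, `HeadingTreeParser`, and `build_tree` are hypothetical names for illustration; they are not the project's actual Document class, whose interface is not shown in this thread.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List

HEADING_LEVELS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}


@dataclass
class Section:
    """Hypothetical tree node: a title, its own text, and its subsections."""
    title: str
    level: int
    text: str = ""
    children: List["Section"] = field(default_factory=list)


class HeadingTreeParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Section(title="", level=0)
        self._stack = [self.root]  # path from the root to the open section
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_LEVELS:
            level = HEADING_LEVELS[tag]
            # A new heading closes every open section at the same or deeper level.
            while self._stack[-1].level >= level:
                self._stack.pop()
            section = Section(title="", level=level)
            self._stack[-1].children.append(section)
            self._stack.append(section)
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in HEADING_LEVELS:
            self._in_heading = False

    def handle_data(self, data):
        current = self._stack[-1]
        if self._in_heading:
            current.title += data
        else:
            current.text += data


def build_tree(html):
    """Parse `html` and return the root Section of the heading tree."""
    parser = HeadingTreeParser()
    parser.feed(html)
    parser.close()
    return parser.root


root = build_tree("<h1>A</h1><p>intro</p><h2>B</h2><p>b text</p><h2>C</h2><h1>D</h1>")
```

Here `root.children` holds the two top-level sections A and D, and A holds B and C as children. The same stack-based approach would apply to \section, \subsection, and \subsubsection in LaTeX.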

doomspec commented 1 year ago

I think you can develop something to extract articles from certain websites including Wikipedia, Stanford Encyclopedia of Philosophy, etc. You may use Beautiful Soup to achieve this.

Then you can recover the section structure, as I did in latex.py.

Finally, you can turn the content of each section into Markdown using python-markdownify.

EigenSolver commented 1 year ago

Hi Zijian. I think your suggestion has already been implemented by various NLP packages, including LangChain (https://python.langchain.com/docs/integrations/document_transformers). I am currently using it for text extraction from arbitrary URLs. (It also provides various other functions, for example extraction from Google Docs.) I think there is no need to reinvent the wheel; we can just use it as a dependency. What do you think?

doomspec commented 1 year ago

Sounds great. Thanks, LangChain.

EigenSolver commented 1 year ago

HTML parsing is done for now; the implementation uses only html2text. We can add new features with Beautiful Soup or LangChain in a later update. The issue can be closed for now.

EvoEvolver / EvoNote

HTML data cleaner #8