GoncaloMark / MunchLex

A simple HTML Lexer/Parser to wrap my head around lexers and parsers, the foundations of a compiler/interpreter. Since I love webscraping thought I'd give it a try!

MunchLex

Comments from the original author: "A simple HTML Lexer/Parser to wrap my head around lexers and parsers, the foundations of a compiler/interpreter. Since I love webscraping thought I'd give it a try!"

Introduction and Project Idea

Multi-Threaded Web Scraper in C

MunchLex is a multi-threaded web scraper written in C that revolves around the idea of a lexer/parser. The project began with the need to improve the performance of a Python-based web scraper. The original Python scraper, though effective, was sequential: each page was parsed line by line, which was time-consuming. Hence MunchLex, and the idea of a multi-threaded web scraper, was born.

MunchLex: Overview

The multi-threaded web scraper is designed to take advantage of concurrent execution by using multiple threads, a departure from traditional sequential web scrapers written in languages like Python. The use of multiple threads aims to improve the speed and efficiency of the scraper, making it capable of handling large-scale data extraction and processing.
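As a rough illustration of the concurrency idea described above, the sketch below splits a set of pages across POSIX threads and lets each thread work on its own slice. It is a minimal sketch under stated assumptions, not the project's actual threading code: the names `WorkerJob`, `worker_main`, `count_tags_parallel` and `MAX_WORKERS` are hypothetical, and counting `<` characters merely stands in for real lexing.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_WORKERS 8   /* hypothetical upper bound on worker threads */

/* Hypothetical unit of work: one contiguous slice of the input per thread. */
typedef struct {
    const char **pages;   /* shared, read-only input */
    size_t begin, end;    /* half-open range [begin, end) owned by this worker */
    size_t tags_seen;     /* per-thread result, so no locking is needed */
} WorkerJob;

/* Stand-in for real lexing: count '<' characters in each page of the slice. */
static void *worker_main(void *arg) {
    WorkerJob *job = arg;
    job->tags_seen = 0;
    for (size_t i = job->begin; i < job->end; i++)
        for (const char *p = job->pages[i]; *p != '\0'; p++)
            if (*p == '<') job->tags_seen++;
    return NULL;
}

/* Split `count` pages across up to `nthreads` workers.
 * Returns the combined count, or (size_t)-1 if a thread could not be created. */
size_t count_tags_parallel(const char **pages, size_t count, size_t nthreads) {
    pthread_t tids[MAX_WORKERS];
    WorkerJob jobs[MAX_WORKERS];
    if (nthreads == 0) nthreads = 1;
    if (nthreads > MAX_WORKERS) nthreads = MAX_WORKERS;

    size_t started = 0;
    int failed = 0;
    for (size_t t = 0; t < nthreads; t++) {
        jobs[t].pages = pages;
        jobs[t].begin = count * t / nthreads;        /* even split of the input */
        jobs[t].end = count * (t + 1) / nthreads;
        jobs[t].tags_seen = 0;
        if (pthread_create(&tids[t], NULL, worker_main, &jobs[t]) != 0) {
            failed = 1;
            break;
        }
        started++;
    }

    /* Wait for every thread that was started, then combine the results. */
    size_t total = 0;
    for (size_t t = 0; t < started; t++) {
        pthread_join(tids[t], NULL);
        total += jobs[t].tags_seen;
    }
    return failed ? (size_t)-1 : total;
}
```

Giving each thread its own result slot and a disjoint slice of the input avoids the need for a mutex in this sketch. A real scraper would likely also need synchronized access to shared outputs such as a log file. On older toolchains the code may need to be built with `-pthread`.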

Features

How to use MunchLex

This section contains the steps to get your own copy of MunchLex.

Working of the Project

Basic idea of the working

The web scraper works by using multiple threads to scrape data from web pages. The function 'munchLex' processes each line of the input file, tokenizes it, and then constructs a tree of tokens that represents the document structure. The resulting tree is also printed to the log file. In other words, the function performs the lexical analysis of the page.
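The tokenizing step described above can be sketched roughly as follows. This is a minimal illustration and not the actual 'munchLex' implementation: the `Token` type, the `TokenKind` values and the `tokenize_line` function are hypothetical names, attributes are ignored, and a `>` inside an attribute value is not handled.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds; the real project may use different names. */
typedef enum { TOK_OPEN_TAG, TOK_CLOSE_TAG, TOK_TEXT } TokenKind;

typedef struct {
    TokenKind kind;
    char text[64];   /* tag name or text content, truncated to fit */
} Token;

/* Copy up to sizeof(dst->text) - 1 bytes starting at `start` into the token. */
static void set_text(Token *dst, const char *start, size_t len) {
    if (len >= sizeof dst->text) len = sizeof dst->text - 1;
    memcpy(dst->text, start, len);
    dst->text[len] = '\0';
}

/* Split one line into tokens; returns the number of tokens written (<= max). */
size_t tokenize_line(const char *line, Token *out, size_t max) {
    size_t n = 0;
    const char *p = line;
    while (*p != '\0' && n < max) {
        if (*p == '\n' || *p == '\r') {      /* skip line endings left by fgets */
            p++;
        } else if (*p == '<') {
            const char *end = strchr(p, '>');
            if (end == NULL) break;          /* unterminated tag: stop here */
            const char *name = p + 1;
            out[n].kind = TOK_OPEN_TAG;
            if (*name == '/') {              /* "</tag>" is a closing tag */
                out[n].kind = TOK_CLOSE_TAG;
                name++;
            }
            /* The tag name ends at whitespace or '>'; attributes are dropped. */
            set_text(&out[n], name, strcspn(name, " \t>"));
            n++;
            p = end + 1;
        } else {
            /* A text run extends to the next tag or the end of the line. */
            size_t len = strcspn(p, "<\r\n");
            out[n].kind = TOK_TEXT;
            set_text(&out[n], p, len);
            n++;
            p += len;
        }
    }
    return n;
}
```

Under these assumptions, calling `tokenize_line("<p>Hello</p>", toks, 8)` yields three tokens: an opening `p` tag, the text `Hello`, and a closing `p` tag. A tree builder could then push a node for each opening tag and pop it on the matching closing tag to recover the document structure.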

Detailed working

Future Work

Conclusion

The multi-threaded web scraper is designed to be fast and efficient, and to handle large-scale data extraction and processing more efficiently than a sequential scraper.

Proudly brought to you by the MunchLex team and the open source community ❤️