GoncaloMark / MunchLex

A simple HTML Lexer/Parser to wrap my head around lexers and parsers, the foundations of a compiler/interpreter. Since I love web scraping, I thought I'd give it a try!

Fix Documentation (Readme.md) #8

Closed GoncaloMark closed 5 months ago

GoncaloMark commented 6 months ago

The documentation is very poor. It should cover how to run the project, its core aspects, future ideas, etc.

Jainil2004 commented 6 months ago

Hi @GoncaloMark, I am interested in solving this issue and contributing to your project. Can you please assign this issue to me? Also, please provide me with some steps, like how to set up the project. Thank you.

GoncaloMark commented 6 months ago

Hello @Jainil2004, first of all thank you very much for taking an interest in this project, I hope you have a nice time contributing to it! I'll get back to you in a few hours, as I'm currently planning a robotics event and have some meetings scheduled. I'll assign you, and later I'll give you those setup steps and elaborate on the issue! Thank you very much :)

Jainil2004 commented 5 months ago

@GoncaloMark Thank you so much for assigning me this issue. I look forward to working with you.

GoncaloMark commented 5 months ago

Okay Jainil, sorry for the wait! I would advise you to work off the dev branch: clone the repository, check out a new branch for yourself, and once you're done, create a PR to dev.

I advise you to work off dev because the master branch is the legacy version, which uses a semaphore. It works and outputs the syntax tree on stdout (I want to change this to log files!). The dev branch has my commits for the thread pool and for an argument parser I built for configuration.

Currently the dev branch isn't working well because of the thread pool solution. I've been working on it, and maybe tomorrow I can get it going correctly, since it's the start of the weekend and I'll have time available. So the dev branch is not quite parsing the HTML yet.
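For readers unfamiliar with the two approaches mentioned above, here is a minimal sketch of the difference between capping threads with a semaphore and reusing threads from a pool. It is written in Python for brevity and is not MunchLex code: `MAX_WORKERS`, `fake_parse`, and both `run_with_*` functions are hypothetical names, and the "parsing" is only a stand-in.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 4  # hypothetical limit, chosen only for this sketch


def fake_parse(html):
    """Stand-in for real parsing work: counts '<' characters."""
    return html.count("<")


def run_with_semaphore(documents):
    """Start one thread per document; a semaphore caps how many run at once."""
    gate = threading.Semaphore(MAX_WORKERS)
    results = [None] * len(documents)

    def worker(index, html):
        with gate:  # blocks while MAX_WORKERS threads are already inside
            results[index] = fake_parse(html)

    threads = [
        threading.Thread(target=worker, args=(i, doc))
        for i, doc in enumerate(documents)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def run_with_pool(documents):
    """Reuse a fixed set of threads; documents are queued as tasks."""
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        return list(pool.map(fake_parse, documents))


docs = ["<p>a</p>", "<div><span>b</span></div>", "<br>"]
print(run_with_semaphore(docs))
print(run_with_pool(docs))
```

Both functions return the same results. The semaphore version still creates a thread for every document, while the pool version creates at most `MAX_WORKERS` threads and feeds them work.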

To get it going, run: make all

This creates the executable by compiling and linking the project. Then you can enter the test dir and run the shell script in there. Take a look at the script, since it has the flags in it. The "-p" flag can be either static or Daemon, but Daemon is not implemented yet!

cd test
chmod +x run_multithread.sh
./run_multithread.sh

You can play with the arguments to see the difference. All functions, structs and enums are documented with Doxygen-style comments, so you can find more info there. If you have any doubts, post them here! Feel free to ask me any questions if I wasn't clear in this message!

If you want to run from the master branch to check out the tree structure and the HTML parsing, switch to master, make the project again, and run the same shell script. The output should show you the HTML file contents and the tokens. Play with the HTML files to see how it builds the hierarchy of children, etc.
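To give a rough idea of what "building the hierarchy of children" means, here is a small sketch that builds a tree from HTML using Python's standard `html.parser` module. It is not MunchLex's implementation (the project's own structs are described in its Doxygen comments); `Node`, `TreeBuilder`, `dump`, and `VOID_TAGS` are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay "open".
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    """One element in the tree: a tag name, its text, and its child nodes."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a Node tree by tracking the element that is currently open."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, parent=self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend: later tokens become its children

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, then close it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def dump(node, depth=0):
    """Return the tree as indented lines, one line per element."""
    lines = ["  " * depth + node.tag]
    for child in node.children:
        lines.extend(dump(child, depth + 1))
    return lines


builder = TreeBuilder()
builder.feed("<html><body><h1>Hi</h1><p>One<br>Two</p></body></html>")
print("\n".join(dump(builder.root)))
```

Each opening tag becomes a child of whichever element is currently open, and each closing tag moves back up to the parent, so the indentation in the printed output mirrors the nesting of the input.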

Jainil2004 commented 5 months ago

Alright, understood, but what exactly are we trying to achieve or fix here? I am sorry, but I am new to open source and don't have much knowledge. I am interested in learning and improving, though. Hope you understand. Thank you.

GoncaloMark commented 5 months ago

Hello again @Jainil2004. I had a busy Friday. No need to apologize, I understand. We're all learning and making our own path in the OSS community. I'm actually flattered by your interest in this project.

I started MunchLex because I wanted to build a programming language builder/generic Lexer and Parser, kinda like ANTLR. But once my first and biggest OSS project, the CobWeb web scraper, got to 30 stars here on GH, I found it limiting that I could only scrape HTML sequentially, since Python threads are not truly parallel. So I wanted to build a fast multithreaded HTML parser without lots of bloat.

As for this issue, we want to document the current code by writing an expressive Readme.md with a project description and instructions, such as the different flags, and a section about the tree structure, which will have some search functions. I know it's all a bit confusing, but I'm also kinda lost here still because I just started this project. If you need any more clarification, feel free to ask! I truly understand, and I'm sorry for the frustration this might be causing because the project is still a mess. I'd also love to hear from you about what I could do better to organize this!

Thank you :))

Jainil2004 commented 5 months ago

@GoncaloMark Well, I had also made a small program for web scraping in Python, but the idea of having parallel threads working together on this job never came to my mind, so here I am :) My thinking was that since this is something I've done before, adding multithreading would be a great idea, and I wanted to contribute in any way I could. That's why I picked this issue. So can you tell me how to proceed with the documentation of the project? Also, I tried setting up the project on my machine but was not able to get it up and running, because of Windows and all the problems associated with it. So just tell me how to document this project while I keep trying to figure out why it is not working on my machine. Thank you.

GoncaloMark commented 5 months ago

Hello! Sorry, I just saw your comment. I'm glad you're passionate about web scraping as well. This project was not tested on Windows and probably won't run smoothly on it for now. I'm on an Ubuntu machine, but you can set up WSL2 on Windows and it will run with no problem, because I've tried that. I'd say you can document how the program runs, the multithreaded environment, which flags are mandatory and which aren't, the data structures used, and how to run an example. Also mention that it hasn't been tested on Windows.

Jainil2004 commented 5 months ago

Alright, noted. I will start work on the documentation and will update you once I have made some major progress. Thanks.

Jainil2004 commented 5 months ago

Hi @GoncaloMark, sorry for the long wait, but I've successfully completed the documentation for the project. It took a lot of time because I had trouble setting up the project, and I wanted to use it and understand how it works. I have added documentation for the project based on my understanding and our conversation. It explains the idea behind the project, how it works, and how to set it up. Please have a look and let me know if anything needs to be changed. Once again, I apologize for the delay, and I am grateful to you for giving me this opportunity to be part of this project. Thank you.

GoncaloMark commented 5 months ago

Hello @Jainil2004, how have you been? Sorry for the late response, I was on a pause/vacation. No need to apologize, my friend. I am equally grateful for your interest and passion in taking on this issue! Thank you very much indeed! I will review your PR later because I have some other affairs to take care of right now. I'll get back to you, but I'll link your PR to this issue right now. Thank you for taking care of this issue and working through this documentation step. Hope to collaborate further with you in the future :)

GoncaloMark commented 5 months ago

It's merged! Thank you very much @Jainil2004! Good job! :)

Jainil2004 commented 5 months ago

Thank you so much @GoncaloMark would love to contribute with you again ❤️