farhan-helmy / carisurau

https://carisurau.com
MIT License
66 stars, 36 forks

gmap_scrape not yet functioning #96

Closed Ny0ttt closed 9 months ago

allaboutevemirolive commented 10 months ago

Hi @Ny0ttt, are you still working on this? I would like to review the code, whether it is ready or still underway, since I have experience with Playwright.

Ny0ttt commented 10 months ago

@allaboutevemirolive Hi. Yes, I am still working on it. I will share the code file that I am working on and list out the details. I am sorry for taking so long and not finishing the work. I appreciate the help.

Ny0ttt commented 10 months ago

@allaboutevemirolive Alright, I have updated the file with my latest work. Please refer to the comments in the file for the workflow I am following. I am sorry for any difficulties I have caused. I really appreciate the help. I will leave this to you. Thank you.

Ny0ttt commented 10 months ago

@allaboutevemirolive Hello. Sorry, just to confirm: you will be taking over from now, right? If I would like to pick the work up again, should I wait for your update or should I just continue from where I left off? Sorry for any inconvenience 🙏🏻

allaboutevemirolive commented 10 months ago

Hi @Ny0ttt , I'm sorry for the inconvenience, too. You can continue where you left off. React is not my daily tool, but I have seen your code, and if I understand correctly from your code, you are trying to do real-time scraping, right?

From my experience, real-time scraping doesn't work reliably: the website's HTML changes unpredictably, and scraping on demand takes a lot of time. A more practical approach is to scrape the data in advance and then feed it into the database, which also lets you process the data further before it is stored.
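A minimal sketch of that scrape-then-load idea, under stated assumptions: the records in `RAW_RECORDS`, the `surau` table, and the `clean` and `load` helpers are all hypothetical and are not taken from the CariSurau codebase. It only illustrates cleaning scraped rows offline before they reach the database.

```python
import sqlite3

# Hypothetical raw records, standing in for the output of an offline scraping run.
RAW_RECORDS = [
    {"name": "  Surau Al-Ikhlas ", "lat": "3.1390", "lng": "101.6869"},
    {"name": "Surau Al-Ikhlas", "lat": "3.1390", "lng": "101.6869"},  # duplicate
    {"name": "Surau An-Nur", "lat": "", "lng": "101.70"},  # missing latitude
]


def clean(records):
    """Trim names, drop rows without usable coordinates, and de-duplicate."""
    seen = set()
    cleaned = []
    for rec in records:
        name = rec.get("name", "").strip()
        try:
            lat = float(rec["lat"])
            lng = float(rec["lng"])
        except (KeyError, ValueError):
            continue  # skip rows whose coordinates are missing or malformed
        key = (name.lower(), round(lat, 5), round(lng, 5))
        if not name or key in seen:
            continue
        seen.add(key)
        cleaned.append((name, lat, lng))
    return cleaned


def load(conn, rows):
    """Insert cleaned rows into an assumed `surau` table; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS surau ("
        "name TEXT NOT NULL, lat REAL NOT NULL, lng REAL NOT NULL, "
        "UNIQUE(name, lat, lng))"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO surau (name, lat, lng) VALUES (?, ?, ?)", rows
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM surau").fetchone()[0]


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(load(connection, clean(RAW_RECORDS)))
```

Because the cleaning step runs before any insert, a change in the scraped page's HTML only affects the offline run, not what users see.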

Ny0ttt commented 10 months ago

@allaboutevemirolive Actually, I am not building it as real-time scraping. I consulted one of the owners (or main collaborators) and decided not to make it real-time, since I am new here.

Alright, I will continue from where I left off. Thank you again 👍🏻

Xavier-IV commented 9 months ago

Hi, I see this PR is still open which is great!

I would suggest keeping it as a draft PR (GitHub has an option to mark a PR as draft).

allaboutevemirolive commented 6 months ago

Trying to analyze this approach again. The idea for web scraping is fine, but there are several factors that need to be considered.

For mass scraping, using web automation tools like Selenium and Playwright is not the most suitable option. It's better to use specialized tools like Apache Nutch and parallelize the tasks at some point.

On the other hand, rather than scraping data with images, it might be more convenient to populate the database with data excluding images.

For example, if a user asks for the nearest surau, CariSurau will only provide a list of processed data, the surau status (whether it's open 24 hours or not), and the route to the nearest surau. If a user wants to view surau images, CariSurau will provide a link to the images on Google Maps. Of course, this approach will require the maintainers to redesign how CariSurau should appear, whether as a website or in another form. I believe that if CariSurau has a WhatsApp-like version, end users will benefit from this strategy.

This way, we can avoid scraping every image directly.
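As a rough illustration of the image-free idea, the sketch below ranks suraus by distance and returns a Google Maps link in place of any stored image. The `Surau` fields, the sample coordinates, and the `nearest` and `maps_link` helpers are assumptions made for this example, not CariSurau's actual schema or API; the link uses the Google Maps URLs search format (`api=1` with a `query` parameter).

```python
import math
from dataclasses import dataclass
from urllib.parse import urlencode


@dataclass
class Surau:
    # Hypothetical record shape: text fields only, no image data.
    name: str
    lat: float
    lng: float
    open_24h: bool


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two lat/lng points."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lng2 - lng1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * radius_km * math.asin(math.sqrt(a))


def maps_link(surau):
    """Build a Google Maps search link so photos are viewed there, not scraped."""
    params = urlencode({"api": 1, "query": f"{surau.lat},{surau.lng}"})
    return "https://www.google.com/maps/search/?" + params


def nearest(user_lat, user_lng, suraus, limit=3):
    """Return the closest suraus with status, distance, and a Maps link."""
    ranked = sorted(
        suraus, key=lambda s: haversine_km(user_lat, user_lng, s.lat, s.lng)
    )
    return [
        {
            "name": s.name,
            "open_24h": s.open_24h,
            "distance_km": round(haversine_km(user_lat, user_lng, s.lat, s.lng), 2),
            "maps_link": maps_link(s),
        }
        for s in ranked[:limit]
    ]


if __name__ == "__main__":
    sample = [
        Surau("Surau B", 3.2000, 101.7000, False),
        Surau("Surau A", 3.1400, 101.6870, True),
    ]
    for row in nearest(3.1390, 101.6869, sample):
        print(row)
```

A linear scan like this is fine for a small list; a larger dataset would probably want a spatial index or a database-side distance query.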

farhan-helmy commented 6 months ago

Thanks for the insight. To be honest, I do not have a clear strategy for this. I give the maintainers 100% freedom to take their own approach for the sake of learning. Great information, boss! I love your idea of how to cover both parties (end users and devs) for this feature :D @allaboutevemirolive