TechnionYP5777 / Bugquery

Bug query

Thinking group - Explore what PL we should use for parsing the web #98

Closed yonzarecki closed 7 years ago

yonzarecki commented 7 years ago

Check which programming language is best for parsing web pages when looking for traces on the web (as we discussed for the future). It could be Python (a strong contender), Java, or C#, whichever fits best.

yonzarecki commented 7 years ago

Looking into Python HTML parsing libs. (I'll keep updating my findings in this comment.) From the standard library we have HTMLParser, which looks pretty convenient. BeautifulSoup is also quite popular, and I've heard of it before.
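To make the HTMLParser option concrete, here is a minimal sketch using only the standard library. It assumes the traces we care about are posted inside `<pre>` or `<code>` elements, which is a guess about typical pages and not something we have verified. The names `CodeBlockExtractor` and `extract_code_blocks` are made up for this illustration.

```python
from html.parser import HTMLParser


class CodeBlockExtractor(HTMLParser):
    """Collect the text found inside <pre> and <code> elements.

    Assumption: traces are posted inside these tags.
    """

    TARGET_TAGS = {"pre", "code"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._depth = 0      # how many target tags are currently open
        self._buffer = []    # text pieces of the block being read
        self.blocks = []     # finished blocks, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.TARGET_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self.TARGET_TAGS and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # The outermost target tag just closed: store the block.
                self.blocks.append("".join(self._buffer))
                self._buffer = []

    def handle_data(self, data):
        # Keep text only while we are inside a target tag.
        if self._depth > 0:
            self._buffer.append(data)


def extract_code_blocks(html):
    """Return the text of every top-level <pre>/<code> block in `html`."""
    parser = CodeBlockExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Nested tags such as `<pre><code>...</code></pre>` are returned as a single block. BeautifulSoup would likely make this shorter, but it is a third-party dependency.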

tonylekhtman commented 7 years ago

I have some experience with parsing in Python from other projects. I used lxml with its XPath support. Pretty convenient.

yonzarecki commented 7 years ago

For Java I've used HtmlUnit in the past. It's a GUI-less browser that is also capable of basic web-page parsing, which was sufficient at the time.

For more advanced HTML parsing, I looked into these SO questions: link, link, link

The most popular suggestion is JSoup. Another suggestion is JTidy, which handles broken HTML but is less established than JSoup. In my opinion JSoup is the better choice, maybe combined with HtmlUnit to handle the web page accesses.

ZivIzhar commented 7 years ago

What's the point in assigning me to that if you're already doing that? :/ Anyways, I went (in vain) over some parsing libraries.

Basically, our two main options are Java or Python. In general, most people use JSoup (Java). There are many Python web parsers, but the best is probably BeautifulSoup.

Since most of our project is in Java, and there is no obvious advantage to Python, I think we should stick with Java.

Regarding HtmlUnit and JTidy - these aren't meant for data extraction. link link

JSoup has a very easy-to-use API and is super powerful for web parsing. After going over several web parsing libraries, I think JSoup is the best for us, since it has everything we'll probably need.

ZivIzhar commented 7 years ago

Here's a link that lists many HTML parsers: link

yonzarecki commented 7 years ago

Well, I assigned both of us so that we'd both go over the subject, and the fact that we reached the same conclusion makes it even stronger. That's the strength of the thinking group, in my opinion.

yonzarecki commented 7 years ago

Quick look into C#: it doesn't seem to have an exceptionally strong parsing lib, and since our team prefers Java and Python, we probably won't use it.

So to summarize: if we want to incorporate the parsing core into the rest of our code, we can and should use Java. However, if we want a run-once solution, Python will probably be better because it's faster to code in.
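To give a feel for the run-once option, here is a short Python sketch of the trace-detection step. It assumes "traces" means Java stack traces whose frames look like `at package.Class.method(File.java:123)`. The regex and the name `find_stack_frames` are illustrative assumptions, not something we have agreed on.

```python
import re

# Assumed frame shape: optional indentation, "at", a dotted method name,
# then a parenthesised location such as "Foo.java:42" or "Native Method".
FRAME_RE = re.compile(
    r"^[ \t]*at[ \t]+([\w$.<>]+)\(([^()\n]*)\)[ \t]*$",
    re.MULTILINE,
)


def find_stack_frames(text):
    """Return (method, location) pairs for every frame-like line in `text`."""
    return [(m.group(1), m.group(2)) for m in FRAME_RE.finditer(text)]


if __name__ == "__main__":
    sample = (
        'Exception in thread "main" java.lang.NullPointerException\n'
        "\tat com.example.Foo.bar(Foo.java:42)\n"
        "\tat com.example.Main.main(Main.java:7)\n"
    )
    for method, location in find_stack_frames(sample):
        print(method, location)
```

A script like this is quick to write and throw away. If the detection logic has to live inside the rest of the project, the same pattern could be ported to Java.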