yonzarecki closed 7 years ago
Looking into Python HTML parsing libraries (I'll keep updating my findings in this comment). The standard library has HTMLParser, which looks pretty convenient. BeautifulSoup is also quite popular, and I've heard of it before.
I have some experience parsing with Python from other projects. I used lxml with XPath, which was pretty convenient.
For Java, I've used HtmlUnit in the past. It is a GUI-less browser that can also do basic parsing of web pages, which was sufficient at the time.
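To give a feel for the standard-library option, here is a minimal sketch of pulling link targets out of a page with `html.parser.HTMLParser`. The `LinkCollector` class, the `extract_links` helper and the sample markup are made up for illustration, they are not part of our code, and a real page would need more care (relative URLs, encodings, fetching).

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling us;
        # attrs is a list of (name, value) pairs, value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Return all href values found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Illustrative input only.
sample = '<p>See <a href="https://example.com/a">A</a> and <A HREF="/b">B</A>.</p>'
print(extract_links(sample))  # ['https://example.com/a', '/b']
```

The event-callback style works for small jobs like this, but anything that needs to navigate the document tree or use selectors would probably be less code with BeautifulSoup or lxml.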
For more advanced HTML parsing, I looked into these SO questions: link, link, link
The most popular suggestion is JSoup. Another suggestion is JTidy, which handles broken HTML but is less established than JSoup. In my opinion JSoup is the better choice, maybe combined with HtmlUnit to handle fetching the web pages.
What's the point of assigning me to this if you're already doing it? :/ Anyway, I went (in vain) over some parsing libraries.
Basically, our two main options are Java and Python. In general, most people use JSoup (Java). There are many Python web parsers, but the best is probably BeautifulSoup.
Since most of our project is in Java and Python has no obvious advantage, I think we should stick with Java.
Regarding HtmlUnit and JTidy: these aren't meant for data extraction. link link
JSoup has a very easy-to-use API and is super powerful for web parsing. After going over several web parsing libraries, I think JSoup is the best for us, since it has everything we'll probably need.
Well, I assigned both of us so that we'd both go over the subject, and the fact that we reached the same conclusion makes it even stronger. That's the strength of the thinking group, in my opinion.
A quick look into C#: it doesn't seem to have an exceptionally strong parsing library, and since our team prefers Java and Python, we probably won't use it.
So, to summarize: if we want to integrate the parsing core into the rest of our code, we can and should use Java. However, if we want a run-once solution, Python will probably be better because it is faster to write.
Check which programming language is best for parsing the web when looking for traces on the web (as we discussed for the future). It can be Python (a strong contender), Java, or C#, whichever fits best.