TechnikEmpire opened this issue 6 years ago
Any progress on this?
Nope and this probably won't be getting attention for a long time as I have nothing driving that change.
The parsing is HORRIBLY slow. I have packed gqsrc.zip so that it is reproducible. The only thing I changed in the source is to use "en_US.UTF8" instead of "en_US", as described in
Simply unpack everything and compile. The program is very simple and is based on the example code:
#include "Document.hpp"
#include <iostream>
#include <string>
#include <fstream>
#include <chrono>
using namespace std;
std::string loadFileToString(const std::string &filename){
std::ifstream t(filename);
t.seekg(0, std::ios::end);
size_t size = t.tellg();
std::string buffer(size, ' ');
t.seekg(0);
t.read(&buffer[0], size);
return buffer;
}
int main() {
string html = loadFileToString("./html/coinmarketcapAll.html");
std::string &someHtmlString = html;
std::string someSelectorString = "div";
auto testDocument = gq::Document::Create();
auto start= chrono::system_clock::now();
testDocument->Parse(someHtmlString);
auto startcss= chrono::system_clock::now();
try
{
auto results = testDocument->Find(someSelectorString);
const auto numResults = results.GetNodeCount();
cout<<"numResults: "<<numResults<<endl;
}
catch(std::runtime_error& e)
{
throw e;
// Necessary because naturally, the parser can throw.
}
auto end= chrono::system_clock::now();
cout<<"Time Parse : "<<chrono::duration_cast<chrono::microseconds>(startcss-start).count()<< " microseconds."<<endl;
cout<<"Time CSS : "<<chrono::duration_cast<chrono::microseconds>(end-startcss).count()<< " microseconds."<<endl;
return 0;
}
You can compile it in the unpacked directory gqsrc with the command g++ *.cpp -lgumbo (for simplicity, everything from GQ is static, except for libgumbo). If I run the executable with ./a.out, I get:
numResults: 1681
Time Parse : 107849066 microseconds.
Time CSS : 55982 microseconds.
And yes, that is more than 100 seconds using all my CPU resources on an i7 7700HQ!!! If I turn on heavy optimization with g++ *.cpp -lgumbo -flto -Ofast -fwhole-program, ./a.out gives me:
numResults: 1681
Time Parse : 5015653 microseconds.
Time CSS : 7778 microseconds.
which is still a lot. For comparison, the following Julia code does the same (use Pkg.add() to install the three packages):
using Cascadia
using Gumbo
using Requests
r = get("https://coinmarketcap.com/all/views/all/")
@time h = parsehtml(convert(String, r.data));
@time cur = matchall(Selector("div"),h.root);
and takes this amount of time on cold start (Julia is JIT):
julia> @time h = parsehtml(convert(String, r.data));
0.906903 seconds (1.11 M allocations: 68.890 MiB, 2.61% gc time)
julia> @time cur = matchall(Selector("div"),h.root);
1.310899 seconds (1.52 M allocations: 96.578 MiB, 1.70% gc time)
and this amount of time on repeated evaluations:
julia> @time h = parsehtml(convert(String, r.data));
0.243488 seconds (897.90 k allocations: 57.309 MiB, 8.71% gc time)
julia> @time cur = matchall(Selector("div"),h.root);
0.614870 seconds (1.27 M allocations: 82.394 MiB, 13.57% gc time)
This is less than a second. I wanted to switch to C++ for more speed and fewer required resources...
I did the same for gumbo-query and packed gumboquerysrc.zip. With the program
#include <iostream>
#include <string>
#include "Document.h"
#include "Node.h"
#include <fstream>
#include <chrono>

using namespace std;

std::string loadFileToString(const std::string &filename)
{
    std::ifstream t(filename);
    t.seekg(0, std::ios::end);
    size_t size = t.tellg();
    std::string buffer(size, ' ');
    t.seekg(0);
    t.read(&buffer[0], size);
    return buffer;
}

int main(int argc, char* argv[])
{
    string html = loadFileToString("./html/coinmarketcapAll.html");
    std::string &page = html;
    CDocument doc;

    auto start = chrono::system_clock::now();
    doc.parse(page.c_str());
    auto startcss = chrono::system_clock::now();
    CSelection c = doc.find("div");
    auto end = chrono::system_clock::now();

    cout << "numResults: " << c.nodeNum() << endl;
    cout << "Time Parse : " << chrono::duration_cast<chrono::microseconds>(startcss - start).count() << " microseconds." << endl;
    cout << "Time CSS : " << chrono::duration_cast<chrono::microseconds>(end - startcss).count() << " microseconds." << endl;
    return 0;
}
I get the following results. Compiling with g++ *.cpp -lgumbo and running ./a.out gives me:
numResults: 1681
Time Parse : 105975 microseconds.
Time CSS : 23936 microseconds.
and from g++ *.cpp -lgumbo -flto -Ofast -fwhole-program and ./a.out, I get:
numResults: 1681
Time Parse : 104778 microseconds.
Time CSS : 2607 microseconds.
I have to admit that I like the GQ interface with namespaces more than that of gumbo-query, but without a performance improvement, GQ is not competitive.
Please parse with the raw gumbo interface for comparison, and then pass the constructed gumbo output to the appropriate constructor for the gq document. Please benchmark the raw gumbo parsing and the construction of the gq document separately.
Please also post the results of the exact example program on your machine, without modifications (except the locale fix, I guess).
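For the raw gumbo half, a minimal sketch like this is all I mean; it uses only the plain gumbo C API (gumbo_parse_with_options, gumbo_destroy_output, kGumboDefaultOptions), and the gq document construction would then be timed the same way with whatever constructor accepts the GumboOutput:

#include <gumbo.h>
#include <chrono>
#include <iostream>
#include <string>

// Sketch: time ONLY the raw gumbo parse, with no GQ code involved.
void BenchmarkRawGumbo(const std::string& html)
{
    auto start = std::chrono::system_clock::now();
    GumboOutput* output = gumbo_parse_with_options(&kGumboDefaultOptions, html.c_str(), html.size());
    auto end = std::chrono::system_clock::now();

    std::cout << "Raw gumbo parse: "
              << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count()
              << " microseconds." << std::endl;

    // Time the gq document construction from `output` separately here.
    // If the output isn't handed off to something that owns it, clean up:
    gumbo_destroy_output(&kGumboDefaultOptions, output);
}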
Parsing takes a hit because of the tree I build for rapid selector execution, but whatever is happening here obviously isn't right, and I've never seen this before.
Also, you mentioned GQ competing with gumbo-query. It was never meant to be a competition. gumbo-query is so full of bugs I can't even tell you. Just try to run some of the tests I've made here with it; they'll fail entirely.
Using a modified Parse() function (I just inserted the <chrono> header and added three checkpoints and two cout's), we can see that the Init() function is causing the delay.
void Document::Parse(const std::string& source)
{
    auto startparse = std::chrono::system_clock::now();

    // No attempting to parse empty strings.
#ifndef NDEBUG
    assert(source.length() > 0 && (source.find_first_not_of(u8" \t\r\n") != std::string::npos) && u8"In Document::Parse(const std::string&) - Empty or whitespace string supplied.");
#else
    if (source.length() == 0) { throw std::runtime_error(u8"In Document::Parse(const std::string&) - Empty string supplied."); }
#endif

    if (m_gumboOutput != nullptr)
    {
        gumbo_destroy_output(&m_parsingOptions, m_gumboOutput);
    }

    m_gumboOutput = gumbo_parse_with_options(&m_parsingOptions, source.c_str(), source.size());

    // Check if we got coal in our stocking.
    if (m_gumboOutput == nullptr)
    {
        throw std::runtime_error(u8"In Document::Parse(const std::string&) - Failed to allocate GumboOutput.");
    }

    // Check if we got absolutely nothing in our stocking. If we didn't, then santa isn't real.
    if (m_gumboOutput->root == nullptr
        || m_gumboOutput->root->v.element.children.length == 0
        || static_cast<GumboNode*>(m_gumboOutput->root->v.element.children.data[0])->v.element.children.length == 0)
    {
        throw std::runtime_error(u8"In Document::Parse(const std::string&) - Failed to generate any HTML nodes from parsing process. The supplied string is most likely invalid HTML.");
    }

    m_node = m_gumboOutput->root;
    m_treeMap.Clear();

    auto beforeInit = std::chrono::system_clock::now();

    // Must call Init() to build out the index and children.
    Init();

    auto afterInit = std::chrono::system_clock::now();

    std::cout << "Gumbo parsing : " << std::chrono::duration_cast<std::chrono::microseconds>(beforeInit - startparse).count() << " microseconds\n";
    std::cout << "Init() in Document::Parse(): " << std::chrono::duration_cast<std::chrono::microseconds>(afterInit - beforeInit).count() << " microseconds\n";
}
The new main.cpp I am using is:
#include "../include/Document.hpp"
#include <iostream>
#include <string>
#include <fstream>
#include <chrono>
using namespace std;
std::string loadFileToString(const std::string &filename){
std::ifstream t(filename);
t.seekg(0, std::ios::end);
size_t size = t.tellg();
std::string buffer(size, ' ');
t.seekg(0);
t.read(&buffer[0], size);
return buffer;
}
int main(int argc,const char** argv) {
if (argc!=2) {
cout<<"Usage: "<<argv[0]<<" url "<<endl;
return 0;
}
string filename{argv[1]};
cout<<filename<<endl;
// return 0;
string html = loadFileToString(filename);
std::string &someHtmlString = html;
std::string someSelectorString = "div";
auto testDocument = gq::Document::Create();
auto start= chrono::system_clock::now();
testDocument->Parse(someHtmlString);
auto startcss= chrono::system_clock::now();
try
{
auto results = testDocument->Find(someSelectorString);
const auto numResults = results.GetNodeCount();
// cout<<"numResults: "<<numResults<<endl;
}
catch(std::runtime_error& e)
{
throw e;
// Necessary because naturally, the parser can throw.
}
auto end= chrono::system_clock::now();
// cout<<html.size()<<", "<<
cout<<"Time Parse() : "<<chrono::duration_cast<chrono::microseconds>(startcss-start).count()<< " microseconds."<<endl;
cout<<"Time CSS selection : "<<chrono::duration_cast<chrono::microseconds>(end-startcss).count()<< " microseconds."<<endl;
cout<<"Length of HTML string : "<<html.length()<<endl;
return 0;
}
Note that I cannot use the exact example program because it does not provide the functionality to load a string from a file and it does not have the timing checkpoints. But you can see that the program I use is basically the exact example program plus time measuring and the ability to load a string from a file, so nothing special. I packed it in gqsrc.zip with three test HTML files, which can be passed as the first command line argument to the executable generated with g++ *.cpp -lgumbo in the unpacked directory. The results are:
For a large HTML (2.5MB)
Gumbo parsing : 104161 microseconds
Init() in Document::Parse(): 104796457 microseconds
Time Parse() : 104900633 microseconds.
Time CSS selection : 58580 microseconds.
Length of HTML string : 2530379
For a mid HTML (160KB)
Gumbo parsing : 29905 microseconds
Init() in Document::Parse(): 250702 microseconds
Time Parse() : 280621 microseconds.
Time CSS selection : 694 microseconds.
Length of HTML string : 158461
For a small HTML (75KB)
Gumbo parsing : 4480 microseconds
Init() in Document::Parse(): 48563 microseconds
Time Parse() : 53059 microseconds.
Time CSS selection : 410 microseconds.
Length of HTML string : 75829
I think the algorithm used in Init() is not efficient for large HTMLs.
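A quick sanity check on the numbers above: going from the mid file to the large one, the input grows about 16x (158461 to 2530379 bytes), while the Init() time grows about 418x (250702 to 104796457 microseconds); going from the small file to the mid one, a roughly 2x larger input takes roughly 5x as long. So Init() looks at least quadratic in document size.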
@phinzphinz Ah ok so there it is, it's the size variation.
That whole tree system I made was sort of whipped up in the moment without any planning. I'm thinking I need to just blow it away and let the queries execute dynamically like gumbo-query does.
Thanks for the reports and for doing the testing. I'm pretty piled up with work right now, but I'll see if I can't get around to looking at this. Pretty sure my goal at this point is going to be to delete the Tree class files and fix things until it compiles/works again.
@TechnikEmpire I'm trying to download, parse and extract information from http://vip.stock.finance.sina.com.cn/corp/go.php/vRPD_NewStockIssue/page/1.phtml.
Here's the code:
auto document = gq::Document::Create();
document->Parse(utf8Body);
stopWatch.Stop();
logger_.Debug(WStringUtils::Format(L"Parsed html in [%.2fms]", stopWatch.GetLastTaskTimeMillis()));

stopWatch.Start();
std::vector<StockInfo> stockInfos;
auto trNodes = document->Find(GetSelector("#NewStockTable tbody tr"));
for (auto i = 2U; i < trNodes.GetNodeCount(); i++) {
    StockInfo stockInfo;
    auto trNode = trNodes.GetNodeAt(i);
    stockInfo.SetSymbol(trNode->GetChildAt(0)->GetText());
    stockInfo.SetPurchaseSymbol(trNode->GetChildAt(1)->GetText());
    stockInfo.SetName(WStringUtils::TrimSpace(trNode->GetChildAt(2)->GetText()));
    auto issueDateStr = trNode->GetChildAt(3)->GetText();
    auto issueDate = DateUtils::ParseAsLocal(issueDateStr);
    stockInfo.SetIssueDate(issueDate);
    auto issuePrice = StringUtils::To<double>(trNode->GetChildAt(7)->GetText());
    stockInfo.SetIssuePrice(issuePrice);
    stockInfos.push_back(stockInfo);
}
stopWatch.Stop();
logger_.Debug(WStringUtils::Format(L"Extracted [%d] new stocks in [%.2fms]", stockInfos.size(), stopWatch.GetLastTaskTimeMillis()));
Here's the log:
2018-05-27 23:03:42.496175 [0x00004290] debug - Downloaded html in [99.00ms]
2018-05-27 23:03:53.082191 [0x00004290] debug - Parsed html in [10584.00ms]
2018-05-27 23:03:55.506796 [0x00004290] debug - Extracted [40] new stocks in [2422.00ms]
The HTML is about 442KB. It takes 10 seconds to parse the HTML and 2.4 seconds to extract the information. This is too slow for my application.
Ok I'm going to actually look at this today because that's brutal. I merged in a PR or two without testing, that's about all I can figure in terms of what went wrong.
I've been looking at myhtml, and it has a selector engine in Modest. This project might get killed and become a wrapper around that if this issue is that serious. Last time I benchmarked this code, it took 83 msec to parse Yahoo's landing page and build the selector tree/cache. So I dunno what's happened, but I'll render some action against this issue at some point today.
Is everyone here testing on Linux?
I'm using GQ on windows.
@sinall Thanks.
Ok folks well, you have to weigh in. I can bring parsing back to basically raw gumbo parsing speed, but the consequence is that the super fast selector system is RIP in peace.
Time taken to run 27646 selectors against the document: 75036.2 ms producing 1685 total matches.
Processed at a rate of 2.71418 milliseconds per selector or 0.368436 selectors per millisecond.
Benchmarking mutation.
Time taken to run 27646 selectors against the document while serializing with mutations 1 times: 74865.3 ms.
Time per cycle 74865.3 ms.
Processed at a rate of 2.708 milliseconds per selector or 0.369277 selectors per millisecond.
Not included here, the parsing stage was ~180 msec IIRC.
Using @phinzphinz's huge (>2MB) HTML sample he submitted, those are the stats. It takes nearly 3 msec to run a single selector through the entire parsed body.
I guess it's up to what people want. To me, this makes the whole project garbage and a waste of time, but that's just my opinion.
@sinall Oh, definitely don't use debug lol, debug is absolutely horridly slow. Test speeds in release. Anyway I'm working on this.
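For anyone benchmarking g++ builds from the command line: g++ compiles with no optimization (-O0) by default, and the numbers earlier in this thread show how large that gap can be. For example:

g++ *.cpp -lgumbo         // default -O0: fine for debugging, misleading for benchmarks
g++ *.cpp -lgumbo -O2     // optimized build: benchmark this one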
@TechnikEmpire I tried release mode, here's the log:
2018-05-30 11:05:18.679213 [0x00002a78] debug - Downloaded html in [129.00ms]
2018-05-30 11:05:18.938386 [0x00002a78] debug - Parsed html in [259.00ms]
2018-05-30 11:05:19.022441 [0x00002a78] debug - Extracted [40] new stocks in [83.00ms]
This is acceptable for me at the moment, and I'm looking forward to your fast selector system in the future.
@sinall definitely there is an issue that is killing performance and I have identified it. I'm working on rewriting this stuff but it's core functionality so it's extensive work.
In the meantime, smaller HTML shouldn't be a big deal.
My advice to everyone here is to switch to myhtml (or whichever bundle of his has selectors). The design of Gumbo is such that the internals are kept away from the user entirely, and to make this function properly, without a slowdown as HTML size increases, I need to make invasive changes to Gumbo. Frankly, I can't decide which is more painful: writing my own compliant parser based on the current spec in C++, or being forced to maintain an abandoned C library written with precisely 0 comments detailing anything.
It would seem that our tree building process is quite expensive. IIRC, when I looked into this some time ago, it wasn't gumbo's parsing speed really dragging things down, but rather the tree building we do post-gumbo parsing.
Should be able to speed that up.
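For the record, a single recursive pass over the gumbo tree is linear in the node count, so the post-parse stage doesn't have to degrade with document size. A minimal sketch of that shape (the node counting is just a stand-in for the per-node indexing work, not GQ's actual code):

#include <gumbo.h>
#include <cstddef>

// Sketch: one O(n) walk over the parsed gumbo tree. As long as the work
// done per node is constant (no per-node copies of growing containers),
// post-parse tree building stays linear in document size.
static std::size_t IndexSubtree(const GumboNode* node)
{
    // Only element and template nodes carry child lists.
    if (node->type != GUMBO_NODE_ELEMENT && node->type != GUMBO_NODE_TEMPLATE)
    {
        return 1;
    }

    std::size_t count = 1;
    const GumboVector& children = node->v.element.children;
    for (unsigned int i = 0; i < children.length; ++i)
    {
        count += IndexSubtree(static_cast<const GumboNode*>(children.data[i]));
    }

    return count;
}

Called as IndexSubtree(m_gumboOutput->root) after parsing, this touches every node exactly once.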