Dingo64 opened 9 years ago
I am trying to save a webpage to file:

./edbrowse https://example.com/ > out.htm

This prints:

no ssl certificate file specified; secure connections cannot be verified
15848
Unable to exec edbrowse-js, javascript has been disabled.
1351

Of course edbrowse-js is in the same directory and has exec rights.
I notice you exec via ./edbrowse; perhaps your $PATH does not contain "." (the current directory). They sometimes don't out of the box, in which case the execlp() call would fail.
You say this is an attempt to "save a web page to file", but out.htm would simply contain the output of edbrowse. You want to bring in example.com, then from within the edbrowse session "w out" would save the formatted web page, or perhaps "ub" then "w out" if you want the raw html.
Karl Dahlke
Thanks, I did export PATH and now this error is gone. But can I just use it like wget? Download a file and save the final output (after running JS) to file?
can I just use it like wget? Download a file and save the final output (after running JS) to file?
Yes, you can save the original html or the formatted text. w/ is a convenient command; it saves to the filename part of the url.
Karl Dahlke
Thanks! Can I do this non-interactively? Like edbrowse http://google.com -w output.htm?
Can I do this non-interactively? Like edbrowse http://google.com -O output.htm
Well not like that.
edbrowse http://google.com <<!
w google-home.browse
ub
w google-home.htm
q
!
You can also write edbrowse scripts in $HOME/.ebrc to do various tasks, somewhat like shell functions in $HOME/.bashrc. See the sample config file in the documentation, or the edbrowse wiki on github under user CMB.
Karl Dahlke
Unfortunately the unbrowse command ub might not quite do what you want, if you wanted to see the results of having the Javascript run. Unbrowse is like the View Source command in a graphical browser: it shows you the original page source. It does not show you the modified version of the DOM tree after the scripts have run. Yes, the formatted text shows the Javascript result, but the unbrowsed version just shows you the original source.

If by "like wget" you meant "like wget plus Javascript DOM changes put back into the source", that's more complex. And no, it won't work to inject an extra piece of Javascript into the page like

document.body.innerHTML=document.all[0].outerHTML.replace(/&/g,'&amp;').replace(/</g,'&lt;').replace(/\n/g,'<br>')

(which is supposed to read the DOM back into source form and format it so that the formatted display will be the DOM markup); the reason this won't work is that edbrowse's DOM support is not complete enough. To clarify, the Javascript engine behind edbrowse is the same one that runs Firefox, but that's only the Javascript engine, not the DOM. The Mozilla SpiderMonkey Javascript engine provides the Javascript interpreter, but the DOM itself still has to be provided by edbrowse, and if we look into the edbrowse source at src/jseng-moz.cpp we can see the JS_DefineProperty call for innerHTML rigs up a "setter" but not a "getter". This means edbrowse has write-only support for the innerHTML property; attempting to read back an element's innerHTML will get an empty string. And the outerHTML property is not supported at all.

If I'm doing Web programming and I ever need to do a "quick hack" along the lines of x.innerHTML = x.innerHTML.replace(y,z)
I always try to remember to enclose it in an if (x.innerHTML), to verify that we have both read and write support for innerHTML, because if any SpiderMonkey-derived browser has write-only support for this property then a read-modify-write would become a delete.
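Spelled out as a minimal sketch (x, y and z here are placeholders, not real page objects):

// under write-only innerHTML, the read yields "" and the write then erases the content:
x.innerHTML = x.innerHTML.replace(y, z);   // read-modify-write becomes a delete
// the defensive form only rewrites when innerHTML reads back non-empty:
if (x.innerHTML)
    x.innerHTML = x.innerHTML.replace(y, z);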
If you really want to inject Javascript into the current version of edbrowse to give you the DOM, you'll have to write a rather roundabout script that walks through the DOM nodes itself, using only the features that edbrowse already implements, building up the markup string as it goes; but if you're going to go to that much effort then you might almost as well do it in C and thereby contribute innerHTML read support to edbrowse. (By the way, I'm not sure I'd be the best one to code this, because I did the exact same job for a commercial company 10 years ago and I don't want to raise questions about whether my contribution somehow tainted the free code base. But I can still sit here and point it out.)
In the meantime there is PhantomJS, which has more complete DOM support but is not as lightweight as edbrowse. For example, in Python (adapted from Web Adjuster):
from selenium import webdriver   # Python 2 example; needs an older Selenium that still ships the PhantomJS driver
import time
wd = webdriver.PhantomJS(service_args=['--ssl-protocol=any'])
wd.get(url)    # url = the page you want rendered
time.sleep(2)  # wait for onTimeout events
print wd.find_element_by_xpath("//*").get_attribute("outerHTML").encode('utf-8')
wd.quit()
But none of these considerations apply if you merely wanted to view the formatted text of pages that don't need things like read access to innerHTML, and you don't need to see a markup representation of the modified DOM but just want to read what the page says: in that case edbrowse should be just fine.
Well there's a lot of information here. First, with all your skills and experience, I wish we could recruit you as a programmer. It's just a couple of us, spare time, no compensation, etc.
We're looking at phantom js, but it's almost a start-over, and I don't know that any of us has the time for that. I have so many personal issues right now, I barely have time to write this email.
I don't think I follow when you say innerHTML is not implemented. You can read it and write it. I just verified this with my test programs and jdb. I pushed a button which changes the innerHTML under a
Karl Dahlke
Ah, I misremembered the order of parameters when I saw the call to JS_DefineProperty. Should have double-checked. It does in fact give innerHTML an initial value of empty, but does not rig up a getter to return empty; the getter is left as null and the setter is rigged up. That means, for one thing, you can set innerHTML to any value you like and then read back the value you set. But the next question is: can you read its value before you even set it? And it turns out you can, in some cases, but not all. The code I should have looked at is src/decorate.c and its calls to the function establish_inner. This is called only for 9 specific element types, namely,
input, td, div, object, span, sup, sub, ovb, and P.
Any element whose type is in this list will have a correct initial value of innerHTML, but any other element (including document.body) will not have an initial innerHTML.
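So a quick check in jdb might look like this (a hypothetical session, assuming the page contains a div and that getElementsByTagName is available):

// div is in the list above, so its initial innerHTML is established:
document.getElementsByTagName("div")[0].innerHTML
// body is not in the list, so this reads back empty:
document.body.innerHTML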
For more completeness, I would suggest deleting the 6 calls to establish_inner in decorate.c's switch statement, and instead adding a catch-all after the final close brace of that switch block, like this:
establish_inner(t->jv, t->value, 0, action == TAGACT_INPUT);
That should make innerHTML work for a lot more elements. It would of course take up more memory and slow us down a little bit, but not as much as PhantomJS, and it would buy compatibility with more Javascripty sites.
Your idea of calling establish_inner after the switch, to cover all tags, seems reasonable, but for caution's sake I will wait until we have stamped a new version, which we expect to do in a week or so. After that I'll put this modest yet important change at the top of the list.
Discussions like this should probably be on the developers mailing list, Edbrowse-dev@lists.the-brannons.com. Not all developers see these github messages, and I'm sure they want to be in the loop.
Karl Dahlke
OK, I'll see if I can sign up to that list at some point. Not today though, as I am a bit overloaded at the moment. One more thing I should mention though is that edbrowse's support of default innerHTML is also limited by length; if the content is too long, innerHTML is not set at all. It's not immediately obvious from the code where this length limit comes from. What it means is that scripts that try to do "search and replace" on the entire document by accessing a wrapper element's innerHTML will fail unless it is a short test document. It also means we cannot get a DOM tree out of current versions of edbrowse simply by adding a DIV element around the entire body, in case anyone was thinking of trying that.
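To illustrate the failing pattern (a hypothetical page; the id "all" and the search pattern are made up):

// page source: <body><div id="all"> ... the entire page ... </div></body>
var all = document.getElementById("all");
var src = all.innerHTML;   // reads back "" once the document exceeds the length limit
if (src)                   // the guard discussed earlier in the thread
    all.innerHTML = src.replace(/foo/g, "bar");   // whole-document search and replace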
innerHTML is limited by length,
Fixed. There is a small impact on performance, which I described on list.
Karl Dahlke
Update: in edbrowse 3.7.4, document.body.innerHTML works (and you can access it via the new jdb command after loading a page, which essentially takes you into a Javascript console), but innerHTML does not reflect the DOM changes made by scripts as it does in graphical browsers (for example, if an inline script has called document.write("2+2="+(2+2)) then this will not cause 2+2=4 to appear in document.body.innerHTML), and outerHTML remains undefined. And it's difficult to implement your own in jdb: you can walk through firstChild and nextSibling looking at nodeType, nodeName and nodeValue, but:

- getAttributeNames() is not yet implemented, and the attributes property of nodes does not yet define length. You can use getAttribute() if you know the attribute name, but you cannot get a list of attribute names, so at best you'll end up having a DOM tree with all the attributes missing.
- document.write has been known to break nextSibling links, which can result in parts of your DOM tree falling off.

So as per previous comments on this thread, you can write the original page source to a file, or you can write the final version of the rendered text to a file, but there is not yet a way to write out a with-markup version of the DOM after it has been changed by Javascript.
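To make the walking idea concrete, here is a bare-bones serializer sketch one could paste into jdb (illustrative only: per the limitations above it omits all attributes, does not re-escape text, and pages that used document.write may come out incomplete):

// build a markup string using only firstChild, nextSibling, nodeType, nodeName, nodeValue
function dump(n) {
    if (n.nodeType == 3)       // text node
        return n.nodeValue;
    if (n.nodeType != 1)       // skip comments and other node types
        return "";
    var tag = n.nodeName.toLowerCase();
    var s = "<" + tag + ">";   // attributes omitted: their names cannot be enumerated
    for (var c = n.firstChild; c; c = c.nextSibling)
        s += dump(c);          // may lose siblings where document.write broke the links
    return s + "</" + tag + ">";
}
dump(document.body);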
There are many topics in this thread. This reply only addresses one of them. I implemented element.getAttributeNames, because it was easy to do. I'll look at the other issues later.
Karl Dahlke
I have a query. I want to download a webpage with javascripts using edbrowse, to make an offline copy. How can I achieve this? When I browse that site, the javascript content is not loaded; no text or links are loaded.
I'm not sure what you are asking.
If you want a local copy of a web page, with local javascript files and local css files, it is theoretically possible, we do this a lot when debugging, but it's not easy, has some caveats, and most users don't do that. Call up the debugging page in the edbrowse wiki, and look for the word snapshot.
If you're just saying there's a web page wherein js isn't working properly, well there are a lot of those, let us know which one and we'll add it to the list.
Karl Dahlke
I am looking for a way to archive/backup a fully javascript-DOM-loaded website for offline backup. edbrowse is the only text-based browser that supports javascript. So what should the command be?
I am looking for a way to archive/backup a fully javascript-DOM-loaded website
This is a complicated question, and we would have to start with some requirements.
There are crawlers that gather all the html files that are directly linked by <a href=> and <iframe src=> tags. google and other search engines have gone beyond this, because some html is brought in dynamically by scripts. I'm guessing you would want something like that.
Sounds like you want more than just the html files, but also to archive the js files.
What about the css files?
What about the json files? In general, json is fetched dynamically, by scripts; this also happens for other scripts and sometimes even html, so it is not unusual. However, json is often timely, like the articles of the day, or other information that is topical, relevant today but perhaps not tomorrow. Example: nasa.gov presents only a template, then fetches its articles and other things as json files through xhr and pastes them in place. So I'm guessing json files are not to be archived. That would make it easier.
Then there is the question of archiving files from your website only, or all files referenced. A website often accesses common libraries from other domains, e.g. css fonts that google provides as public, or public jquery libraries. Would you want to archive these off-site files, or just the files that are on the domain of the main html page, on the same web server if you will?
Some javascript, and this is sadly more and more common, uses timers and promise jobs to fetch follow-on html or javascript or json data. So you have to allow those timers to run. In other words, it can never be a single command to edbrowse to do this. You might have to send it commands, with a call to sleep in just the right place, so that the timers can run and the additional scripts or html can be fetched. A human naturally pauses until he feels the page has been fully loaded, but that's hard to automate. Err on the conservative side, I suppose, and sleep for 30 seconds, and hope for the best - but edbrowse can be slow at times, and combined with internet delays, sometimes even 30 seconds isn't enough.
As I mentioned, when you think the page is loaded, you can enter the commands jdb and snapshot() and you will have local copies of all the css and js that were used to build that page, plus a jslocal file to map those to the urls where they came from. Or, browse with db3 and scrape the output for javascript source, css source, *redirect, xhr send, and other keywords, capture the urls on those lines, then use curl or wget to download all those files and put them in whatever names and locations you wish.
However you do it, this is just one page. It's not a crawler. I don't follow the <A href=> tags to pull down other pages, and then the javascript that those pages might employ, which is sometimes the same js files and sometimes not. I don't know if you want all the pages that might be reasonably referenced by this page, or just this page.
In the end I think it's a nontrivial development project, for which edbrowse is a great start, but perhaps only a start. I'd need to know more about what you want to do, as per the questions in this writeup, and even then I don't think I have the time to take it on, but I can certainly consult.
Karl Dahlke
Thanks. What I want is to execute the external javascript that was in the html's <script src=> tags, update the DOM accordingly, and then scrape the final updated html/DOM.
Well that is considerably simpler than the project I was imagining. A script like this might be a start.
( echo showall+
echo "b $1"
sleep 30
echo ,p
echo q ) | edbrowse > outfile
Then do whatever you wish with the output file.
You'll recognize ,p as the ed command to print the entire page. And q of course to quit.
But there are a lot of caveats. There are still a lot of websites wherein edbrowse doesn't handle the javascript very well. And I already commented on timers updating the page, thus sleep 30 to give the timers time to run. I tested this on http://www.mathreference.com, which isn't a great test because it doesn't use much javascript, but it does use some. You can test it on whatever.com and play around with it.
Karl Dahlke
If you want to back-convert the final rendered DOM into HTML, so for example if the site says

<script>var a=document.createElement("a");a.setAttribute("href","http"+":/"+"/"+"www.example.com");a.innerText="hi";document.body.appendChild(a)</script>

and you want the output to be <a href="http://www.example.com">hi</a>, then I don't think Edbrowse can do this yet. In my Web Adjuster's Javascript execution options, I use Selenium with Headless Chrome or Firefox, but this is quite resource-hungry and slightly unreliable. Maybe one day we'll be able to use Edbrowse for this.
Ok, I think I see where you are headed, and it is quite interesting. I think you could do it from jdb, which is our interactive javascript debugger. After the delay of 30 seconds, which I have already talked about:
jdb
document.documentElement.outerHTML ^> rendered_html
This is a standard feature of the dom, which I made a tentative first step at implementing. It worked slightly before (I could see it adding the tag in your example), and it works a little better after my latest commit, now bringing in all the attributes. This is largely untested and unused, so if you wanted to play with it and point out problems, I'd be happy to fix them; some real-world js on a web page might depend on this working some day, so it would help if it all worked properly.
Karl Dahlke
That's great. If anyone reading this gets undefined, note that you need at least version 3.7.5 of edbrowse (check edbrowse -v), which means if you've installed edbrowse from your distribution's package manager, you need to be running a new enough distribution.
To compile a more recent edbrowse on the Mac:

- sudo port install pcre curl tidy gsed gmake
- build quickjs as described in Edbrowse's README
- in src/Makefile, change "else sed -f" to "else gsed -f" to ensure the GNU version of sed is called, and remove -latomic
- in src, run gmake CFLAGS="-I /opt/local/include"
And yes, some sites do depend on outerHTML, but more depend on innerHTML, and some of them expect innerHTML to be dynamic (like outerHTML currently is). They also expect both innerHTML and outerHTML to have setters that re-parse an HTML fragment and repopulate part of the DOM.
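For illustration, the pattern such sites rely on looks something like this (a hypothetical snippet; the element id is made up):

// assign an HTML fragment and expect the setter to parse it into real DOM nodes
var d = document.getElementById("menu");
d.innerHTML = "<ul><li><a href='/home'>Home</a></li></ul>";
// the script then expects a real UL child, not an unparsed text blob:
if (d.firstChild && d.firstChild.nodeName == "UL") {
    // ... continue wiring up the menu ...
}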
Silas S. Brown wrote on Mon, Nov 15, 2021 at 02:47:19PM -0800:
- Fedora 34 is still stuck on edbrowse 3.7.4 (and it won't install on Fedora 35 because the libtidy package is missing; maybe we should file a bug report with Fedora);
hm? There is no edbrowse package on fedora. It looks like there's an external package from "rpmsphere", whatever that is, but it's not part of fedora.
libtidy is also very much present, but it got a soname update, so that edbrowse package, which requires libtidy.so.5, can't find it because fedora 35 provides libtidy.so.58 instead: that external package just needs to be rebuilt for fedora 35; there's nothing wrong with it.
Ideally getting an edbrowse package upstream instead of an external repo would fix all that, need to start packaging quickjs first though...
Yes I realised my comment was wrong (I'd forgotten I'd installed RPM Fusion on the box), so I edited my comment shortly after writing it. But GitHub still sent the wrong version to anyone subscribed to this thread by email. Sorry about that.
Thank you for the comments on edbrowse packaging and the various distros. We should continue to "encourage" distros to package and release the latest edbrowse, for the benefit of the average user. They are always quite a bit behind, if they provide it at all, and so much has been added recently. In talking about outerHTML: sure, it was added in 3.7.6, but it didn't work very well, and I even made some changes to it recently, as per this thread, that aren't in any "version". Folks should try to clone and build edbrowse from source; it isn't hard to do. There are step-by-step instructions in the wiki, for 32 bit, for 64 bit, for the pi, etc. I am often responsive, fixing bugs and problems quickly, but that only helps if you follow the latest.
Also, we do want to provide static binaries more often, not just on the releases but maybe weekly or some other schedule. We'll keep you up to date if that happens.
Karl Dahlke
Just filed a ticket at MacPorts asking them to update, with a scripted version of the above instructions (they might say "oh, that's not how we write our scripts at MacPorts", but hopefully they can adapt it).
- and MacPorts is still stuck on edbrowse 3.4.10.
I've updated edbrowse in MacPorts to 3.8.2.1 and listed myself as the maintainer so I should notice any future versions becoming available and update the port in short order. If I fail to do so please file a MacPorts ticket or send a MacPorts pull request.