WebWeWant / webwewant.fyi

If you build websites, you inevitably run into problems. Maybe there’s no way to achieve an aspect of your design using CSS. Or maybe there’s a device feature you really wish you could tap into using JavaScript. Or perhaps the in-browser DevTools don’t give you a key insight you need to do your job. We want to hear about it!
https://webwewant.fyi
MIT License

New want: Execute another web page within a secure headless context #44

Closed aarongustafson closed 3 years ago

aarongustafson commented 4 years ago

title: Execute another web page within a secure headless context
date: 2020-04-13T01:59:22.839Z
submitter: PRIVATE
number: 5e93c77a76ae4c1a4ed515ed
tags: []

I would like to make a tool to retrieve metadata from websites. Unfortunately, when a page is rendered with JavaScript, like https://twitter.com/webwewantfyi/status/1131998584147070976, the metadata is not available until the JavaScript has been executed.

CORS (https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS) allows for sharing resources across origins. However, even with that ability, all that can be retrieved from another site is the raw HTML. While DOMParser (https://developer.mozilla.org/en-US/docs/Web/API/DOMParser) provides a way to parse the HTML into a DOM, it doesn't execute the HTML (or the attached JavaScript). Even if the HTML is completely readable by the other origin, it cannot execute it safely.
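A minimal sketch of that limitation, assuming a browser context and a target server that sends permissive CORS headers. The function names `collectOpenGraph` and `fetchOpenGraph` are illustrative and not part of any API. `DOMParser` returns a Document that can be queried, but scripts in the parsed markup stay inert, so metadata added by client-side code never shows up.

```javascript
// Collect Open Graph <meta> tags from an already-parsed Document.
// Works on any object exposing querySelectorAll, such as the Document
// returned by DOMParser.parseFromString().
function collectOpenGraph(doc) {
  const result = {};
  for (const el of doc.querySelectorAll('meta[property^="og:"]')) {
    result[el.getAttribute('property')] = el.getAttribute('content');
  }
  return result;
}

// Browser-only: fetch the raw HTML (CORS permitting) and parse it.
// DOMParser does NOT execute scripts found in the fetched markup.
async function fetchOpenGraph(url) {
  const response = await fetch(url, { credentials: 'omit' });
  const html = await response.text();
  const doc = new DOMParser().parseFromString(html, 'text/html');
  return collectOpenGraph(doc);
}
```

For a server-rendered page this returns everything in the `<head>`; for a client-rendered page it returns only what the server put in the initial response.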

Downloaded applications, like Electron apps, could use Puppeteer (https://pptr.dev/), without asking for any additional permissions, to execute an HTML page and its dependencies. However, a Progressive Web App is unable to do the same, despite having the browser already at its disposal.

It would be incredible if the browser provided a "Headless API" that would allow for fetching and executing a resource in a worker thread, and giving the main thread access to the Document (assuming the response is non-opaque), but not the Window.

Since CORS already provides a mechanism to do this without execution, providing a mechanism to do so with execution does not create any additional security concerns. Dependencies would be opaque to the main thread unless the response had the proper CORS headers to be shared.

aarongustafson commented 4 years ago

Would love to get feedback from @melanierichards @scottlow and maybe @thejohnjansen on this too.

aarongustafson commented 3 years ago

@travisleithead @cwilso @captainbrosset Any thoughts on who we might want to loop in on this?

travisleithead commented 3 years ago

Hmm. Pretty heavyweight request just to get metadata (e.g., is the whole Document needed? Exactly what metadata is being requested? Document content? And at what point in the page's lifecycle?). Perhaps HTML Modules could be a solution, though the script execution semantics might break sites intended to be loaded as top-level.

I have a wealth of other concerns, but need clarity on the use cases before anything else :).

captainbrosset commented 3 years ago

Like Travis said, it'd be great to know more details about the use case at hand, and what metadata the person is looking to access. Do we have a means of contacting them?

Looking at the title: "execute another web page within a secure headless context" sounds a bit odd to me. Executing web pages within secure contexts is what browsers do, and being able to access that context from another origin defeats the purpose of being securely isolated.

If the person has in mind a way to access any web page, execute it, and gather data from it, then a desktop app using Selenium, Puppeteer or Playwright is the way to go. If the person wants to do this within the browser, then a browser extension would do the trick.

If, however, the person wants a way to access web pages they have control over, execute them, and gather data from them, I don't really see a way around using proper CORS headers, parsing the HTML, finding dependencies, loading them, and putting everything in a hidden iframe to take care of the execution part. Or just loading the site in an iframe and communicating with cross-frame message events.
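A rough sketch of the cross-frame message approach, assuming the embedded page is one you control and can add a small script to. The function names and the `{ type: 'metadata' }` message shape are made up for illustration; only `postMessage`, `event.origin`, and `event.data` are platform features.

```javascript
// In the embedded page (one you control): report metadata to the parent frame.
// `targetOrigin` should be the exact origin of the embedding page.
function reportMetadata(targetOrigin) {
  const metadata = { title: document.title };
  window.parent.postMessage({ type: 'metadata', metadata }, targetOrigin);
}

// In the embedding page: accept only well-formed messages from the expected origin.
function readMetadataMessage(event, expectedOrigin) {
  if (event.origin !== expectedOrigin) return null;
  if (!event.data || event.data.type !== 'metadata') return null;
  return event.data.metadata;
}

// Usage in the embedding page (browser only; the origin is a placeholder):
// window.addEventListener('message', (event) => {
//   const metadata = readMetadataMessage(event, 'https://embedded.example');
//   if (metadata) console.log(metadata.title);
// });
```

Checking `event.origin` before trusting `event.data` is what keeps the exchange limited to the page you intended to embed.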

@captainbrosset Any thoughts on who we might want to loop in on this?

Sorry, not sure who to loop in on this without knowing more.

guest271314 commented 3 years ago

Can "metadata" be clearly defined?

guest271314 commented 3 years ago

I would like to make a tool to retrieve metadata from websites.

The term "metadata" can be interpreted narrowly or broadly, see the film A Good American.

Headless API can take screenshots. getDisplayMedia() can take screenshots. Chromium on Linux cannot capture system audio, Firefox can. Using Native Messaging and/or other APIs it is possible to use a local server to fetch external resources, load the resources in a new browser window and get networking, video, and if applicable, audio data from the site, then close the new browser instance.

Am not certain what the specific requirement is?

davidbarratt commented 3 years ago

Ooops. I didn't mean to mark this request as private. Apologies, I made the request. :)

I've been experimenting with a web-based RSS (or other Structured Data) reader: @chickaree. As an example, a user can load any domain and the web app (from the client) will gather the metadata: https://chickar.ee/www.nytimes.com This prompted a different request to be able to bypass CORS for anon read requests which would also benefit apps like @hoppscotch.

I have configured CORS on my server so you can see what that does without a proxy: https://chickar.ee/davidwbarratt.com or on a specific article: https://chickar.ee/davidwbarratt.com/bm9kZS80NQ

This all works well, because the server responds to the XHR with HTML that contains the metadata (anything in the <head>) of the document.

However, this breaks down completely when an HTML page is almost entirely generated on the client and does not contain any useful metadata in the <head>, as an example: https://chickar.ee/twitter.com/d2Vid2V3YW50ZnlpL3N0YXR1cy8xMTMxOTk4NTg0MTQ3MDcwOTc2 which should give us at least the <title> of the page from: https://twitter.com/webwewantfyi/status/1131998584147070976

The Web We Want on Twitter: "If you could wave a magic wand and address a limitation of the web platform or DevTools, what would it be? Share your ideas with us! https://t.co/wBdrwuz7Vl" / Twitter

but it cannot because I can't find a way to securely execute the JavaScript from the source provided by the user.

I could load the page in an iframe, but since it is cross-origin, I can't read the contents of the iframe (even if the CORS headers allow the content of the page to be read). I guess I could set up a proxy for the iframe (with the sandbox attribute) so it will be the same origin, but the proxy would also need to rewrite any relative script URLs (which may be impossible if the script is loaded by another script, ugh).
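For the script URLs that are declared statically in the markup, the rewriting step could look roughly like this. It is only a sketch: `absolutizeScriptUrls` and the example URLs are hypothetical, and it does nothing for scripts that are injected by other scripts at runtime, which is the hard case.

```javascript
// Resolve each statically declared script URL against the original page URL,
// so the markup can be served from a different (proxy) origin.
function absolutizeScriptUrls(scriptSrcs, pageUrl) {
  return scriptSrcs.map((src) => new URL(src, pageUrl).href);
}

const resolved = absolutizeScriptUrls(
  ['/static/app.js', 'vendor.js', 'https://cdn.example/lib.js'],
  'https://site.example/articles/post.html'
);
console.log(resolved);
```

Root-relative and path-relative URLs are resolved against the original page, while URLs that are already absolute pass through unchanged.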

If this were a client application, there would be several ways to securely execute a page's JavaScript at any URL and get the content of that page. This feature should obviously only work if the server responds with the proper CORS headers, but if I can already read the content of the page (because of CORS), then why isn't there a way to execute that page's JavaScript securely (i.e. without modifying/reading/knowing the page the user is actually on)?

I guess to simplify the request, it would be that CORS headers would apply to iframes (or maybe new/similar headers?). This way I could load the page in a hidden iframe, get the metadata and then destroy it. It would be nice if you didn't even have to load the iframe into the DOM though and you could execute it from a worker, which would also prevent the browser from trying to load images or stylesheets which we don't care about in this context.

I imagine a feature like this could be helpful for other types of apps that run testing on user-provided URLs.

I hope this helps!

davidbarratt commented 3 years ago

I guess another solution is that I could run a server somewhere with something like @puppeteer doing the leg work to generate HTML from the client page, but this seems like a ton of overhead for something that the client's browser already knows how to do.

Another problem is that it's rather difficult to determine if a page needs to be "executed" or if the HTML can just be read. Therefore everything would have to be read initially and then executed in the background. A worker in the browser seems like the ideal way to do this, if it were possible.

guest271314 commented 3 years ago

Can you explain what

However, this breaks down completely when an HTML page is almost entirely generated on the client

Not gathering what is meant here.

but it cannot because I can't find a way to securely execute the JavaScript from the source provided by the user

Are you just trying to get <head> in HTML loaded in an arbitrary document?

You might have surmised or accurately estimated when to get generated HTML from the sites mentioned, though are you expecting that same result at any website?

If this were a client application, there would be several ways to securely execute a page's JavaScript at any URL and get the content of that page.

Does not running a browser in headless mode or using Puppeteer already provide a means to do this?

Are you able to achieve the requirement using curl, wget, Netcat or other native application at the command line or running a shell script?

guest271314 commented 3 years ago

@davidbarratt Are you trying to intercept arbitrary JavaScript code executed in the page?

guest271314 commented 3 years ago

What is the specificity of the metadata required? Title of the document? Text and, or media in the document?

If the sites were cooperative with the requirement HTML Microdata provides machine readable data embedded in the document, which can be parsed https://stackoverflow.com/a/30201828. However, it does not appear that the sites are common or have common markup or scripts which render data in the document. Which leads back to what is the exact metadata that is trying to be retrieved from the arbitrary web site?

davidbarratt commented 3 years ago

In the example I provided: https://twitter.com/webwewantfyi/status/1131998584147070976

If you load the page without JavaScript, or with curl or wget you'll get a relevant <head> like this:

<meta property="og:site_name" content="Twitter" />

If you load the page with JavaScript enabled you get a relevant <head> like this:

<meta property="og:site_name" content="Twitter">
<link rel="canonical" href="https://twitter.com/webwewantfyi/status/1131998584147070976">
<link rel="mask-icon" sizes="any" href="https://abs.twimg.com/responsive-web/client-web/icon-svg.9e211f65.svg" color="#1da1f2">
<meta content="https://twitter.com/webwewantfyi/status/1131998584147070976" property="og:url" data-rh="true">
<meta content="The Web We Want on Twitter" property="og:title" data-rh="true">
<meta content="https://pbs.twimg.com/profile_images/1130909151121383424/i5im6eMc_normal.png" property="og:image" data-rh="true">
<meta content="If you could wave a magic wand and address a limitation of the web platform or DevTools, what would it be? Share your ideas with us! https://t.co/wBdrwuz7Vl" property="og:description" data-rh="true">

Are you just trying to get <head> in HTML loaded in an arbitrary document?

Yes.

Does not running a browser in headless mode or using Puppeteer already provide a means to do this?

It does. However, you cannot run the browser in headless mode from the browser itself, which is what I am asking for.

Are you able to achieve the requirement using curl, wget, Netcat or other native application at the command line or running a shell script?

No, since programs like curl do not execute a page's JavaScript.

Are you trying to intercept arbitrary JavaScript code executed in the page?

No? I want the page to execute in "headless" mode, securely, from another page. Then get the "result" of that page (i.e. the modified HTML)

What is the specificity of the metadata required? Title of the document? Text and, or media in the document?

I'm working with @schemaorg metadata, but I also fall back to Open Graph metadata. I'm open to any metadata standard that people want to use to mark up their documents.
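The fallback could be sketched roughly as below. This assumes the schema.org data has already been parsed from a JSON-LD block into a plain object and the Open Graph tags into a name-to-value map; both input shapes and the `resolveMetadata` name are assumptions for illustration. It also simplifies schema.org `image`, which can be an object or a list rather than a string.

```javascript
// Prefer schema.org (JSON-LD) fields and fall back to Open Graph properties.
// `jsonLd` is a parsed JSON-LD object (or null); `openGraph` maps og:* names to values.
function resolveMetadata(jsonLd, openGraph) {
  const ld = jsonLd || {};
  const og = openGraph || {};
  return {
    title: ld.headline || ld.name || og['og:title'] || null,
    description: ld.description || og['og:description'] || null,
    image: ld.image || og['og:image'] || null,
    url: ld.url || og['og:url'] || null,
  };
}

// With no schema.org data, the Open Graph values are used.
const fromOpenGraph = resolveMetadata(null, {
  'og:title': 'The Web We Want on Twitter',
  'og:url': 'https://twitter.com/webwewantfyi/status/1131998584147070976',
});
console.log(fromOpenGraph.title);
```

Fields that neither source provides come back as `null`, which makes it easy to tell when a page yielded nothing useful without executing its JavaScript.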

If the sites were cooperative with the requirement HTML Microdata provides machine readable data embedded in the document, which can be parsed https://stackoverflow.com/a/30201828. However, it does not appear that the sites are common or have common markup or scripts which render data in the document. Which leads back to what is the exact metadata that is trying to be retrieved from the arbitrary web site?

There isn't a requirement that a page provide this metadata without executing some JavaScript. As an example, the Facebook Sharing Debugger is able to execute the JavaScript of the example provided and return the metadata from the page: https://developers.facebook.com/tools/debug/?q=https%3A%2F%2Ftwitter.com%2Fwebwewantfyi%2Fstatus%2F1131998584147070976 even though that metadata is not available without executing the page's JavaScript.

Google's Structured Data Testing Tool does not find any @schemaorg metadata, but does return the full (JavaScript-rendered) HTML page: https://search.google.com/structured-data/testing-tool/u/0/#url=https%3A%2F%2Ftwitter.com%2Fwebwewantfyi%2Fstatus%2F1131998584147070976

My ask is that the browser would be able to securely perform "Headless" actions from within a web page. Why should the Facebook Sharing Debugger or the Google Structured Data Testing Tool need to rely on a server in order to do this, when the browser is capable of doing it already?

davidbarratt commented 3 years ago

I can imagine an API like Worker where it would be something like:

const myHeadless = new Headless('/path/to/document.html');

which would make a request to the document in a new thread and execute it (assuming it complied with the same-origin policy or had the appropriate CORS headers). Then there would be methods to do things like wait for network idle, access the Document, etc.

aarongustafson commented 3 years ago

Ooops. I didn't mean to mark this request as private. Apologies, I made the request. :)

That's our system. When the forms come in, we strip the names to maintain privacy just in case the person wants to keep that info private. 🙂

guest271314 commented 3 years ago

If you load the page without JavaScript, or with curl or wget you'll get a relevant <head> like this:

That is the prerogative of the site, which can decide to include the relevant machine readable markup in their source code at any time.

This is what got here loading the page with JavaScript enabled

<head>
        <meta name="HandheldFriendly" content="True">
    <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=0">
    <link rel="canonical" href="https://twitter.com/webwewantfyi/status/1131998584147070976">
    <meta name="twitter-redirect-url" content="twitter://status?status_id=1131998584147070976">
    <meta name="twitter-redirect-srcs" content="{&quot;pwreset-iphone&quot;:true,&quot;android&quot;:true,&quot;email&quot;:true}">
    <link href="https://ma.twimg.com/twitter-mobile/a8491f04da4e54be98cff3a86c58f23a770c26bd/images/favicon.ico" rel="icon" type="image/x-icon">
    <title>Twitter</title>
      <link href="https://ma.twimg.com/twitter-mobile/a8491f04da4e54be98cff3a86c58f23a770c26bd/assets/as.css" inline="false" media="screen" rel="stylesheet" type="text/css">
    <script async="" src="https://www.google-analytics.com/analytics.js"></script><script src="https://ma.twimg.com/twitter-mobile/a8491f04da4e54be98cff3a86c58f23a770c26bd/javascripts/framebust.js" type="text/javascript"></script>
    <script src="https://ma.twimg.com/twitter-mobile/a8491f04da4e54be98cff3a86c58f23a770c26bd/javascripts/serviceworker.js" type="text/javascript" async="" defer=""></script>
    <meta name="google-site-verification" content="V0yIS0Ec_o3Ii9KThrCoMCkwTYMMJ_JYx_RSaGhFYvw">

  </head>

disabled

<head>
        <meta name="HandheldFriendly" content="True">
    <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=0">
    <link rel="canonical" href="https://twitter.com/webwewantfyi/status/1131998584147070976">
    <meta name="twitter-redirect-url" content="twitter://status?status_id=1131998584147070976">
    <meta name="twitter-redirect-srcs" content="{&quot;pwreset-iphone&quot;:true,&quot;android&quot;:true,&quot;email&quot;:true}">
    <link href="https://ma.twimg.com/twitter-mobile/a8491f04da4e54be98cff3a86c58f23a770c26bd/images/favicon.ico" rel="icon" type="image/x-icon">
    <title>Twitter</title>
      <link href="https://ma.twimg.com/twitter-mobile/a8491f04da4e54be98cff3a86c58f23a770c26bd/assets/as.css" inline="false" media="screen" rel="stylesheet" type="text/css">
    <script async="" src="https://www.google-analytics.com/analytics.js"></script><script src="https://ma.twimg.com/twitter-mobile/a8491f04da4e54be98cff3a86c58f23a770c26bd/javascripts/framebust.js" type="text/javascript"></script>
    <script src="https://ma.twimg.com/twitter-mobile/a8491f04da4e54be98cff3a86c58f23a770c26bd/javascripts/serviceworker.js" type="text/javascript" async="" defer=""></script>
    <meta name="google-site-verification" content="V0yIS0Ec_o3Ii9KThrCoMCkwTYMMJ_JYx_RSaGhFYvw">

  </head>

Would suggest to capture what the site actually serves by default to store actual raw data, instead of attempting to massage the sites' implicit metadata that the site is deliberately withholding in lieu of some user action - the semantic web should be semantic, not based on coercing user action just to get the textual, machine readable data that the site should serve by default in their markup.

It does. However, you cannot run the browser in headless mode from the browser itself, which is what I am asking for.

Yes, you can. Using Native Messaging or QuicTransport(), e.g., see https://github.com/guest271314/captureSystemAudio#stream-file-being-written-at-local-filesystem-to-mediasource-capture-as-mediastream-record-with-mediarecorder-in-real-time where we stream STDOUT from a shell script to the browser context and the code at https://github.com/WebAudio/web-audio-api-v2/issues/97 where we turn on and off a local server that extension code can fetch() from at any origin and stream STDOUT to the browser context using fetch() with AbortController, where we can, if necessary, launch Nightly from Chromium using including but not limited to the above patterns at https://gist.github.com/guest271314/04a539c00926e15905b86d05138c113c and terminate Nightly, or a new instance of Chromium, or both, when the task is complete.

No? I want the page to execute in "headless" mode, securely, from another page. Then get the "result" of that page (i.e. the modified HTML)

There really is no such thing as "securely" when the subject matter is the web. See the link to the film A Good American, above. The requirement is achievable right now, without being tested and verified as "securely".

Am currently testing QuicTransport() which also offers the ability to execute arbitrary code on the machine, and get STDOUT back as a stream.

That does not necessarily solve the problem statement here though, which is to wait, or do some user activity for some HTML to be inserted into the document. What user action? What part of the site executes which JavaScript function?

I'm open to any metadata standard that people want to use to markup their documents.

schema.org is extensible. You can define your own standards and definitions, within and without schema, e.g., https://github.com/guest271314/Definition.

Meaning, the problem is that sites are including what has been presented as metadata as results of JavaScript functions or user activity, if I am reading the requirement correctly, which means those sites are acting contrary to the purpose of semantic web principles and including machine readable data in the site at all. Sites should not be rewarded for that by developers creating workarounds just to load their JavaScript. Instead, an alternative approach is to just report the facts as they are, and by doing so, along with notifying the sites that you are requesting content from, perhaps make them aware without excuse that their sites need to serve machine readable data in the default markup without JavaScript necessary, or developers will not be going out of their way to run their unrelated code.

guest271314 commented 3 years ago

If Twitter can ban an user for no reason other than posting primary source documents then ask for a copy of the users' government issued identification https://github.com/guest271314/banned#show-me-your-papers, essentially acting as a government agent at that point (what other data would they be comparing such requested PII data to given never was asked for a gov'ment issued identification to sign up for Twitter?) when asked the site why was banned - that site can take the time to include at least basic HTML Microdata in their default markup - and that goes for any site. Basic Microdata is just that, basic, and can be as elaborate as the site wants. Otherwise, it is the duty of the developer to report the site as it is, else the site might not ever get around to marking up their site with machine readable data, if developers jump through their bells and whistles without correctly pointing out the fixable deficiencies.

guest271314 commented 3 years ago

@davidbarratt An example of using QuicTransport https://github.com/guest271314/quictransport/tree/main/speech-synthesis (see https://github.com/GoogleChrome/samples/tree/gh-pages/quictransport) to execute espeak-ng installed on the OS and get stdout in the browser. You should be able to adjust the code to perform arbitrary tasks, including launching a separate instance of Chromium or Nightly (be sure to use --user-data-dir=/path/to/dir and create a separate profile at Mozilla browsers firefox -CreateProfile "profile_name") to realize the requirement described at OP, that is, launch a separate browser instance, and try to get data from the document. You might still run into issues due to CSP (am able to run the code at any site that does not block requests other than to self, which we can probably work around). If you have any issues or questions during attempts do not hesitate to give a ping and will test what you have put together.

guest271314 commented 3 years ago

A relevant resource: https://developers.google.com/web/tools/puppeteer/articles/ssr

document.head.outerHTML of the URL using --repl. Am not sure if the output is what is expected.

$ ~/chrome-linux/chrome-wrapper --repl --password-store=basic --user-data-dir=/tmp --headless --disable-gpu --crash-dumps-dir=./tmp https://twitter.com/webwewantfyi/status/1131998584147070976
[1027/023045.705710:INFO:headless_shell.cc(448)] Type a Javascript expression to evaluate or "quit" to exit.
>>> document.head.outerHTML
{"result":{"type":"string","value":"\u003Chead>\u003Cmeta charset=\"utf-8\">\n\u003Cmeta name=\"viewport\" content=\"width=device-width,initial-scale=1,maximum-scale=1,user-scalable=0,viewport-fit=cover\">\n\u003Clink rel=\"preconnect\" href=\"//abs.twimg.com\">\n\u003Clink rel=\"preconnect\" href=\"//api.twitter.com\">\n\u003Clink rel=\"preconnect\" href=\"//pbs.twimg.com\">\n\u003Clink rel=\"preconnect\" href=\"//t.co\">\n\u003Clink rel=\"preconnect\" href=\"//video.twimg.com\">\n\u003Clink rel=\"dns-prefetch\" href=\"//abs.twimg.com\">\n\u003Clink rel=\"dns-prefetch\" href=\"//api.twitter.com\">\n\u003Clink rel=\"dns-prefetch\" href=\"//pbs.twimg.com\">\n\u003Clink rel=\"dns-prefetch\" href=\"//t.co\">\n\u003Clink rel=\"dns-prefetch\" href=\"//video.twimg.com\">\n\u003Clink rel=\"preload\" as=\"script\" crossorigin=\"anonymous\" href=\"https://abs.twimg.com/responsive-web/client-web/polyfills.06981235.js\" nonce=\"\">\n\u003Clink rel=\"preload\" as=\"script\" crossorigin=\"anonymous\" href=\"https://abs.twimg.com/responsive-web/client-web/vendors~main.aee47a35.js\" nonce=\"\">\n\u003Clink rel=\"preload\" as=\"script\" crossorigin=\"anonymous\" href=\"https://abs.twimg.com/responsive-web/client-web/i18n/en.c41c06d5.js\" nonce=\"\">\n\u003Clink rel=\"preload\" as=\"script\" crossorigin=\"anonymous\" href=\"https://abs.twimg.com/responsive-web/client-web/main.31523a25.js\" nonce=\"\">\n\u003Cmeta property=\"fb:app_id\" content=\"2231777543\">\n\u003Cmeta property=\"og:site_name\" content=\"Twitter\">\n\u003Cmeta name=\"google-site-verification\" content=\"V0yIS0Ec_o3Ii9KThrCoMCkwTYMMJ_JYx_RSaGhFYvw\">\n\u003Clink rel=\"manifest\" href=\"/manifest.json\" crossorigin=\"use-credentials\">\n\u003Clink rel=\"alternate\" hreflang=\"x-default\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976\">\n\u003Clink rel=\"alternate\" hreflang=\"ar\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=ar\">\n\u003Clink rel=\"alternate\" 
hreflang=\"bg\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=bg\">\n\u003Clink rel=\"alternate\" hreflang=\"bn\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=bn\">\n\u003Clink rel=\"alternate\" hreflang=\"ca\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=ca\">\n\u003Clink rel=\"alternate\" hreflang=\"cs\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=cs\">\n\u003Clink rel=\"alternate\" hreflang=\"da\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=da\">\n\u003Clink rel=\"alternate\" hreflang=\"de\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=de\">\n\u003Clink rel=\"alternate\" hreflang=\"el\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=el\">\n\u003Clink rel=\"alternate\" hreflang=\"en\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=en\">\n\u003Clink rel=\"alternate\" hreflang=\"en-GB\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=en-GB\">\n\u003Clink rel=\"alternate\" hreflang=\"es\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=es\">\n\u003Clink rel=\"alternate\" hreflang=\"eu\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=eu\">\n\u003Clink rel=\"alternate\" hreflang=\"fa\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=fa\">\n\u003Clink rel=\"alternate\" hreflang=\"fi\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=fi\">\n\u003Clink rel=\"alternate\" hreflang=\"tl\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=tl\">\n\u003Clink rel=\"alternate\" hreflang=\"fr\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=fr\">\n\u003Clink rel=\"alternate\" hreflang=\"ga\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=ga\">\n\u003Clink rel=\"alternate\" 
hreflang=\"gl\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=gl\">\n\u003Clink rel=\"alternate\" hreflang=\"gu\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=gu\">\n\u003Clink rel=\"alternate\" hreflang=\"he\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=he\">\n\u003Clink rel=\"alternate\" hreflang=\"hi\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=hi\">\n\u003Clink rel=\"alternate\" hreflang=\"hr\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=hr\">\n\u003Clink rel=\"alternate\" hreflang=\"hu\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=hu\">\n\u003Clink rel=\"alternate\" hreflang=\"id\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=id\">\n\u003Clink rel=\"alternate\" hreflang=\"it\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=it\">\n\u003Clink rel=\"alternate\" hreflang=\"ja\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=ja\">\n\u003Clink rel=\"alternate\" hreflang=\"kn\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=kn\">\n\u003Clink rel=\"alternate\" hreflang=\"ko\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=ko\">\n\u003Clink rel=\"alternate\" hreflang=\"mr\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=mr\">\n\u003Clink rel=\"alternate\" hreflang=\"ms\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=ms\">\n\u003Clink rel=\"alternate\" hreflang=\"nb\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=nb\">\n\u003Clink rel=\"alternate\" hreflang=\"nl\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=nl\">\n\u003Clink rel=\"alternate\" hreflang=\"pl\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=pl\">\n\u003Clink rel=\"alternate\" hreflang=\"pt\" 
href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=pt\">\n\u003Clink rel=\"alternate\" hreflang=\"ro\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=ro\">\n\u003Clink rel=\"alternate\" hreflang=\"ru\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=ru\">\n\u003Clink rel=\"alternate\" hreflang=\"sk\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=sk\">\n\u003Clink rel=\"alternate\" hreflang=\"sr\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=sr\">\n\u003Clink rel=\"alternate\" hreflang=\"sv\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=sv\">\n\u003Clink rel=\"alternate\" hreflang=\"ta\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=ta\">\n\u003Clink rel=\"alternate\" hreflang=\"th\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=th\">\n\u003Clink rel=\"alternate\" hreflang=\"tr\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=tr\">\n\u003Clink rel=\"alternate\" hreflang=\"uk\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=uk\">\n\u003Clink rel=\"alternate\" hreflang=\"ur\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=ur\">\n\u003Clink rel=\"alternate\" hreflang=\"vi\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=vi\">\n\u003Clink rel=\"alternate\" hreflang=\"zh\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=zh\">\n\u003Clink rel=\"alternate\" hreflang=\"zh-Hant\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976?lang=zh-Hant\">\n\u003Clink rel=\"canonical\" href=\"https://twitter.com/webwewantfyi/status/1131998584147070976\">\n\u003Clink rel=\"search\" type=\"application/opensearchdescription+xml\" href=\"/opensearch.xml\" title=\"Twitter\">\n\u003Clink rel=\"mask-icon\" sizes=\"any\" 
href=\"https://abs.twimg.com/responsive-web/client-web/icon-svg.9e211f65.svg\" color=\"#1da1f2\">\n\u003Clink rel=\"shortcut icon\" href=\"//abs.twimg.com/favicons/twitter.ico\" type=\"image/x-icon\">\n\u003Clink rel=\"apple-touch-icon\" sizes=\"192x192\" href=\"https://abs.twimg.com/responsive-web/client-web/icon-ios.8ea219d5.png\">\n\u003Cmeta name=\"mobile-web-app-capable\" content=\"yes\">\n\u003Cmeta name=\"apple-mobile-web-app-title\" content=\"Twitter\">\n\u003Cmeta name=\"apple-mobile-web-app-status-bar-style\" content=\"white\">\n\u003Cmeta name=\"theme-color\" content=\"#FFFFFF\">\n\u003Cmeta http-equiv=\"origin-trial\" content=\"Apir4chqTX+4eFxKD+ErQlKRB/VtZ/dvnLfd9Y9Nenl5r1xJcf81alryTHYQiuUlz9Q49MqGXqyaiSmqWzHUqQwAAABneyJvcmlnaW4iOiJodHRwczovL3R3aXR0ZXIuY29tOjQ0MyIsImZlYXR1cmUiOiJDb250YWN0c01hbmFnZXIiLCJleHBpcnkiOjE1NzUwMzUyODMsImlzU3ViZG9tYWluIjp0cnVlfQ==\">\n\n\u003Cstyle>html,body{height: 100%;}body{-ms-overflow-style:scrollbar;overflow-y:scroll;overscroll-behavior-y:none;}\u003C/style>\n\u003Cstyle id=\"react-native-stylesheet\">[stylesheet-group=\"0\"]{}\nhtml{-ms-text-size-adjust:100%;-webkit-text-size-adjust:100%;-webkit-tap-highlight-color:rgba(0,0,0,0);}\nbody{margin:0;}\nbutton::-moz-focus-inner,input::-moz-focus-inner{border:0;padding:0;}\ninput::-webkit-inner-spin-button,input::-webkit-outer-spin-button,input::-webkit-search-cancel-button,input::-webkit-search-decoration,input::-webkit-search-results-button,input::-webkit-search-results-decoration{display:none;}\n[stylesheet-group=\"0.1\"]{}\n:focus:not([data-focusvisible-polyfill]){outline: none;}\n[stylesheet-group=\"1\"]{}\n.css-1dbjc4n{-ms-flex-align:stretch;-ms-flex-direction:column;-ms-flex-negative:0;-ms-flex-preferred-size:auto;-webkit-align-items:stretch;-webkit-box-align:stretch;-webkit-box-direction:normal;-webkit-box-orient:vertical;-webkit-flex-basis:auto;-webkit-flex-direction:column;-webkit-flex-shrink:0;align-items:stretch;border:0 solid 
black;box-sizing:border-box;display:-webkit-box;display:-moz-box;display:-ms-flexbox;display:-webkit-flex;display:flex;flex-basis:auto;flex-direction:column;flex-shrink:0;margin-bottom:0px;margin-left:0px;margin-right:0px;margin-top:0px;min-height:0px;min-width:0px;padding-bottom:0px;padding-left:0px;padding-right:0px;padding-top:0px;position:relative;z-index:0;}\n.css-901oao{border:0 solid black;box-sizing:border-box;color:rgba(0,0,0,1.00);display:inline;font:14px system-ui,-apple-system,BlinkMacSystemFont,\"Segoe UI\",Roboto,Ubuntu,\"Helvetica Neue\",sans-serif;margin-bottom:0px;margin-left:0px;margin-right:0px;margin-top:0px;padding-bottom:0px;padding-left:0px;padding-right:0px;padding-top:0px;white-space:pre-wrap;word-wrap:break-word;}\n.css-16my406{color:inherit;font:inherit;white-space:inherit;}\n[stylesheet-group=\"2\"]{}\n.r-13awgt0{-ms-flex-negative:1;-ms-flex-positive:1;-ms-flex-preferred-size:0%;-webkit-box-flex:1;-webkit-flex-basis:0%;-webkit-flex-grow:1;-webkit-flex-shrink:1;flex-basis:0%;flex-grow:1;flex-shrink:1;}\n.r-4qtqp9{display:inline-block;}\n.r-ywje51{margin-bottom:auto;margin-left:auto;margin-right:auto;margin-top:auto;}\n.r-hvic4v{display:none;}\n.r-1adg3ll{display:block;}\n[stylesheet-group=\"2.2\"]{}\n.r-12vffkv>*{pointer-events:auto;}\n.r-12vffkv{pointer-events:none!important;}\n.r-14lw9ot{background-color:rgba(255,255,255,1.00);}\n.r-1p0dtai{bottom:0px;}\n.r-1d2f490{left:0px;}\n.r-1xcajam{position:fixed;}\n.r-zchlnj{right:0px;}\n.r-ipm5af{top:0px;}\n.r-yyyyoo{fill:currentcolor;}\n.r-1xvli5t{height:1.25em;}\n.r-dnmrzs{max-width:100%;}\n.r-bnwqim{position:relative;}\n.r-1plcrui{vertical-align:text-bottom;}\n.r-lrvibr{-moz-user-select:none;-ms-user-select:none;-webkit-user-select:none;user-select:none;}\n.r-13gxpu9{color:rgba(29,161,242,1.00);}\n.r-wy61xf{height:72px;}\n.r-u8s1d{position:absolute;}\n.r-1blnp2b{width:72px;}\n.r-1ykxob0{top:60%;}\n.r-1b2b6em{line-height:2em;}\n.r-q4m81j{text-align:center;}\u003C/style>\n\n\n\u003Cscript 
charset=\"utf-8\" src=\"https://abs.twimg.com/responsive-web/client-web/sharedCore.4523e665.js\">\u003C/script>\u003Cscript charset=\"utf-8\" src=\"https://abs.twimg.com/responsive-web/client-web/ondemand.Dropdown.1e9bc215.js\">\u003C/script>\u003Cscript charset=\"utf-8\" src=\"https://abs.twimg.com/responsive-web/client-web/loader.AppModules.eec32db5.js\">\u003C/script>\u003Cscript charset=\"utf-8\" src=\"https://abs.twimg.com/responsive-web/client-web/loader.SideNav.de4cd0c5.js\">\u003C/script>\u003Cscript charset=\"utf-8\" src=\"https://abs.twimg.com/responsive-web/client-web/bundle.Conversation.892a9fe5.js\">\u003C/script>\u003Cscript charset=\"utf-8\" src=\"https://abs.twimg.com/responsive-web/client-web/bundle.NetworkInstrument.2234ae85.js\">\u003C/script>\u003Ctitle>Twitter\u003C/title>\u003Cmeta content=\"Twitter\" property=\"og:title\" data-rh=\"true\">\u003Cscript charset=\"utf-8\" src=\"https://abs.twimg.com/responsive-web/client-web/ondemand.BranchSdk.0b1f40b5.js\">\u003C/script>\u003Cscript charset=\"utf-8\" src=\"https://abs.twimg.com/responsive-web/client-web/ondemand.emoji.en.3499cef5.js\">\u003C/script>\u003C/head>"}}
guest271314 commented 3 years ago

FWIW, a semantic, machine-readable HTML document (HTML Microdata and VCard) that does not rely on JavaScript or other code execution could look something like the markup at the end of https://stackoverflow.com/revisions/9d54b768-a2cd-466f-9fbc-e03d4d610f94/view-source. Parsing that markup with the code included at the link (using jQuery), or with the JavaScript (without jQuery) answer to the question, yields the output below. The parsing code can be expanded to also handle data- attributes and <data> elements.

{
  "item": [
    [
      "itemscope itemtype itemid itemref",
      "itemscopehttp://schema.org/docs/full.html http://n.whatwg.org/workhttp://get-microdata-example.html"
    ],
    [
      "author",
      "by guest271314 November 1, 2012"
    ],
    [
      "datePublished",
      "November 1, 2012"
    ],
    [
      "name",
      "get-microdata.js µ."
    ],
    [
      "encodesCreativeWork",
      "µ"
    ],
    [
      "author publisher",
      "guest271314 @guest271314."
    ],
    [
      "replyToUrl",
      "@guest271314"
    ],
    [
      "releaseNotes",
      "homeget-microdata.js - Documentation."
    ],
    [
      "dateCreated",
      "2012-06-01."
    ],
    [
      "datePublished",
      "2012-11-01."
    ],
    [
      "dateModified",
      "2012-11-01."
    ],
    [
      "softwareVersion",
      "1.0."
    ],
    [
      "about",
      "Get and display microdata in HTML document."
    ],
    [
      "description featureList",
      "Outlines and returns microdata items, names, properties, and values and in HTML document."
    ],
    [
      "audience",
      "Authors, publishers, developers."
    ],
    [
      "browserRequirements",
      "jQuery."
    ],
    [
      "fileFormat encodings",
      "Type: application/javascript; charset: utf-8."
    ],
    [
      "applicationCategory",
      "HTML5 Microdata; jQuery; JavaScript."
    ],
    [
      "fileSize",
      "18921 bytes."
    ],
    [
      "processorRequirements",
      "Place 'get-microdata-min' src='get-microdata-min.js' beforetag in HTML document."
    ],
    [
      "publishingPrinciples",
      "The software is developed in the hope that it will be useful."
    ],
    [
      "applicationSubCategory",
      "Single Page Application."
    ],
    [
      "work downloadUrl",
      "See get-microdata-min.js - View source."
    ],
    [
      "license",
      "Copyright (C) 2012, 2015 guest271314 @guest271314 All rights reserved. MIT license"
    ],
    [
      "breadcrumb associatedArticle",
      "http://www.whatwg.org/specs/web-apps/current-work/#microdataaboutHTML Standard - 5 Microdata"
    ],
    [
      "breadcrumb associatedArticle",
      "http://en.wikipedia.org/wiki/Microdata_(HTML)aboutWikipedia - Microdata (HTML)"
    ],
    [
      "breadcrumb associatedArticle",
      "http://schema.org/docs/full.htmlaboutschema.org - The Type Heirarchy"
    ],
    [
      "breadcrumb associatedArticle",
      "http://en.wikipedia.org/wiki/Single_page_applicationWikipedia - Single-page application"
    ],
    [
      "copyrightHolder",
      "Copyright (C) 2012, 2015 guest271314"
    ],
    [
      "datePublished",
      "2012-06-26"
    ],
    [
      "dateModified",
      "2012-11-01"
    ]
  ]
}

More elaborate examples include the ability to export the VCard and/or Microdata from the document, without necessarily using scripting.
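
For illustration only, here is a minimal sketch of how output shaped like the JSON above could be collected without jQuery. It is not the code from the linked answer: the function name `collectItemprops` is hypothetical, and it only pairs each element's `itemprop` attribute with its text content (it does not reproduce the combined itemscope/itemtype/itemid/itemref entry).

```javascript
// Hypothetical helper, not the code from the linked Stack Overflow answer.
// Collects [itemprop names, text] pairs from any root that supports
// querySelectorAll, such as a Document or an Element.
function collectItemprops(root) {
  const pairs = Array.from(root.querySelectorAll("[itemprop]"), (el) => [
    // The raw attribute value, which may list several space-separated names.
    el.getAttribute("itemprop"),
    // Collapse runs of whitespace so each value reads as a single line.
    el.textContent.replace(/\s+/g, " ").trim(),
  ]);
  return { item: pairs };
}
```

In a browser this could be called on markup that was fetched but never executed, for example `collectItemprops(new DOMParser().parseFromString(html, "text/html"))`, where `html` is assumed to hold the document source as a string.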

github-actions[bot] commented 3 years ago

This has gotten stale. Take a look or close it out.

guest271314 commented 3 years ago

See https://bugs.chromium.org/p/chromium/issues/detail?id=1131236#c43