N0taN3rd / node-warc

Parse And Create Web ARChive (WARC) files with node.js
MIT License
92 stars 20 forks

Two captured URLs are not being properly read by Webrecorder Player? #27

Open hanoii opened 5 years ago

hanoii commented 5 years ago

I successfully (I think) captured and generated a warc file using https://electronjs.org/docs/api/debugger.

I tried a simple site: www.drupal.org

If I capture the first load, it seems to work nicely; Webrecorder Player shows it perfectly.

However, if I navigate to "Developers" and then store both the homepage and this page in the warc file, it doesn't seem to work. I do see the data in the warc file, though.

I guess something is missing from the warc file or I am missing something. Any ideas?

Other than that, I am super happy to see this working. It might even be worth contributing this warc generator to this package.

N0taN3rd commented 5 years ago

node-warc welcomes all contributions!

My guess is that you are not writing a single warc info record, using writeWebrecorderBookmarksInfoRecord, that contains all the URLs of the pages you wish to be viewable via WR player.

To fix that, you can wait to append that record until the very end, after capturing all the pages, or view them using pywb, which has no such restriction. Ultimately, WR player and WR itself use pywb as the replay system.
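
A minimal sketch of the "wait until the very end" idea: remember each page URL as navigation happens, so the complete list is on hand when the info record is finally written. `createPageTracker` is a hypothetical helper, not part of node-warc, and the sketch assumes it is fed from something like the webview's `did-navigate` event.

```javascript
// Hypothetical helper (not part of node-warc): remembers each visited page
// URL once, in visit order, so the full list is available at write time.
function createPageTracker () {
  const seen = new Set()
  return {
    // Call on every top-level navigation, e.g. from the webview's
    // 'did-navigate' event handler.
    add (url) {
      seen.add(url)
    },
    // Returns the URLs in the order they were first visited.
    list () {
      return Array.from(seen)
    }
  }
}

const tracker = createPageTracker()
tracker.add('https://www.drupal.org/')
tracker.add('https://www.drupal.org/developers')
tracker.add('https://www.drupal.org/') // a revisit does not add a duplicate
console.log(tracker.list())
```

The array returned by `tracker.list()` is what would be handed to the writer once capturing is finished.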

hanoii commented 5 years ago

Hmm, oddly I also tried pywb, but it didn't display anything. Will look. I am basically just capturing everything from a webview tag; I navigated to just one URL and then stored all packages in a warc file, almost the same as with the remoteChromeGenerator.

N0taN3rd commented 5 years ago

Have you tried using puppeteer rather than electron? I have found that a full browser, either Chrome or Chromium (brought in via puppeteer), controlled via puppeteer or chrome-remote-interface, produces better results and is easier to use.

N0taN3rd commented 5 years ago

Ultimately the best advice I can give without seeing how you are doing the capturing (either src or minimal working example) is to treat each page as a standalone WARC that is either appended to a single WARC or written to its own WARC with concatenation done afterwards.
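
The "concatenation done afterwards" option can be sketched with plain Node.js file APIs. A WARC file is a sequence of records, so appending the bytes of one valid WARC to another yields a valid WARC. `concatWarcs` and the file names below are hypothetical, and the demo uses placeholder bytes rather than real WARC records.

```javascript
const fs = require('fs')
const os = require('os')
const path = require('path')

// Hypothetical helper: append each per-page WARC, in order, to one
// combined file. Works at the byte level and does not parse records.
function concatWarcs (pagePaths, combinedPath) {
  fs.writeFileSync(combinedPath, '') // start from an empty file
  for (const pagePath of pagePaths) {
    fs.appendFileSync(combinedPath, fs.readFileSync(pagePath))
  }
}

// Demo with placeholder bytes standing in for real per-page WARCs.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'warc-'))
const home = path.join(dir, 'home.warc')
const developers = path.join(dir, 'developers.warc')
const combined = path.join(dir, 'combined.warc')
fs.writeFileSync(home, 'HOME-RECORDS\r\n\r\n')
fs.writeFileSync(developers, 'DEV-RECORDS\r\n\r\n')
concatWarcs([home, developers], combined)
```

This only covers the concatenation step; a combined file meant for WR player would still need the single info record listing all the page URLs mentioned earlier in the thread.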

hanoii commented 5 years ago

If you can, here's my source:

https://pastebin.com/61bBUiyg

I'll eventually wrap this better; for now it is a PoC.

I took your RemoteChromeWARCGenerator and RemoteChromeRequestCapturer and changed the network interface to Electron's Debugger, which gave me access to the same events. So it should be basically the same.

The writing of the warc file is as per your example for chrome on the project's page.

I only tried puppeteer for a quick test; I might do a better one next week, but I would have expected it to work.

N0taN3rd commented 5 years ago

Did the electron request capturer and writer not work for you?

hanoii commented 5 years ago

😱 I didn't see them or know they were there! Sorry. From a quick look at the code, it looks like I ended up doing something very similar.

Will try it anyway to see if I get the same Warc.

I am not capturing maybeNetworkMessage though.

This is the warc file I got: warc.zip

It should have both https://www.drupal.org/ and https://www.drupal.org/developers

I see them on the warc file

Will try yours anyway and see what I do. Thanks, might get back properly next week.

N0taN3rd commented 5 years ago

maybeNetworkMessage is a utility function that saves you from having to add an additional message listener to the debugger :smile: As for your shared src code, I cannot tell from it when you are writing to the WARC, and from what I can infer from the discussion here, the timing of that write is likely the reason for your issues.

hanoii commented 5 years ago

I am doing that manually on a context menu, so basically I just wait a reasonable while and trigger it:

    const menuItem2 = new MenuItem({
      label: 'Warc it yo!',
      click: (menuItem, browserWindow, event) => {
        const warcGen = new DebuggerWARCGenerator()
        console.log(cap)
        warcGen.generateWARC(cap, debug, {
          warcOpts: {
            warcPath: 'myWARC.warc'
          },
          winfo: {
            description: 'I created a warc!',
            isPartOf: 'My awesome electron1 collection'
          }
        })
      }
    })

hanoii commented 5 years ago

> Did the electron request capturer and writer not work for you?

I just tried this and got the exact same behavior. Maybe I am missing something about the warc file that is currently beyond me, but I will probably get to it soon. This was mainly to make sure this is a workable solution, which it definitely is.

If there's something you'd like to suggest as a follow-up, or something I can help debug to get to the root of this, let me know; I'd rather have this small thing working.

N0taN3rd commented 5 years ago

You are not adding the pages array, and the warc is not being written to in appending mode.

See the electron generator docs for more details.

Correcting those issues should help you get your desired results :relaxed:
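
A rough sketch of what the corrected options might look like. The option names `pages` and `warcOpts.appending` are assumptions inferred from this thread, not confirmed signatures, so check them against the electron generator docs. `buildGenOpts` is a hypothetical helper that only assembles the object.

```javascript
// Hypothetical helper: builds the options object for a generateWARC call.
// ASSUMPTION: the names `pages` and `warcOpts.appending` are inferred from
// this thread; check the electron generator docs for the exact names.
function buildGenOpts (warcPath, pages) {
  return {
    warcOpts: {
      warcPath,
      appending: true // assumed flag: append to the WARC instead of overwriting
    },
    winfo: {
      description: 'I created a warc!',
      isPartOf: 'My awesome electron1 collection'
    },
    pages // assumed key: every page URL that should be viewable in WR player
  }
}

const genOpts = buildGenOpts('myWARC.warc', [
  'https://www.drupal.org/',
  'https://www.drupal.org/developers'
])
console.log(genOpts)
```

The resulting object would then be passed as the third argument in the menu handler, in place of the inline literal: `warcGen.generateWARC(cap, debug, genOpts)`.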

N0taN3rd commented 5 years ago

See also https://github.com/N0taN3rd/Squidwarc/blob/next/lib/crawler/chrome.js for an example of warc generation