WorldBrain / Memex

Browser extension to curate, annotate, and discuss the most valuable content and ideas on the web. As individuals, teams and communities.
https://worldbrain.io

Upload Test Data Set #37

Closed: blackforestboi closed this issue 7 years ago

blackforestboi commented 7 years ago

As a user I want to export the data from the extension, save it as a JSON (?) file, and import it again. This would make it possible to do testing with larger data sets.

Somewhat related to the work done in #19, at least the import part. Since this feature will have to come anyhow, we should take both use cases into account: manual upload & transfer from the old extension.

poltak commented 7 years ago

The SerDe stuff, to get DB docs to a string and vice-versa, could be done with pouchdb-replication-stream (dump) and pouchdb-load (restore).

The trickier part is getting the dumped string into a file that we can serve to the user to download to their system. I don't think we can just write a file to the extension and then create a link to it. Another option is to programmatically upload it to a website and point the user there to download. That seems like a poor option though, because the file may be very big and it would need a secure solution. Will look around for some way to create a downloadable file from an extension. Maybe it's not so difficult.

Accepting the file to import should be doable fairly easily with a file input and some container code passing its contents to pouchdb-load.

poltak commented 7 years ago

Trickier part is getting the dumped string into a file that we can serve to the user to download to their system

Actually pretty easy. We can create a new URL for an in-memory blob (from the pouchdb-replication-stream output) and use the download attribute on an anchor element, pointing its href at that URL. More info here. Pretty cool. A basic example in my browser works fine.
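For reference, a minimal sketch of that approach (function and file names are illustrative, not the actual extension code):

```js
// Turn an in-memory dump string into a downloadable file via an object URL
// and an anchor's download attribute.
function offerDownload(dumpedString, filename = 'dump.txt') {
  const blob = new Blob([dumpedString], { type: 'text/plain' })
  const url = URL.createObjectURL(blob)

  const anchor = document.createElement('a')
  anchor.href = url
  anchor.download = filename // hint the browser to download rather than navigate
  anchor.click()

  URL.revokeObjectURL(url)   // free the in-memory blob once the download is triggered
}
```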

blackforestboi commented 7 years ago

From a UI POV, we could add this as a separate module in the settings with the title "Backup and Restore Database"

bohrium272 commented 7 years ago

A bit more tedious but maybe we can upload it to the user's Google Drive and give that URL to the user. It may be undesirable for large databases though. We could have it as an option maybe.

blackforestboi commented 7 years ago

Great idea and feature upgrade!

poltak commented 7 years ago

Good bit of initial progress on looking into this further today. Made up a little prototype to confirm my understanding of the pouch libs + the File/FileReader APIs. Never really messed with files much in front-end JS. Mostly working as expected; able to dump pouch to a downloadable text file, and also upload a correctly formatted text file and read the contents to restore from a dump. Can't get the pouchdb-load package to read it properly however; probably something I'm doing wrong. Want to try a few more things, else I'll raise an issue upstream. Will be able to use the basic code from the prototype in the real thing, probably in a redux thunk or RxJS epic (if worth the effort).

Plans for main functionality:

Dump:

  1. have user trigger an action that uses pouchdb-replication-stream to form a string of the dump in memory (needs UI loading state; see the sketch after this list)
  2. create an object URL (URL.createObjectURL()) from a File object constructed on the dump string
  3. set up an <a download="dump.txt"> with href pointing to the object URL, so the user can download the dump file to their machine
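A rough sketch of step 1, following the pouchdb-replication-stream README's browser setup (memorystream provides the in-memory stream; the exact wiring in the extension may differ):

```js
import PouchDB from 'pouchdb'
import replicationStream from 'pouchdb-replication-stream'
import MemoryStream from 'memorystream'

PouchDB.plugin(replicationStream.plugin)
PouchDB.adapter('writableStream', replicationStream.adapters.writableStream)

// Dump the whole DB through an in-memory stream, collecting it as one string.
async function dumpToString(db) {
  let dumpedString = ''
  const stream = new MemoryStream()
  stream.on('data', chunk => { dumpedString += chunk.toString() })
  await db.dump(stream)
  return dumpedString // NDJSON dump, ready for steps 2-3 above
}
```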

Restore:

  1. have user upload a dump file via an <input type="file">
  2. some form of validation on uploaded file
  3. read contents into memory via FileReader.readAsText()
  4. pass to pouchdb-load to restore from the dump into pouch (sketched below)
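And a sketch of the restore side. One hedge: if the installed pouchdb-load only accepts URLs rather than raw dump strings, an object URL for the uploaded file could be passed to db.load() instead.

```js
import PouchDB from 'pouchdb'
import load from 'pouchdb-load'

PouchDB.plugin(load)

// Read the uploaded dump file and hand its contents to pouchdb-load.
// `fileInput` is assumed to be the <input type="file"> element.
function restoreFromFile(db, fileInput) {
  const [file] = fileInput.files
  const reader = new FileReader()
  reader.onload = () =>
    db.load(reader.result)
      .then(() => console.log('restore complete'))
      .catch(err => console.error('restore failed', err))
  reader.readAsText(file)
}
```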

Other notes/ideas:

blackforestboi commented 7 years ago

string of the dump in memory

Any way to avoid this? If the DB is around 1GB, this will lead to the extension crashing. Maybe dumping to localStorage?

poltak commented 7 years ago

Any way to avoid this?

Apart from writing (via node-like stream) to an actual file on the filesystem (as opposed to in-memory file), or having some sort of streamed download, no. Writing to the actual filesystem isn't possible in standard frontend JS, from what I understand.

Possibly there's a webextension-specific way to interact with a file. Even if it means we have an extra file in the extension just to write dumps to and allow the user to download, it should work, although it's messy. Will look into these options more today, as yes, it will quickly get out of hand with bigger data.

RE local storage: there is a limit of ~5MB IIRC. Even with unlimited storage, the file will still need to be created in memory at some point for the user to download.

EDIT: this looks promising
EDIT2: nope, seems to be Chrome apps only

poltak commented 7 years ago

Another alternative could be dumping to many files. pouchdb-replication-stream dumps the DB contents to a node-like stream, meaning the dump comes through batched in "chunks". In node, for example, you would append to a file on disk each time the stream signals it has data and then forget about it, so that what you are reading doesn't all collect in memory at once.

We only have memory to work with in browser JS, so instead of writing progress to a single file, we could create an in-memory file for each "chunk" read from the DB (as opposed to collecting all the chunks and making a single big in-memory file). Then use the Chrome downloads API to auto download a single in-memory file each time the stream signals it has data.

Played with this in the prototype, and TBH, while it solves the memory issue, it doesn't provide a very nice UX. Just with 88 bookmark docs imported (+ associated page and visit docs), using ~390KB-5MB (depending on how many page docs have been filled out during imports), it needs 18 separate files. Using a larger set of data (18k+ browser history stubs + all associated data), you get this kind of fun.

Could maybe batch it up further, so say after collecting ~15MB of data (arbitrary choice) from the stream, make an in-memory file, call the Downloads API, then forget about it. Still leaves the user with their DB dump spread across multiple files though, and we'd need to find a way to create an in-order stream from the dump files for the restore process and feed that back into pouch, else the memory problem still exists for restore... will play with these ideas a bit.
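A sketch of that batching idea (assumes the "downloads" permission; the names and the 15MB threshold are illustrative):

```js
const BATCH_SIZE = 15 * 1024 * 1024 // ~15MB per dump file (arbitrary choice)

// Hand one collected batch to the downloads API as its own file.
function downloadBatch(text, seqNumber) {
  const blob = new Blob([text], { type: 'text/plain' })
  chrome.downloads.download({
    url: URL.createObjectURL(blob),
    filename: `worldbrain-dump.${Date.now()}.${seqNumber}.txt`,
    saveAs: false,
  })
}

// Collect stream chunks until the threshold is hit, flush, then forget them.
function handleDumpStream(stream) {
  let buffer = ''
  let seq = 0
  stream.on('data', chunk => {
    buffer += chunk.toString()
    if (buffer.length >= BATCH_SIZE) {
      downloadBatch(buffer, seq++)
      buffer = ''
    }
  })
  stream.on('end', () => buffer.length && downloadBatch(buffer, seq))
}
```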

blackforestboi commented 7 years ago

@bigbluehat you may have some ideas how to solve this? Thanks for the input!

poltak commented 7 years ago

Bit of an update on the restoring process I was having trouble with: after spending more time looking into it, it seems to be a bug with pouchdb-replication-stream handling certain types of attachments in the browser. In the dump process, it dumps them to base64 strings, but in the restore process it tries to parse them as Blobs but gets a TypeError, as they aren't Blobs. Acknowledged as a bug upstream in this issue and here's my issue for reference. For a bit more reference: Pouch docs for attachment data-types.

Bit disappointing as there don't seem to be any other dump APIs for pouch, and just writing the DB docs to a file isn't recommended for certain reasons relating to how pouch stores data. My proof-of-concept works fine when manually excluding attachments from our docs, but then we lose attachments (favicon, screenshots, freezedry).

Having a bit of a read through and play with the upstream pouchdb-replication-stream code to try to see if it's simple to change the storage type for attachments. Seems to be very pouch-centric, i.e., it may take some time to understand what's going on.

Put the proof-of-concept code up on the feature/backups branch for now, for my own development ease (and we'll probably build off this).

Regarding the actual backup process, I think streaming the DB dump into multiple dump files is the way to go. Without a server, I'm pretty sure this is the only way to do it without loading every single thing into memory. We're very limited with filesystem-related tasks in frontend JS. My restore logic seems to work fine with reading and restoring from multiple files, apart from the attachment issue, which is great.

A little disappointing, but unavoidable, thing with this is the time it takes to actually dump: testing with 59705 docs in pouch (mostly page stubs, related visits, and some bookmarks), it dumps the entirety of ~17MB of data in about 2:10 mins. For much bigger data sets, say if those pages are filled out and it's closer to 1GB, it will take some time indeed.

blackforestboi commented 7 years ago

Ah bummer! Checking PouchDB's issues, there are multiple relating to the same limitation... seems to be an issue, eh? :) I recently found this DB, which seems to be a further development of pouch: https://github.com/pubkey/rxdb

May be worth a look. (I like the encryption features as well, as they could be a good addition to our plans of replicating the DB to our servers for hosting purposes.)

Regarding the file sizes:

So what are the junk sizes for a 1GB DB?

In the download process, we could suggest the user takes a new folder as a target for the backup, and then when uploading again, they pick the folder with all the blocks and we do the rest. This way it is still just picking one element instead of all the blocks?

poltak commented 7 years ago

I recently found this DB here, which seems to be a further development from pouch: https://github.com/pubkey/rxdb

Yes, this one did look interesting. I'll spend a bit more time looking at it, as we do seem to have quite a lot of pouch-related pain. It's a non-trivial change though, and may complicate things with upstream. Depends on the ratio of work needed to the benefits we get, for right now I suppose.

So what are the junk sizes for a 1GB DB?

If "junk" means "dump", I'm pretty sure it should be roughly 1GB. The stored size shouldn't be much different to the dump without compression, as it's a full replication dump of the DB.

we could suggest the user takes a new folder as a target for the backup

Yeah, that's what I've got the code doing now. It dumps to a worldbrain-backup/ dir in the user's system Downloads directory, and dump files are named like worldbrain-dump.${timestamp}.${seqNumber}.txt. I'm pretty sure we can make the file input able to select a directory as well (currently it selects multiple files).
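For reference, a sketch of those two pieces (the function and its arguments are hypothetical; webkitdirectory is a non-standard but widely supported attribute):

```js
// Dump files land in a subdirectory of the user's Downloads folder: the downloads
// API treats `filename` as a path relative to the Downloads directory.
function downloadToBackupDir(objectUrl, seqNumber) {
  chrome.downloads.download({
    url: objectUrl, // object URL for one dump batch
    filename: `worldbrain-backup/worldbrain-dump.${Date.now()}.${seqNumber}.txt`,
  })
}

// Letting the restore input select a whole directory instead of multiple files.
const input = document.querySelector('input[type="file"]')
input.setAttribute('webkitdirectory', '') // user picks a folder; input.files lists its contents
```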

blackforestboi commented 7 years ago

With "junks" I meant the actual batch sizes, not the dump as a whole, sorry for the confusion. It seems like this question is solved, though, with what you describe in the second part of your answer.

Regarding RxDB:

What are the challenges with such a switch? Would the attachment download work? If yes, @treora may also be interested in hearing about this, or rather about a possible switch to it, as he has also played with the idea of making a local backup. If that is currently not possible with pouch, am I correct in the observation that we only have 2 options: making a PR that would support it in pouch, or switching to RxDB?

poltak commented 7 years ago

With the batch sizes, or dump file sizes: they can be whatever size we want. Basically however much we want to be holding in memory at any given time during the backup/restore process. Maybe 50MB or something (so 21 or so files for 1GB)?

Would the attachment download work?

The RxDB docs and issues don't seem to mention how they handle pouch attachments, or anything regarding storage of binary data. They may just let pouch handle it at a lower level if you add an _attachments field to a doc, which pouch also does with PouchDB.put(). Would have to try.

I'm more concerned about what's available for dumping RxDB, as this was why it was brought up here. They have a single db.dump() method returning a Promise, meaning they don't afford any way to stream the data: you're essentially reading the entire contents into memory when you dump (and restore). A bit of Googling around doesn't turn up any other options.

am I correct in the observation that we only have 2 options?

Yeah, those, or making our own dump functionality for Pouch. I'd be more inclined to look at how difficult it would be to fix this upstream in pouchdb-replication-stream, since there's less re-inventing the wheel there.

blackforestboi commented 7 years ago

I'd be more inclined to look at how difficult it would be to fix this upstream in pouchdb-replication-stream, since there's less re-inventing the wheel there.

OK, how about we first get everything running without attachments, then look at how to fix it upstream? @treora might be interested in collaborating on that fix, as it might touch his plans as well.

Regarding a future decision to use RxDB:

What clear advantages do you currently see with it? What I see as relevant for later is:

poltak commented 7 years ago

how about we first get everything running without attachments

That's not easy to do either. The dump doesn't afford omitting/projecting out data like that, only filtering docs.

That means the only way right now would be parsing the dump as it comes in on the stream (in newline-delimited JSON), manually removing the attachments, and then serialising it again to NDJSON for file creation. There can be a lot of data and it is already quite slow. This would complicate the process more and result in a lossy backup feature (but maybe better than none?). We can either do it like that for now, or postpone this feature for a bit until we see if we can fix it upstream.
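A hypothetical sketch of what that stripping could look like, assuming the dump's NDJSON lines carry docs under a docs array as in the pouchdb-replication-stream format (in practice, partial lines at chunk boundaries would also need buffering):

```js
// Remove _attachments from every doc in a chunk of NDJSON dump data.
function stripAttachments(ndjsonChunk) {
  return ndjsonChunk
    .split('\n')
    .filter(line => line.trim().length)
    .map(line => {
      const parsed = JSON.parse(line)
      if (Array.isArray(parsed.docs)) {
        parsed.docs.forEach(doc => { delete doc._attachments })
      }
      return JSON.stringify(parsed)
    })
    .join('\n') + '\n'
}
```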

RE the RxDB discussion: I think it's better if you make another issue for that, and everyone can discuss it more there.

blackforestboi commented 7 years ago

As discussed, for now we'll reduce the feature to import only, so we can test better with large data sets. Changing the title accordingly to: "Upload Test Data Set"

poltak commented 7 years ago

The idea with this now, as just an upload for a test data set, is to reuse the upload button logic to allow uploading a number of generated dump files. Then have a separate node script to generate the dump file/s using a data generator package, like faker, to fill out certain attributes, like fullText, title, url. I think the script can just generate those objects, then I can have it create visit, bookmark, and page docs based off those with the inter-document references.

The script should output newline-delimited JSON dump files, in the PouchDB replication format that is given here.
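A hypothetical sketch of that kind of generation (the doc shapes and ID formats here are illustrative guesses, not the extension's actual model):

```js
const faker = require('faker')

// Generate one fake page doc plus associated visit docs and maybe a bookmark doc,
// wired together with inter-document references.
function generatePageData() {
  const timestamp = faker.date.past().getTime()

  const page = {
    _id: `page/${timestamp}`, // illustrative ID format
    url: faker.internet.url(),
    title: faker.lorem.sentence(),
    fullText: faker.lorem.paragraphs(3),
  }

  const visits = Array.from({ length: 1 + Math.floor(Math.random() * 5) }, (_, i) => ({
    _id: `visit/${timestamp + i}`,
    page: page._id,            // reference back to the page doc
    visitStart: timestamp + i,
  }))

  const bookmark = Math.random() < 0.3
    ? { _id: `bookmark/${timestamp}`, page: page._id, dateAdded: timestamp }
    : null

  return { page, visits, bookmark }
}
```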

poltak commented 7 years ago

After playing with a little prototype script, it's become obvious that if we want to produce pouch-replication-formatted dump files, we'll need the Pouch revisions (_rev field), which don't seem to be easy to fake as they are handled by Pouch automatically on insert/update ops. Instead, as this is just for our own sake, we could simplify the dump format to just an array of docs and insert them using db.bulkDocs() rather than the replication stream.

The script is here for now: https://github.com/poltak/worldbrain-data-generator Very simple: it just generates random data for our data model (page doc, random number of associated visit docs, and a chance to produce an associated bookmark doc), then should generate dump file/s for a specified number of pages (in progress).
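A sketch of the simplified restore path under that assumption (each dump file is just a JSON array of docs; names are illustrative):

```js
// Insert an array of docs straight into the DB; Pouch assigns fresh _rev values.
async function importDumpFile(db, fileText) {
  const docs = JSON.parse(fileText)
  const results = await db.bulkDocs(docs)
  const failed = results.filter(res => res.error)
  if (failed.length) {
    console.warn(`${failed.length} docs failed to import`, failed)
  }
}
```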

blackforestboi commented 7 years ago

Then have a separate node script to generate the dump file/s using a data generator package, like faker

Why don't we use the kaggle data set, as we have real text data there then? The only thing to fake would be the visit times then? Also, if we use such a standard test set, it's easier for others to use later. Not only for having a standardised set, but also to test the workflow of actually choosing a file to upload.

Probably we'd first have to upload it, so everything is put into the format of page objects and visits, and then dump it, so it looks like a real backup?

poltak commented 7 years ago

Why don't we use the kaggle data set, as we have real text data there then?

I decided to just generate the data, as it simplifies things by not needing to input, parse, and convert data. The real text data shouldn't really be an issue. However, I think maybe I'll just use it and make the script a converter, as then the script will have a defined input format. Then we can put whatever test data we want into that format, or even generate it, as long as it conforms to it.

Here's the set for reference: https://www.kaggle.com/patjob/articlescrape So the input would be a CSV with cols body, title, last_crawl_date, url. Other content our model has is description, keywords, canonicalUrl; however, for the purposes of this, I think that data can just be faked or derived. Visit times can be faked based on the last_crawl_date. There won't be any attachments (screenshot, favicons, freezedry) for now.
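A hypothetical sketch of the per-row conversion described here (doc shapes and ID formats are again illustrative guesses):

```js
// Map one CSV row from the kaggle set onto page/visit docs, faking/deriving
// the fields the CSV doesn't provide.
function convertRow(row) {
  const lastCrawl = new Date(row.last_crawl_date).getTime()

  const page = {
    _id: `page/${lastCrawl}`, // illustrative ID format
    url: row.url,
    title: row.title,
    fullText: row.body,
    canonicalUrl: row.url,    // derived rather than real
    description: '',          // not in the CSV; left blank or faked
    keywords: [],
  }

  // Fake a visit based on the crawl date.
  const visit = {
    _id: `visit/${lastCrawl}`,
    page: page._id,
    visitStart: lastCrawl,
  }

  return [page, visit]
}
```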

Probably we'd first have to upload it, so everything is put into the format of page objects and visits, and then dump it, so it looks like a real backup?

A real backup, IMO, should include PouchDB-specific metadata (revisions, versioning data detailing how documents have changed over time), as well as the docs data. However, this depends on how you consider a "backup" in the big scope of the project; maybe just the docs data provides the same outcome for us. So tech-wise, this will just be an importer for docs data (gotten from the conversion script, or wherever), but user/us-wise, it imports data for the extension.

poltak commented 7 years ago

Alright, the converter script is in a usable state. It can convert that kaggle dataset (and any other CSV data with title, body, and url columns) into docs compatible with our model, and I updated the restore code on feature/backups to be able to ingest this into our DB.

The converter script can be downloaded from npm via npm install -g worldbrain-data-converter and run on the command line like worldbrain-data-converter -i /path/to/input.csv -o path/to/output.txt. So download that dataset off kaggle, if you want to try it. I recommend splitting the output files, as our extension is still limited to the browser, hence has to read an entire file into memory at a time. More info and usage examples (including the splitting) are in the npm package docs: https://www.npmjs.com/package/worldbrain-data-converter

@oliversauter Regarding the UI for uploading a test dataset: It's essentially just a file input. The user clicks, selects the data files, and it parses them and adds them to the DB. I think maybe we should just have a "Developer mode" checkbox on the imports page, which, when checked, shows that input. Seems common in some other extensions I've seen. What do you think?

Regarding the script: this may be of use to us in the future if we have a lot of data stored in CSVs that we want to bring into the extension. There are some additional options to do some experimental stuff, like setting the isStub flag on the converted page docs so that they can actually be filled out properly in a later import process (so a use-case could be for real data, if we have an option to disable generation of missing data fields and generated visits). At the moment it's strictly for test data, since it fakes a lot of important stuff, like visits, but we could revisit it later for other uses.

blackforestboi commented 7 years ago

I think maybe we should just have a "Developer mode" checkbox on the imports page, which, when checked, shows that input.

Good idea! Let's do it like that.

poltak commented 7 years ago

Alright, a simple UI for it is now on imports under a dev mode checkbox in feature/backups (the branch name is no longer relevant). The UI is very basic, given it's for internal use: an input + a loading spinner while Pouch does its thing. No file validation. Seems to be working nicely for the purposes of bringing in test data from that script.
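For reference, a rough, framework-agnostic approximation of that UI (element IDs and the restore handler are hypothetical, not the actual extension code):

```js
const devModeCheckbox = document.querySelector('#dev-mode')
const fileInput = document.querySelector('#test-data-input')
const spinner = document.querySelector('#import-spinner')

// Only show the file input when dev mode is checked.
devModeCheckbox.addEventListener('change', () => {
  fileInput.hidden = !devModeCheckbox.checked
})

// Show a loading state while Pouch ingests the selected dump files.
fileInput.addEventListener('change', async () => {
  spinner.hidden = false
  try {
    await restoreFromFiles(Array.from(fileInput.files)) // hypothetical restore handler
  } finally {
    spinner.hidden = true
  }
})
```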