akhoury / nodebb-plugin-import

migrate your old crappy forum to NodeBB
MIT License

ability to resume without flushing #140

Closed. akhoury closed this issue 8 years ago.

akhoury commented 8 years ago

Technically, it's already supported (well, not really; I need to skip the flushing). The thing is that, now, the importer will check each record to see if it was already imported before attempting to import it again, which is much faster than the actual import but slower than knowing for a fact that it was imported (or just being told to skip over it anyway).

I am thinking 3 buttons instead of the current 2:

pseudocode here


<!-- already exists -->
if  (import_process_was_interrupted) {
   <button id="resume">Looks like the import was interrupted, resume</button> 
}

 <!-- already exists -->
<button id="flush-start">Flush NodeBB DB and start import</button>

<!--  to be added  -->
<button id="noflush-start">Don't Flush NodeBB, import what hasn't been imported yet</button> 

The only difference between #resume and #noflush-start would be that the latter would check ALL records and import whatever hasn't been imported yet, whereas #resume is aware of the order and of which "phases" were already done.
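
To make that concrete, here's a minimal sketch of the per-record check that #noflush-start would still do. The names here (alreadyImported, importOne) are made up for illustration; they are not the plugin's actual internals.

// sketch only: "alreadyImported" stands in for whatever the importer persists
// about previous runs; importOne() is a hypothetical placeholder
var alreadyImported = { post: { 1: true, 2: true } };

function importOne(type, record, done) {
    // the real import work would happen here
    alreadyImported[type] = alreadyImported[type] || {};
    alreadyImported[type][record.id] = true;
    done();
}

function noFlushImport(type, records, done) {
    var i = 0;
    (function next() {
        if (i >= records.length) return done();
        var record = records[i++];
        if (alreadyImported[type] && alreadyImported[type][record.id]) {
            return next(); // already imported, skip it
        }
        importOne(type, record, next);
    })();
}

// #resume, by contrast, knows which phases/batches already completed
// and skips them wholesale instead of checking every single record.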

@BenLubar If you want more control, we can either use the existing getPaginated[RECORD](start, limit, callback),

where START == offset + current, and offset would be something you specify in the config. The thing is that you're gonna have to specify this for each type of record, kinda like

offsets: {
   categories: 0,
   users: 100000,
   topics: 40000,
   posts: 150000,
   votes: 12345,
   groups: 4567,
   bookmarks: 0
}

This would be easy to implement, but hard to use, IMO.
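
For illustration, here is a rough sketch of how an exporter might apply such an offsets hash inside its existing paginated getters. The config plumbing shown here is assumed, not the plugin's current API.

// sketch only: assumes the exporter receives the offsets hash via its config
var config = {
    offsets: {
        categories: 0,
        users: 100000,
        topics: 40000,
        posts: 150000
    }
};

// existing signature: getPaginatedPosts(start, limit, callback)
function getPaginatedPosts(start, limit, callback) {
    var offset = (config.offsets && config.offsets.posts) || 0;
    var realStart = offset + start; // shift the window past what was already imported
    // ... query the source forum with realStart/limit, then call back with the records ...
    callback(null, { posts: {}, start: realStart, limit: limit });
}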

preferred way

Another way we can do this is to use a timestamp range... something like

timerange: {
    start: 123456789000,
    end: null // means all the way to the last record.
}

But then your exporter plugin is going to need to use these timestamps in its queries, via the config.custom hash; you'd enter something like

   {"timerange": { "start": 123456789000} }

in the Exporter Config Custom field (yes, you type it in as JSON, I know, don't ask).

Then you click the (to-be-added) import-without-flush button. This way the importer doesn't have to check every record from record 0 to see whether it was imported; it only processes what the exporter queries return.
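
On the exporter side, that could look roughly like the sketch below. config.custom is real, but the query builder and the table/column names are just an illustration, assuming a Discourse-style posts table.

// sketch only: builds a WHERE clause from config.custom.timerange;
// not a tested query against any particular schema
function buildPostsQuery(custom) {
    var range = (custom && custom.timerange) || {};
    var where = [];
    var params = [];
    if (range.start) {
        params.push(new Date(range.start));
        where.push('created_at >= $' + params.length);
    }
    if (range.end) {
        params.push(new Date(range.end));
        where.push('created_at <= $' + params.length);
    }
    var sql = 'SELECT * FROM posts'
        + (where.length ? ' WHERE ' + where.join(' AND ') : '')
        + ' ORDER BY id';
    return { sql: sql, params: params };
}

// e.g. with {"timerange": {"start": 123456789000}} typed into Exporter Config Custom:
// buildPostsQuery({ timerange: { start: 123456789000 } })
//   => { sql: 'SELECT * FROM posts WHERE created_at >= $1 ORDER BY id', params: [Date] }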

bdharrington7 commented 8 years ago

IMHO supporting no-flush can become a nightmare. The data isn't necessarily linear unless you can guarantee said timestamp will contain things like...

This is a pretty big ask, and this plugin will be much more maintainable if it's kept as something that facilitates a singular event.

akhoury commented 8 years ago

You're right,

We could do an object diff instead of just checking for existence: for example, check the _imported_FIELDs after each recoverImported[RECORD]; if it's not imported, import it, and if it changed, edit in the diff.

But you're right, it's going to be a pain in the butt, because each changed field might mean different behavior: changed content on a post is easy to edit, but a topic moved under a new _cid means we need to recoverImportedCategory first to find the new cid.

Maybe at first we just assume that previously-imported data is frozen at the source forum. Hopefully, we can also assume that there's only a short period of time between the two imports.
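
Purely to illustrate the diff idea; how the previously imported copy is actually persisted and looked up (the _imported_* fields) is hand-waved here.

// sketch only: compare a fresh source record against the copy stored by the previous run
function diffImported(sourceRecord, previouslyImportedCopy) {
    if (!previouslyImportedCopy) {
        return { action: 'import' }; // never imported: import it
    }
    var changed = {};
    Object.keys(sourceRecord).forEach(function (key) {
        if (sourceRecord[key] !== previouslyImportedCopy[key]) {
            changed[key] = sourceRecord[key];
        }
    });
    return Object.keys(changed).length
        ? { action: 'edit', fields: changed } // changed at the source: edit just the diff
        : { action: 'skip' };                 // identical: nothing to do
}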

bdharrington7 commented 8 years ago

I dunno it feels like this is one of those "nice to have" things with not a lot of benefit. If it's an interesting puzzle you'd like to solve though that's a different story!


akhoury commented 8 years ago

Well, the thing is, some really, really large forums could take 10, 20 hours or more to import, and with the amount of traffic they get, it's not always a good idea to put them in "maintenance mode" for that long.

bdharrington7 commented 8 years ago

Ah, that's a use case I hadn't considered. I've only been working with a couple thousand posts.


pauljherring commented 8 years ago

For our use case (17 hours in a test import, on bare-metal hardware over a local LAN, not virtualised stuff like AWS or whatever over a WAN), and from what I recall from the import script, a simple (hah!) "remember the last imported record of each type, and then continue from the next one in each series" might be sufficient (again, hah!).

Basically what we'd like to do when we're ready to move after all the test imports we do for this is:

  1. backup the existing forum.
  2. import that (17 hrs or whatever) and customise stuff that needs customising that couldn't be done before initial import
  3. during the 17 hrs, use existing forum (more users, more topics, more posts to existing topics, more actions, more (other stuff) )
  4. freeze the existing forum and final backup
  5. import the 'new stuff'
  6. continue with new forum as if nothing happened.

The theory being that the proposed functionality would save us 17 hours.

Not included here (because it's out of scope) is the backup and restoration of site assets outside of the DB (local images referenced in posts, for example), though [thinking aloud] they could probably go through a similar procedure - the most basic being something like rsync, if the original assets can't be used directly to begin with. That's part of our problem, I think, not yours, though.

akhoury commented 8 years ago

@pauljherring what about the concerns that @bdharrington7 mentioned? For example, say that during step 3 (during the 17 hours) a post's or topic's content gets edited, a topic gets moved to another category, etc...

pauljherring commented 8 years ago

Inform users that the board is in a "transfer stage for the next 24 hours, if all goes well" and that such things may go astray, so they shouldn't expect to see them later. In the instances where it really matters (e.g. where PII removal is involved), the mods will be aware of them (and of the transfer) and can check the end result for such things.

There is only so much that can (or, in the specific instances you mention, should) be accommodated in such circumstances.

Coding against such things usually either wastes effort (on code that never actually gets used) in the worst case, or saves very little content (i.e. nothing of value retained) in the best. I may have best and worst swapped there, depending on your point of view.

akhoury commented 8 years ago

fair enough

pauljherring commented 8 years ago

I'm of the opinion that Pareto holds: sorting out the 90% automatically and using your mods to sort out the remaining 10% is acceptable... (or whatever ratio it ends up being; it's never 80/20).

Just be sure that you (or rather the people using the importer) know what the 10% will be in the situation.

Otherwise, to accommodate it, you'll spend the other 190% of the time coding against that 10%... (wasted effort for no real value).

It is, IMHO, an acceptable compromise when moving forum software while trying to keep the ethos of the board being moved between two different platforms. Keep the major stuff; the minor stuff, while nice, isn't really that important. If it can be easily done, do it. Otherwise, ditch it (if irrelevant) or make it clear what won't be happening (if possibly relevant/peculiar/obscure).

Bear in mind that most of a moving community will (I'd hope) understand that they'd be losing some (more visible) features while gaining others, and their attitude to losing some edits/moves/whatever particular sub-feature the old board had will, for the vast majority, be...

meh.

But then I come from (or am rather speaking on behalf of) a community that will be rather antagonistic to switching - and even they (if informed) would be understanding of such 'flaws' in the import procedure.

akhoury commented 8 years ago

So, the green button below will now skip the "flush step" (whereas the orange one will flush first). However, it will still check each record that comes in from the exporter to see whether it was already imported or not.

screenshot

So @pauljherring, what I would do is the following (and I am quoting your steps from above):

  1. backup the existing forum.
  2. [edited] import that (17 hrs or whatever) using the orange button "Flush NodeBB DB And Import" and customize stuff that needs customizing that couldn't be done before initial import
  3. [new] record the last import record of each "type", i.e. LAST_KNOWN_IMPORTED_POST_ID OR you can record the timestamp range, or the offset for the limit, as I explained in the original post
  4. during the 17 hrs, use existing forum (more users, more topics, more posts to existing topics, more actions, more (other stuff) )
  5. freeze the existing forum and final backup
  6. [edited] import the 'new stuff' using the green button, "Don't Flush NodeBB DB, Just Import"
    • However, right before this step, @BenLubar will need to make some changes to his import-discourse module to SELECT the new records only, something like SELECT .... FROM posts WHERE post_id > LAST_KNOWN_IMPORTED_POST_ID (see the sketch right after these steps), or you can choose to make it configurable, kind of like an offset config for each type of record.
    • If you find yourself deciding that, because fuck it, you're just gonna hardcode these values in that module at path/to/NodeBB/node_modules/nodebb-plugin-import-discourse/index.js and change the queries for that one-time use, then remember to check the checkbox that says "Skip the module install" (it's for development purposes, see this screenshot), otherwise nodebb-plugin-import will force-reinstall the module.
  7. continue with new forum as if nothing happened.
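
For the query tweak mentioned in step 6, here is a hedged sketch of what the change in nodebb-plugin-import-discourse might look like. This is not the module's actual code; LAST_KNOWN_IMPORTED_POST_ID is whatever value you recorded in step 3.

// sketch only: restrict the export to records created after the first import run
var LAST_KNOWN_IMPORTED_POST_ID = 615980; // placeholder; use the id you recorded in step 3

function buildNewPostsQuery(start, limit) {
    return 'SELECT * FROM posts'
        + ' WHERE id > ' + LAST_KNOWN_IMPORTED_POST_ID
        + ' ORDER BY id'
        + ' LIMIT ' + limit + ' OFFSET ' + start;
}
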
akhoury commented 8 years ago

@pauljherring, what's the bottleneck that's taking 17 hours?

[screenshot: record counts]

I don't have a database with that many users or votes to test on; I could build one, but I'm lazy.

I just tested 30k topics, 300k posts, 10k users on my Mac mini (i5, 16GB RAM, 256GB SSD, MongoDB).

30k topics took 5 minutes; 300k posts took about 1 hour (you have 600k, so let's say 2 hours there); 10k users took 2 minutes (so, if I estimated correctly, 140k users should take 14*2 = 28 minutes).

Something is growing exponentially, could be a memory leak or something.

Any chance I can get a copy of that large DB dump you've got? (Please obfuscate emails, passwords and other sensitive info in it.)

pauljherring commented 8 years ago

Unfortunately, the DB also has in it private messages (since they're simply another form of topic).... sanitising it for 'public' consumption isn't a quick or easy thing to do, sadly. (You're not the first to ask for such.)

Could the size of the posts be an issue? Recent DB backup:

sockbot@work:~$ gzip -l /home/sockbot/SockBot/backups/what-the-daily-wtf-2015-09-24-035955.tar.gz 
         compressed        uncompressed  ratio uncompressed_name
          768738095          3246448640  76.3% /home/sockbot/SockBot/backups/what-the-daily-wtf-2015-09-24-035955.tar
sockbot@work:~$ 
akhoury commented 8 years ago

Understood, no worries.

akhoury commented 8 years ago

Size as in the content length of each post? That shouldn't be an issue unless we're hitting memory limits. Other than that, the importer does not parse or do any string operations on post.content.

akhoury commented 8 years ago

I mean, the importer will fetch 500,000 records at a time from the Discourse DB, load them in memory, then import them 10 at a time into NodeBB. These magic numbers are based off of nothing, so... I should probably make them part of the config too.
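
If those batch sizes did become configurable, it might look something like this (hypothetical keys, just to show the intent, not options the importer currently reads):

// sketch only: hypothetical config keys
var importConfig = {
    exporterBatchSize: 500000, // records fetched from the source DB per query
    importerConcurrency: 10    // records written into NodeBB at a time
};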

pauljherring commented 8 years ago

Incidentally, the numbers on that screencap aren't entirely accurate, especially wrt 'topics' (which as previously stated includes private messages, but will also include 'hidden' topics and 'deleted' ones):

postgres@what:~$ psql -d discourse -c "select count(*) from users"
 count  
--------
 141130
(1 row)

postgres@what:~$ psql -d discourse -c "select count(*) from topics"
 count 
-------
 19967
(1 row)

postgres@what:~$ psql -d discourse -c "select count(*) from posts"
 count  
--------
 615980
(1 row)

postgres@what:~$ 
akhoury commented 8 years ago

And 2.5M votes, right? I'm trying to build a sample database with a similar number of records for testing - I mean, 18 hours seems like a lot.

I just imported 223k users, 60k topics, 215k posts, and 52k private messages in less than 1.5 hours on my MacBook. I'm gonna triple the number of posts, create 2.5M votes, and test again.

pauljherring commented 8 years ago

and 2.5M votes? right?

likes in discourse?:

discourse=# select action_type, count(*) from user_actions group by action_type order by action_type;
 action_type |  count  
-------------+---------
           1 | 2482528
           2 | 2482516
           3 |    5653
           4 |   12900
           5 |  568374
           6 |  400515
           7 |   34175
           9 |   24733
          11 |    3512
          12 |   35879
          13 |   91843
(11 rows)
discourse=# select id, name_key from post_action_types;
 id |     name_key      
----+-------------------
  1 | bookmark
  2 | like
  3 | off_topic
  4 | inappropriate
  5 | vote
  8 | spam
  6 | notify_user
  7 | notify_moderators
(8 rows)

No idea what types 9-13 are...

akhoury commented 8 years ago

OK so, just tested: MacBook Pro, i7, 16GB RAM, SSD, Mongo 3.0 (Redis would be a bit faster).

Total: almost exactly 11 hours. Seems like votes are the bottleneck here; let me see what I can do there.