IMHO, supporting no-flush can become a nightmare. The data isn't necessarily linear unless you can guarantee said timestamp will capture things like content edits or topics moved to other categories.
This is a pretty big ask, and this plugin will be much more maintainable if it's kept as something that facilitates a singular event.
You're right, we could do an object diff instead of just checking for existence: for example, checking the _imported_FIELDs after each recoverImported[RECORD]; if the record isn't imported, import it, and if it changed, edit in the diff.
But you're right, it's going to be a pain in the butt, because each changed field might mean different behavior: changed content on a post is easy to edit, but a topic moved under a new _cid means that we need to recoverImportedCategory first to find the new cid.
Maybe at first we just make the assumption that previously-imported data is frozen at the source forum. Hopefully, we can also assume that there's only a short period of time between the two imports.
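A rough sketch of that diff pass, for posts only. The helper names (importPost, patchPost), the _pid key, and the exact signatures are made up for illustration; only the recoverImported[RECORD] call and the _imported_ field convention come from the discussion above.

```js
// Hypothetical no-flush pass over a single source post (helper names are made up).
function reimportPost(sourcePost, callback) {
  recoverImportedPost(sourcePost._pid, function (err, imported) {
    if (err) return callback(err);

    // never imported before: import it as usual
    if (!imported) return importPost(sourcePost, callback);

    // already imported: diff each source field against its _imported_ copy
    var changes = {};
    Object.keys(sourcePost).forEach(function (field) {
      if (imported['_imported_' + field] !== sourcePost[field]) {
        changes[field] = sourcePost[field];
      }
    });

    if (!Object.keys(changes).length) return callback(); // nothing changed

    // content changes are a simple edit; a changed _cid would first need
    // recoverImportedCategory() to resolve the new cid (not handled here)
    patchPost(imported, changes, callback);
  });
}
```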
I dunno, it feels like this is one of those "nice to have" things with not a lot of benefit. If it's an interesting puzzle you'd like to solve, though, that's a different story!
Well, the thing is, some really, really large forums could take 10, 20 hours or more to import, and with the amount of traffic these get, it's not always a good idea to put them "in maintenance mode" for that long.
Ah, that's a use case I hadn't considered. I've only been working with a couple thousand posts.
For our use case (17 hours in a test import on bare-metal hardware over a local LAN, not virtualised stuff like AWS or whatever over a WAN), and from what I recall from the import script, a simple (hah!) "remember the LAST_KNOWN_IMPORTED_POST_ID and then continue from the next one in each series" might be sufficient (again, hah!).
Basically what we'd like to do when we're ready to move after all the test imports we do for this is:
The theory being that the proposed functionality would save us 17 hours.
Not included here (because it's outside of scope) is the backing up+restoration of site assets outside of the DB (local images referenced in posts for example,) though [thinking aloud] they could probably go through a similar procedure - the most basic being something like rsync if the original assets can't be used directly to begin with. That's part of our problem I think, not yours, though.
@pauljherring what about the concerns that @bdharrington7 mentioned? For example, say during step 3 (during the 17 hours) a post's or topic's content gets edited, a topic gets moved to another category, etc.
Inform users that the board is in the "transfer stage for the next 24 hours, if all goes well" and that such things may go astray, so don't expect to see them later. In the instances where it really matters (PII removal involved, e.g.) the mods will be aware of them (and of the fact of the transfer) and will look at/check the end result for such things.
There is only so much that can (or in the specific instances you mention should) be accommodated in such circumstances.
Coding against such things is usually either wasted effort (the actual coding never being used) in the worst case, or a saving of very little content (i.e. nothing of value retained) in the best. I may have best and worst swapped there, depending on PoV.
fair enough
I'm of the opinion that Pareto holds: sorting out the 90% automatically and using your mods to sort out the remaining 10% is acceptable... (or whatever ratio it ends up being; it's never 80/20.)
Just be sure that you (or rather the people using the importer) know what the 10% will be in the situation.
Otherwise, to accommodate, you'll spend the other 190% of the time coding against that 10%... (wasted effort for no real value.)
It is, IMHO, an acceptable compromise when moving forum software while trying to keep the ethos of the board being moved between two different forums. Keep the major stuff; the minor stuff, while nice, isn't really that important. If it can be easily done, do it. Otherwise, ditch it (if irrelevant) or make it clear what won't be happening (if possibly relevant/peculiar/obscure).
Bear in mind that most of a moving community will (I'd hope) understand they'd be losing some (more visible) features while gaining others, and their attitude to losing some edits/moves/whatever particular sub-feature the old board had will, to the vast majority, be...
meh.
But then I come from (or am rather speaking on behalf of) a rather antagonistic community that will be resistant to switching - and even they (if informed) would be understanding of such 'flaws' in the import procedure.
So, the green button below will now skip the "flush step" (whereas the orange one will flush first); however, it would still check each record that comes in from the exporter to see whether it was already imported or not.
So @pauljherring, what I would do is the following (and I am quoting your steps from above): remember the LAST_KNOWN_IMPORTED_POST_ID, OR record the timestamp range, or the offset for the limit, as I explained in the original post, and then SELECT the new records only, something like SELECT ... FROM posts WHERE post_id > LAST_KNOWN_IMPORTED_POST_ID. Or you can choose to make it configurable, kind of like an offset config for each type of record; a rough sketch of the exporter side is below.
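To make that concrete, here's a minimal sketch of how an exporter could resume from a known id, assuming it follows the getPaginated[RECORD](start, limit, callback) convention mentioned further down. The getPaginatedPosts name, the lastImportedPostId config key, the config access path, and the db handle are all made up for illustration.

```js
// Hypothetical exporter method: only return posts newer than a configured id.
Exporter.getPaginatedPosts = function (start, limit, callback) {
  var lastId = parseInt((Exporter.config.custom || {}).lastImportedPostId, 10) || 0;
  var query =
    'SELECT * FROM posts ' +
    'WHERE post_id > ' + lastId + ' ' +
    'ORDER BY post_id ASC ' +
    'LIMIT ' + limit + ' OFFSET ' + start;
  db.query(query, callback); // `db` stands for whatever client the exporter already uses
};
```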
@pauljherring, what's the bottleneck that's taking 17 hours? I don't have a database with that many users or votes to test on. I could build one, but I'm lazy.
I just tested 30k topics, 300k posts, and 10k users on my Mac mini (i5, 16GB RAM, 256GB SSD, MongoDB): 30k topics took 5 minutes, 300k posts took about 1 hour (you have 600k, so let's say 2 hours there), and 10k users took 2 minutes (so if I estimated correctly, 140k users should take 14*2 = 28 minutes).
Something is growing exponentially, could be a memory leak or something.
Any chance I can get a copy of that large DB dump you've got? (Please obfuscate emails, passwords, and other sensitive info in it.)
Unfortunately, the DB also has private messages in it (since they're simply another form of topic)... sanitising it for 'public' consumption isn't a quick or easy thing to do, sadly. (You're not the first to ask for such.)
Could the size of the posts be an issue? Recent DB backup:
sockbot@work:~$ gzip -l /home/sockbot/SockBot/backups/what-the-daily-wtf-2015-09-24-035955.tar.gz
compressed uncompressed ratio uncompressed_name
768738095 3246448640 76.3% /home/sockbot/SockBot/backups/what-the-daily-wtf-2015-09-24-035955.tar
sockbot@work:~$
Understood, no worries.
Size, as in the content length of each post? That shouldn't be an issue unless we're hitting memory limits. Other than that, the importer does not parse or do any string operations on the post.content.
Incidentally, the numbers on that screencap aren't entirely accurate, especially wrt 'topics' (which as previously stated includes private messages, but will also include 'hidden' topics and 'deleted' ones):
postgres@what:~$ psql -d discourse -c "select count(*) from users"
count
--------
141130
(1 row)
postgres@what:~$ psql -d discourse -c "select count(*) from topics"
count
-------
19967
(1 row)
postgres@what:~$ psql -d discourse -c "select count(*) from posts"
count
--------
615980
(1 row)
postgres@what:~$
And 2.5M votes, right? I'm trying to build a sample database with a similar amount of records for testing; I mean, 18 hours seems like a lot.
I just imported 223k users, 60k topics, 215k posts, and 52k private messages in less than 1.5 hours on my MacBook. I'm gonna triple the number of posts, create 2.5M votes, and test again.
and 2.5M votes? right?
'Likes', in Discourse?:
discourse=# select action_type, count(*) from user_actions group by action_type order by action_type;
action_type | count
-------------+---------
1 | 2482528
2 | 2482516
3 | 5653
4 | 12900
5 | 568374
6 | 400515
7 | 34175
9 | 24733
11 | 3512
12 | 35879
13 | 91843
(11 rows)
discourse=# select id, name_key from post_action_types;
id | name_key
----+-------------------
1 | bookmark
2 | like
3 | off_topic
4 | inappropriate
5 | vote
8 | spam
6 | notify_user
7 | notify_moderators
(8 rows)
No idea what types 9-13 are...
OK so, just tested: MacBook Pro, i7, 16GB, SSD, Mongo 3.0 (Redis would be a bit faster).
Total: almost exactly 11 hours. Seems like votes are the bottleneck here; let me see what I can do there.
Technically, it's already supported (well, not really, I need to skip the flushing). The thing is that, right now, the importer will check each record to see whether it was imported before attempting to import it again, which is much faster than the actual import but slower than knowing for a fact that it was imported (or just being told to skip over it anyway). I am thinking 3 buttons instead of the current 2:
(pseudocode here)
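The original pseudocode isn't reproduced above, but my reading of the three buttons is roughly the following; the function names are illustrative only, not actual plugin code.

```js
// Rough sketch of the three proposed entry points (names made up for illustration).
function flushAndStart() {
  // current behavior: wipe previously imported data, then import everything
  flushImportedData();
  importAllRecords();
}

function startWithoutFlush() {
  // keep existing data, re-check every record, import only what's missing
  importAllRecords({ skipAlreadyImported: true });
}

function resume() {
  // aware of which phases (users, categories, topics, posts, ...) already finished,
  // so it continues from the first unfinished one
  importRemainingPhases();
}
```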
The only difference between #resume and #noflush-start would be that the latter would check ALL records and import whatever hasn't been imported yet, whereas #resume is aware of the order and of which "phases" were already done.
@BenLubar If you want more control, we can either use the existing getPaginated[RECORD](start, limit, callback), where START == offset + current and offset would be something you specify in the config. The thing is that you're gonna have to specify this for each type of record, kinda like the example below. This would be easy to implement, but hard to use, IMO.
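Something along these lines, perhaps. The key names are made up for illustration (no such config keys exist today), and the values just reuse the record counts quoted earlier in the thread as "already imported" offsets.

```json
{
  "usersStartOffset": 141130,
  "topicsStartOffset": 19967,
  "postsStartOffset": 615980,
  "votesStartOffset": 2482516
}
```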
Preferred way:
Another way we can do this is to use a timestamp range. But then your exporter plugin is gonna need to use these timestamps in its queries, via the config.custom hash: you enter something like the JSON below in the Exporter Config Custom (yes, you type it in as JSON, I know, don't ask).
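For illustration only, the config.custom entry might look something like this; the key names and shape are my guess, not an existing schema, with timestamps as epoch milliseconds.

```json
{
  "importedTimestampRange": {
    "from": 1441065600000,
    "to": 1449619200000
  }
}
```

The exporter's queries would then filter on that range, for example selecting only rows created or updated after the range's end, so only records added since the last import come back.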
And then you click the (to-be-added) Import without flush button. This way, you avoid having the importer check every record from record 0 to see whether it was imported; it only processes what the exporter queries return.