nathanblogs opened this issue 5 years ago
Hi @nathanblogs. Thanks for reporting this issue.
A solution for your issue would be setting `scout.synchronous` to `true`:

```php
config(['scout.synchronous' => true]);
// Run your command
config(['scout.synchronous' => false]);
```

If it works, do you mind creating a pull request adding this as an optional argument on the reimport command?
Where does `scout.synchronous` come from? It doesn't seem to be declared anywhere.
It's not declared; that's why the configuration option is `false` by default. When it's `true`, we actually wait for `saveObjects` to finish before moving on to the next one: https://github.com/algolia/scout-extended/blob/f3475be9279dce705d438b74f97d91195f5e4409/src/Jobs/UpdateJob.php#L139.
Give it a try, and tell me if it solves your problem.
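For context, here is a rough sketch of what "waiting" means at the Algolia client level. This assumes the Algolia PHP client v2, where `saveObjects()` returns a response object whose `wait()` blocks until the batch has been indexed; the app credentials, index name, and record are placeholders:

```php
<?php
// Sketch only: assumes the Algolia PHP API client v2 is installed
// via Composer. Credentials and index name are hypothetical.
require 'vendor/autoload.php';

$client = \Algolia\AlgoliaSearch\SearchClient::create('APP_ID', 'API_KEY');
$index  = $client->initIndex('threads');

$response = $index->saveObjects([
    ['objectID' => 1, 'title' => 'Hello'],
]);

// With scout.synchronous enabled, UpdateJob effectively does this:
$response->wait(); // block until the indexing operation completes
```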
I'm not sure how `UpdateJob` is relevant?
I can't see how that would help. Isn't it calling the existing Scout function `makeAllSearchable`? https://github.com/algolia/scout-extended/blob/9aa937089343c05460252b9a438c670b7beebabb/src/Console/Commands/ReImportCommand.php#L81

This in turn calls `searchable` in batches, which dispatches jobs. After that there is a `moveIndex`: https://github.com/algolia/scout-extended/blob/9aa937089343c05460252b9a438c670b7beebabb/src/Console/Commands/ReImportCommand.php#L94. `moveIndex` has no way of telling whether all the dispatched `searchable` jobs have finished, which I believe is the cause of the issue.

I think the whole reimport needs a rethink to be a sync function instead of using a temporary index.
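The flow being described can be sketched like this. This is a simplification, not the actual package code; `$client` stands for the Algolia search client, and the index names are placeholders:

```php
// Simplified sketch of the reimport flow described above.
// makeAllSearchable() chunks the models and, when queues are enabled,
// dispatches an indexing job per chunk instead of indexing inline.
Thread::makeAllSearchable();

// ...the command then immediately moves the temporary index over the
// production one, without waiting for those queued jobs to finish:
$client->moveIndex('temp_threads', 'threads');
```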
Problem finally understood.
I will try to investigate this during the next couple days.
I'm taking this issue @nunomaduro
I got it to work by turning off the queue entirely. Example:

```php
config(['scout.queue' => false]);
\Illuminate\Support\Facades\Artisan::call('scout:reimport "App\\\\Models\\\\Thread"');
config(['scout.queue' => true]);
```
Hopefully your reimports are as rare as mine. I had to make a structural change to everything in an index, and I added the above to a migration's `up()` method so that I don't have to remember to SSH into prod when a batch of changes gets deployed.
You would also be able to do this through SSH via `php artisan tinker`, of course.
@torrancemiller Maybe we can work on a pull request together for this, no?
@torrancemiller That solution doesn't resolve the race condition that can occur, unless you put your system into read-only mode or maintenance mode.

Basically, what can happen is:

1. you start `scout:reimport`
2. model id 1 gets pushed to the new Algolia temp index
3. model id 1 gets updated in your system
4. `scout:reimport` finishes moving the temp index over the real index

That's a race condition: you now have old/stale data in Algolia for model id 1. This is particularly a problem if you have a large dataset or a high frequency of updates on searchable models.
@nathanblogs I do not believe there is a technical solution to that sort of thing. I would recommend doing this during a service maintenance window, and close to never on production servers.

However, if you did have to do this in production, I recommend running the `php artisan down` command to block incoming requests to your server while the reimport runs.

If that is not an option, you will need to get fancy: write a SQL query to retrieve all models updated after the moment you decided to reimport, call the `->searchable()` method on that result set, and you should be peachy.
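That catch-up step might look something like this. A sketch only: the `Thread` model and the captured timestamp are assumptions, but Scout does add a `searchable()` macro to Eloquent query builders:

```php
use App\Models\Thread; // hypothetical searchable model

// Timestamp captured just before scout:reimport was started (assumed).
$reimportStartedAt = '2020-01-01 12:00:00';

// Re-index only the models touched while the reimport was running.
// Scout's searchable() macro chunks the results and indexes each chunk.
Thread::query()
    ->where('updated_at', '>=', $reimportStartedAt)
    ->searchable();
```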
@nunomaduro well, I suppose there may be much more elegant fixes if they were to be incorporated into the package. I wish I had more free time to see if there may be a surgical fix here.
@torrancemiller Of course there are technical solutions to this.

I believe the easiest solution would be to update the `UpdateJob` function to add a sha1/md5 hash to the array returned by `toSearchableArray()`.

Then we could update `ImportCommand` to do a sync instead of a full import: for each `scout.chunk.searchable` chunk, call `getObjects`; if an item is missing from the index, run `searchable`; if the hash of `toSearchableArray()` differs from the Algolia record, run `searchable`. Any thoughts?
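A minimal sketch of the hashing idea (the helper name `searchableChecksum` is hypothetical, not part of the package):

```php
<?php

// Hypothetical helper: hash the searchable payload so a sync pass can
// skip records whose content hasn't changed since the last import.
function searchableChecksum(array $record): string
{
    ksort($record); // make the hash independent of key order

    return sha1(json_encode($record));
}

// The hash could be attached to the record when indexing:
$record = ['objectID' => 1, 'title' => 'Hello'];
$record['_checksum'] = searchableChecksum($record);
```

During a sync, the command would compare `_checksum` on the Algolia record against a freshly computed hash and only call `searchable` on mismatches.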
@nunomaduro Any reason not to just force `SCOUT_QUEUE` to be ignored when Scout is invoked from this context? I don't see any other way this could possibly work.
You mean in the command itself?
I have my Scout queue setting set like this:

```php
'queue' => [
    'queue' => 'search-index',
],
```

Specifying the queue like this will not wait for the temp index to be ready, but moves an empty index to 'production'. Anything I can do to get this fixed?
Well, maybe you/we can work on a pull request to get this fixed. Just to be clear, the problem occurs only when the user is using queues:

1. we copy the settings to the temporary index
2. we queue the indexing jobs
3. we move the temporary index to production without waiting for the queued jobs to complete
I agree on the problem, gonna see if I can fix it in the upcoming days :)
Any news on this?
Edit: I went ahead and just created a new command that sets the queue to false, then sets it back to my configured settings.

```php
public function handle()
{
    \Config::set('scout.queue', false);

    $this->call('scout:reimport');

    \Config::set('scout.queue', [
        'queue' => 'search',
        'connection' => 'redis',
    ]);
}
```
If you are using a queue that isn't FIFO (e.g. AWS SQS) to reimport into Algolia, there is a very high chance of hitting a race condition where you move the temp index over the real index without having pushed all the data into the new index. I've seen examples with 5-10% of the records missing.

Also, depending on how long the reimport takes and how frequently you update records, there is a high chance the reimport will import stale data with this approach. Example timeline:

1. start the reimport
2. a record gets pushed into the new temp index
3. the record gets updated and pushed into the old index
4. the temp index gets moved to the main index

You now have stale data in the newly moved index.
Unsure if this is related, but you seem to reference `config('scout.synchronous', false)` in the code, and I can't figure out where or how that is defined?