Babzsak / duke

Automatically exported from code.google.com/p/duke

Allow concurrent access to the Processor for record linkage #56

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
In our use case we'd like to run several record linkage tasks in parallel. For 
this, the Processor.link(...) method must be thread-safe. In particular, the 
MatchListeners should be passed as an argument, so that a single Processor 
instance can be initialized and shared.

Original issue reported on code.google.com by FMitzl...@googlemail.com on 7 Nov 2011 at 12:32

GoogleCodeExporter commented 8 years ago
What does this actually mean? Is it several (Lucene index, record set) pairs 
that you want to run in parallel? Or one Lucene index and several record sets? 
If the latter, would it be enough to simply be able to start the process and 
then push in records as you get them, and then have the Processor run 
internally in parallel for better performance?

Original comment by lar...@gmail.com on 7 Nov 2011 at 12:46

GoogleCodeExporter commented 8 years ago
I'll stick to our example: I'm currently implementing a feature for importing 
'friends' from other sites. We use duke to index our user database and whenever 
a user imports a list of 'friends', we link them to our user base with duke's 
record linkage feature. As we have thousands of users, its possible that two 
users independently import 'friends'.

I'd like to have a singleton instance of duke's 'Processor' which is shared 
across all threads (and therefore a single IndexReader for the Lucene index). 
Currently the MatchListeners are given during initialization of the Processor. 
In our setting it would be necessary to pass the MatchListeners as an argument 
to the 'link(...)' method.
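
The request above can be illustrated with a minimal, self-contained sketch. All types here (`SharedLinker`, `LinkListener`) are hypothetical stand-ins invented for illustration and are not Duke's actual classes; the string comparison is only a placeholder for real record matching. The point is that when the listener is a method argument, the shared instance holds no per-call state:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PerCallListenerSketch {
    // Hypothetical callback type; NOT Duke's MatchListener.
    interface LinkListener {
        void matches(String record, String candidate);
    }

    // Hypothetical shared linker; NOT Duke's Processor.
    static final class SharedLinker {
        // Immutable after construction, so concurrent reads need no locking.
        private final List<String> index;

        SharedLinker(List<String> index) {
            this.index = List.copyOf(index);
        }

        // The listener is a parameter, so nothing call-specific is stored on the instance.
        void link(List<String> records, LinkListener listener) {
            for (String record : records) {
                for (String candidate : index) {
                    if (candidate.equalsIgnoreCase(record)) { // placeholder for real matching
                        listener.matches(record, candidate);
                    }
                }
            }
        }
    }

    // Runs two imports in parallel against one shared linker and returns each import's matches.
    static List<List<String>> run() {
        SharedLinker linker = new SharedLinker(List.of("alice", "bob", "carol"));
        List<List<String>> batches = List.of(List.of("Alice", "dave"), List.of("BOB", "Carol"));
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (List<String> batch : batches) {
                futures.add(pool.submit(() -> {
                    List<String> found = new ArrayList<>(); // confined to this task
                    linker.link(batch, (record, candidate) -> found.add(candidate));
                    return found;
                }));
            }
            List<List<String>> results = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Each task collects matches into its own list, so two concurrent imports cannot see each other's results.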

Does this somehow fit into duke's overall design? (I could implement a proposal 
today for further clarification.) 

Original comment by FMitzl...@googlemail.com on 8 Nov 2011 at 5:09

GoogleCodeExporter commented 8 years ago
Yes, this fits well into the design. I haven't worked much on making it 
thread-safe yet, but on the other hand I think most of the code already is 
thread-safe. If you look at the MultithreadProcessor I added on Sunday you can 
see some work toward this, but that was meant to be used to speed up processing.

As far as I can see, if you modify the API so that you can pass in the 
MatchListeners you should have what you need. However, I'm not sure you really 
need that. Perhaps you should have a single Database instance instead, and 
multiple Processor instances, since it's the Database which really represents 
the Lucene index (and not the Processor), and this way you don't get into 
difficulties with the MatchListeners etc.
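
As a rough sketch of that alternative: one shared, read-only database object and a separate, cheap processor per thread, each constructed with its own listener. The `Database`, `Processor`, and `Listener` types below are hypothetical stand-ins written for this example, not Duke's real classes or signatures, and the equality check is only a placeholder for record comparison:

```java
import java.util.ArrayList;
import java.util.List;

class SharedDatabaseSketch {
    // Hypothetical callback type; NOT Duke's MatchListener.
    interface Listener {
        void matches(String record, String candidate);
    }

    // Hypothetical stand-in for the object that wraps the index; NOT Duke's Database.
    static final class Database {
        private final List<String> records; // read-only after construction

        Database(List<String> records) {
            this.records = List.copyOf(records);
        }

        List<String> findCandidates(String record) {
            List<String> candidates = new ArrayList<>();
            for (String stored : records) {
                if (stored.equalsIgnoreCase(record)) { // placeholder for real matching
                    candidates.add(stored);
                }
            }
            return candidates;
        }
    }

    // Hypothetical per-thread processor; the shared database is injected via the constructor.
    static final class Processor {
        private final Database database;
        private final Listener listener;

        Processor(Database database, Listener listener) {
            this.database = database;
            this.listener = listener;
        }

        void link(List<String> batch) {
            for (String record : batch) {
                for (String candidate : database.findCandidates(record)) {
                    listener.matches(record, candidate);
                }
            }
        }
    }

    // Two threads, two processors, one database; returns the matches seen by each processor.
    static List<List<String>> run() {
        Database shared = new Database(List.of("alice", "bob", "carol"));
        List<String> first = new ArrayList<>();  // written only by thread 1
        List<String> second = new ArrayList<>(); // written only by thread 2
        Thread t1 = new Thread(() ->
                new Processor(shared, (r, c) -> first.add(c)).link(List.of("Alice", "dave")));
        Thread t2 = new Thread(() ->
                new Processor(shared, (r, c) -> second.add(c)).link(List.of("BOB", "Carol")));
        t1.start();
        t2.start();
        try {
            t1.join(); // join() makes each thread's writes visible to the caller
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return List.of(first, second);
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

With this shape the listeners stay per-processor state, so only the database needs to tolerate concurrent reads.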

Original comment by lar...@gmail.com on 8 Nov 2011 at 5:23

GoogleCodeExporter commented 8 years ago
Worked perfectly as you proposed (single Database) - thank you! I only had to 
add a constructor that allows a Database to be injected (pushed to my 
clone).

Original comment by FMitzl...@googlemail.com on 8 Nov 2011 at 8:37

GoogleCodeExporter commented 8 years ago
Excellent! I'll pull the revision over to the official code ASAP.

Original comment by lar...@gmail.com on 8 Nov 2011 at 9:00

GoogleCodeExporter commented 8 years ago
Added and committed now. Seems to solve the problem, so I'm closing the issue.

Original comment by lar...@gmail.com on 11 Nov 2011 at 8:35