ericleasemorgan / reader

Distant Reader, a tool for using & understanding a corpus
GNU General Public License v2.0

Add entities and entity types to the Solr index #72

Closed ericleasemorgan closed 4 years ago

ericleasemorgan commented 4 years ago

Please add entities and entity types to the Solr index.

I have enhanced our CORD database with an additional table called "ent" and filled it. The updated database schema is attached. This table contains "named entities", which are really just fancy nouns. Given this table, please update the Solr index's schema to include the following fields for each record:

  name           type          multi-valued     indexed   stored
  entity         text-string   true             true      true
  type           text-string   true             true      true
  facet_entity   string        true             true      true
  facet_type     string        true             true      true
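
For what it is worth, adding the fields might look something like this through Solr's Schema API (just a sketch; I am guessing the names of our text and string field types, so substitute whatever the schema already uses):

  curl -X POST -H 'Content-type:application/json' --data-binary '{
    "add-field": [
      { "name": "entity",       "type": "text_general", "multiValued": true, "indexed": true, "stored": true },
      { "name": "type",         "type": "text_general", "multiValued": true, "indexed": true, "stored": true },
      { "name": "facet_entity", "type": "string",       "multiValued": true, "indexed": true, "stored": true },
      { "name": "facet_type",   "type": "string",       "multiValued": true, "indexed": true, "stored": true }
    ]
  }' http://localhost:8983/solr/cord/schema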

Once that is done, then fill the Solr index accordingly. As with the other loosely joined tables in the database, one can extract the entities and their types for a given record (document_id) with an SQL query looking like this:

  SELECT DISTINCT e.entity, e.type
  FROM ent AS e, documents AS d
  WHERE e.document_id = d.document_id
  AND d.document_id = 1;

Note the use of DISTINCT. This is necessary since each record may very well include multiple entities of the same name and type.

Once you have re-indexed, I will update my interface for searching the index, and I believe the Solr index's schema will stop churning.

schema.txt

ericleasemorgan commented 4 years ago

'Make sense? Do y'all think you can do this work by Friday?

artunit commented 4 years ago

I have added the fields and re-indexed. I ran into a few memory issues, so I added SOLR_JAVA_MEM="-Xms1024m -Xmx2048m" to solr.in.sh. The throughput is close to the original time, maybe an extra minute, but setting up in SolrCloud mode would probably reduce this. I am not sure either facet_entity or facet_type needs a specific field; the examples I have seen would not be tokenized, but some of the entries may have hyphens, which would argue for them being in the mix. A sample of the facets can be seen here:

  curl "http://localhost:8983/solr/cord/select?q=*:*&wt=json&fl=id,title&indent=true&rows=2&facet=true&facet.field=facet_authors&facet.field=facet_keywords&facet.field=year&facet.field=facet_journal&facet.field=facet_sources&facet.field=facet_license&facet.field=facet_entity&facet.field=facet_type&facet.limit=5"

ericleasemorgan commented 4 years ago

search.txt

Art, works great. You can see the fruits of your labors through some SSH trickery. If you are using PuTTY as your ssh client, then first tunnel to our Web host:

putty -L localhost:8080:localhost:8080 cord.distantreader.org
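
If you are not using PuTTY, a plain OpenSSH tunnel ought to do the same thing (substitute your own login on the host; "yourname" below is just a placeholder):

  ssh -L 8080:localhost:8080 yourname@cord.distantreader.org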

If successful, then open your desktop's Web browser, and go to:

http://localhost:8080

Hopefully you will see an HTML page similar to the one attached. From there you can enter just about whatever you want in the search box, and the search should not crash.

Can you see an HTML page?

My next steps are three-fold:

  1. edit my database schema
  2. recreate the database
  3. tell you when I'm done so you can re-index, again

Once we get that done, I think you will have cycles to play with Solr's performance.

'Make sense?

-- Eric

ericleasemorgan commented 4 years ago

Art and Ralph, I have re-created the database using the correct document_id type (INT, not TEXT). I have refilled the database too, and it is now about a GB in size. Please re-index and tell me when you are done. --Eric M.

artunit commented 4 years ago

I rebuilt the index, and the experience is probably more fodder for pressing the two designated nodes into service. I will sketch out one proposal for a SolrCloud scenario in my next comment, but building the index revealed some limitations in the way we are dealing with multiple SQL requests in the data handler. We have been using SortedMapBackedCache to improve throughput, and it definitely makes a difference for the sub-queries: our previous indexing time was close to 30 minutes with caching, compared to over 3 hours when running the sub-queries uncached.

However, I ran into a wall last night where the cache for entity (about 3.5 million distinct values at this point) couldn't be built because it required too much memory. I broke the build into a few smaller bits by running multiple imports with Solr's clean option, basically using SQLite's BETWEEN syntax to parcel the data into a handful of builds, i.e.:

  curl "http://localhost:8983/solr/cord/dataimport?command=full-import&clean=false"

Setting clean=false preserves the existing index, meaning each run adds to what is already indexed, while BETWEEN identifies the desired subset of the database:

<entity name="entity"
    query="select distinct entity, document_id from ent where document_id between 100000 and 125000"
    cacheKey="document_id" cacheLookup="edr1.id" cacheImpl="SortedMapBackedCache"
>
    <field column="entity" />
    ...

The overall time is roughly equal to our previous experience: each run took about 5-10 minutes, and there were 5 runs. The first 2 took about twice as long as the other 3; I wondered if there was some dictionary advantage achieved at some point (fewer unique entries that far into the dataset), but it could also just have been the vagaries of the data.
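
For the record, the staged runs looked roughly like this from the shell (a sketch only; I edited the BETWEEN range in the data-config between runs, and the document_id boundaries shown above are just examples):

  # first slice; a plain full-import cleans (empties) the index before loading
  curl "http://localhost:8983/solr/cord/dataimport?command=full-import"

  # remaining slices; clean=false appends to what is already indexed
  curl "http://localhost:8983/solr/cord/dataimport?command=full-import&clean=false"

  # poll progress between runs
  curl "http://localhost:8983/solr/cord/dataimport?command=status"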

ericleasemorgan commented 4 years ago

All I can say is, "Wow I'm glad you are here!" Yes, I do not see any churn coming your way in the database. So please explore exploiting the additional nodes and optimization. --Eric

artunit commented 4 years ago

So, a couple of notes on SolrCloud. I'm not sure they are applicable to the XSEDE environment, but these are a few lessons from running through this on a few machines here:

  • ZooKeeper (https://zookeeper.apache.org/) was very easy to deploy
  • for two nodes, I found the second node needed the host specified, e.g.: ./bin/solr start -cloud -h ledwebdev.uwindsor.ca -p 8983 -s server/solr/node1/solr -z 10.40.40.232:2181 -m 2g
  • uploading a config set seemed to be the easiest way to handle the data handler files, e.g.: ./server/scripts/cloud-scripts/zkcli.sh -zkhost 10.40.40.232:2181 -cmd upconfig -confname cord_configs -confdir ~/stuff/scratch/solr/solr-8.5.1/server/solr/configsets/cord_configs/conf
  • in order to leverage the cloud environment for building the index, it seems to be necessary to target specific cores (https://lucene.472066.n3.nabble.com/SolrCloud-DIH-Data-Import-Handler-MySQL-404-td4386685.html), e.g. cord_shard1_replica_n1, which, in turn, requires a mechanism to parcel out subsets of the dataset
  • it is possible to pass parameters to the data import, e.g. ${dataimporter.request.my_variable}, so one option would be to issue imports to each core with a parameter for what portion of data to import (rough sketch below)
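
Here is a rough sketch of that last option (untested; the host names and document_id boundaries are made up, and the data-config query would reference the parameters as ${dataimporter.request.lower} and ${dataimporter.request.upper}):

  # send one slice of the data to each core, passing the document_id boundaries as request parameters
  curl "http://solr-node1:8983/solr/cord_shard1_replica_n1/dataimport?command=full-import&clean=false&lower=1&upper=100000"
  curl "http://solr-node2:8983/solr/cord_shard2_replica_n1/dataimport?command=full-import&clean=false&lower=100001&upper=200000"

  # and in the data-config, something like:
  #   query="select distinct entity, document_id from ent
  #          where document_id between ${dataimporter.request.lower} and ${dataimporter.request.upper}"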

ralphlevan commented 4 years ago

Yes, setup is easy. Yes, the nodes need the addresses of the other nodes.

You do not need to partition the input data yourself. Whatever node you give the data to will figure out what each record's final destination should be. While I've not used the technique you've been using to load the database, it should still work in the cloud environment.

ralphlevan commented 4 years ago

I'm not going to be at the meeting today. I have a window contractor coming by at the same time.

artunit commented 4 years ago

From what I can glean, the data handler plumbing doesn't get automatic partitioning like other updates, but it's all new to me, and the information I have been able to piece together online is not extensive. I did read that SolrCloud does not partition a data handler job but instead hands it to one core; even that didn't happen in my testing, though. I could build against individual cores directly, but the import jobs seemed to be ignored at the collection level. The thread I linked to is from 2018 and Solr 6.x, so it might not be relevant. Anyway, I am looking forward to seeing how it works. Good luck with the contractor!

ericleasemorgan commented 4 years ago

Again, I'm very glad y'all are here, and I feel you are making progress. Thank you. --Eric M.

artunit commented 4 years ago

Not a problem, glad to help. It's a cool project and I have learned lots!

ralphlevan commented 4 years ago

Like I said, I've not used your technique. I wrote a bulkdatahandler which throws bundles of Lucene documents at the cluster. Once Eric C. gets me access to the cluster, I'll see if I can use Art's datahandler.

ericleasemorgan commented 4 years ago

BTW, I believe Eric C is taking PTO, so he might not get back to you before Monday. --Eric M.

ericleasemorgan commented 4 years ago

I believe this is done. Closing.