ericleasemorgan / reader

Distant Reader, a tool for using & understanding a corpus
GNU General Public License v2.0

loop through the database to create a full text index of the selected subset #15

Closed: molikd closed this issue 4 years ago

molikd commented 4 years ago

relates to #14

artunit commented 4 years ago

I think one approach to this is to use Solr's Data Import Handler to leverage the database but pull the full text from the file system; the FileReaderDataSource is sort of an example of this.

molikd commented 4 years ago

We are going to be switching to a Postgres end-point so Data Import Handler might work well. Eric?

artunit commented 4 years ago

Solr normally uses a timestamp column to calculate delta entries; this could be a trigger in sqlite. Postgres would work well for this too, and it also has an awesome foreign data wrapper.

artunit commented 4 years ago

I managed to get my laptop trapped in transit between a repair shop and the campus before the start of the pandemic, so these examples are from a borrowed Windows machine. I used the sqlite table id as the identifier here, but this should be whatever meaningful identifier is possible with the dataset (sha looks like it has duplicates?). So here a full import would use the identifiers from the articles table to determine which file to import. I thought this could be done with JSON files, but I think it has to be done with XML inputs. There are some solr-specific functions that I think we would want to position the data for anyway, so I have some python code for setting up the XML input; but basically the idea here is for the database to be linked with the indexing (to allow "looping through"). A solr deltaimport, which is typically what you set up in cron or whatever, would keep the index in sync (see sql triggers below).

<?xml version="1.0" encoding="utf-8"?>
<dataConfig>
  <dataSource name="ddr1" 
       url="jdbc:sqlite:C:/util/shared/DistantReader/test/test.db" 
       driver="org.sqlite.JDBC"/>
  <dataSource name="ddr2" type="FileDataSource" encoding="UTF-8" />
  <document>
    <!-- processor for database interactions -->
    <entity name="edr1" dataSource="ddr1" pk="id" processor="SqlEntityProcessor" 
         query="SELECT id from articles"
         deltaQuery="SELECT id from articles WHERE timeStamp > '${dataimporter.last_index_time}'"
         deletedPkQuery="SELECT deleted_id as id from deletes WHERE timeStamp > '${dataimporter.last_index_time}'">
        <!-- processor for filesystem interactions -->
        <entity name="edr2" dataSource="ddr2" 
             processor="XPathEntityProcessor" useSolrAddSchema="true" 
             stream="true" forEach="/doc" 
             url="C:\tmp\xml\${edr1.id}.xml"/> 
    </entity>
  </document>
</dataConfig>
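
For reference, each of the ${edr1.id}.xml inputs above is just a small chunk of Solr-style add XML with a <doc> root (matching forEach="/doc" in the config). A sketch of what one of those files might hold, using placeholder field names and a unix path instead of the windows one:

cat > /tmp/xml/1.xml <<'EOF'
<doc>
  <field name="id">1</field>
  <field name="title">Example article title</field>
  <field name="fulltext">The processed plain text of the article ...</field>
</doc>
EOF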

There is a deletes table and I added 2 triggers to the database schema:

CREATE TABLE articles (
        id        INTEGER PRIMARY KEY AUTOINCREMENT,
        sha       TEXT,
        title     TEXT,
        journal   TEXT,
        date      TEXT,
        abstract  TEXT,
        doi       TEXT,
        timeStamp DATE
);
CREATE TRIGGER insert_article_timeStamp AFTER INSERT ON articles
BEGIN
  UPDATE articles SET timeStamp = DATETIME('NOW')
  WHERE rowid = new.rowid;
END;
CREATE TABLE deletes (
        id        INTEGER PRIMARY KEY,
        deleted_id INTEGER,
        timeStamp DATE
);
CREATE TRIGGER insert_after_delete AFTER DELETE ON articles
BEGIN
  INSERT INTO deletes (deleted_id,timeStamp) VALUES (OLD.id,DATETIME('NOW'));
END;
CREATE TABLE authors (
        sha     TEXT,
        author  TEXT
);
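
A quick way to sanity-check the triggers from the command line (this assumes the sqlite3 CLI and the test.db used in the config above; the values are throwaways):

# insert a row; the trigger stamps timeStamp automatically
sqlite3 test.db "INSERT INTO articles (sha, title) VALUES ('abc123', 'test article');"
# confirm the stamp landed on the new row
sqlite3 test.db "SELECT id, timeStamp FROM articles ORDER BY id DESC LIMIT 1;"
# rows changed since a given time are what deltaQuery would pick up
sqlite3 test.db "SELECT id FROM articles WHERE timeStamp > datetime('now', '-1 hour');"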

The last time I was involved with something like this, the solr index was built totally outside of dataimporthandler, and deltaimports were used to keep ongoing, and typically more minor, updates in sync.
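
Something like the cron entry below is usually all the "keep in sync" piece amounts to; the port and core name here are just placeholders:

# run a DIH delta-import every night at 2am
0 2 * * * curl -s "http://localhost:8983/solr/cord/dataimport?command=delta-import" > /dev/null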

ericleasemorgan commented 4 years ago

On May 14, 2020, at 5:13 PM, David Molik notifications@github.com wrote:

We are going to be switching to a Postgres end-point so Data Import Handler might work well. Eric?

[I'm not sure how this email message will go, but I'll give it a try...]

Does Postgres support the creation of databases from: 1) the command line, and 2) users other than root? I think we will need both.

Suppose we have a file containing a set of SQL CREATE statements. With the given file, from the command line and sans any secrets, can a Postgres database be created? I already know that it is possible to query a Postgres database, but we need to create one. If we can do this, then let's go with Postgres.
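
Something along these lines is what I have in mind; the role and file names are only examples, and I am assuming a non-root Postgres account with CREATEDB rights and a ~/.pgpass file so no secrets appear on the command line:

# create an empty database as a non-root user
createdb -U reader -h localhost carrel_test
# load a file of CREATE statements into it
psql -U reader -h localhost -d carrel_test -f schema.sql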

How does that sound?

[Please raise your hand if you got this email message.]

-- Eric M.

ericleasemorgan commented 4 years ago

Solr normally uses a timestamp column to calculate delta entries; this could be a trigger in sqlite. Postgres would work well for this too, and it also has an awesome foreign data wrapper.

I have much more familiarity with MariaDB than I have with Postgres. If Postgres supports the creation of databases from the command line, and if the requests can be issued by a non-root user, then I think Postgres would be a good database back end. What do y'all think? --Eric M.

[I'm still learning how to use GitHub and this email thing.]

ericleasemorgan commented 4 years ago

On May 15, 2020, at 2:31 PM, artunit notifications@github.com wrote:

../I used the sqlite table id as the identifier here, but this should be whatever meaningful identifier is possible with the dataset... basically the idea here is for the database to be linked with the indexing (to allow "looping through"). A solr deltaimport, which is typically what you set up in cron or whatever, would keep the index in sync (see sql triggers below).

Art, first of all, please raise your hand if you get this message, because I'm not sure how this thing works yet. Second, I sincerely appreciate your perseverance.

In the past, I would fill up my Solr instances using a (Perl) script. The script would query a database, map database fields to Solr fields, and incrementally push records to Solr. The process worked for me, but nowadays Perl is passé, and the process does not scale very well.

What you outline above cuts out Perl and substitutes a Java library. Not a problem.

The more I think about it, the more I think we may need/want multiple Solr indexes. For example, we may "read" all of the articles from three different journals, and we may want to index the result. That's one index. We will want to index the whole of CORD. That is a second index. There may very well be many, many different study carrels.

Assuming each set of stuff we "read" has the exact same data structure, and assuming each thing we "read" has a simple one-word title, how easy is it for us to create multiple indexes? Attached ought to be the current database schema for each study carrel the Reader creates.

Similarly, one of the coolest things about Solr is faceting. Within my Solr schema, I would define fields which were intended to be facets. In my Perl script I would create lists (arrays), and push each item in the list into a Solr field. I suppose we can do something similar with the thing you outline above?
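
In other words, I am imagining queries like the one below, where the facet field names are purely hypothetical and the core and port are whatever we end up with:

# count documents per journal and per source without returning any records
curl "http://localhost:8983/solr/cord/select?q=*:*&rows=0&facet=true&facet.field=facet_journal&facet.field=facet_source"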

-- Eric M.

ericleasemorgan commented 4 years ago

On May 15, 2020, at 2:31 PM, artunit notifications@github.com wrote:

../There is a deletes table and I added 2 triggers to the database schema...

Art, as a test, let's use Solr to index the content of a database I've already filled. It is an SQLite database, and it is located on our shared file system at /export/cord/etc/cord.db. Attached ought to be the database's schema. Indexing this database ought to prove to be a good use case. If you are amenable, then I will create an issue accordingly. And if you need me to edit the schema, then doing so is trivial, as is the process of filling up the database again. Okay? --Eric

artunit commented 4 years ago

Sorry to be slow on the uptake; I received an email message for each comment. I don't think much could touch SQLite in terms of ease of deployment. Postgres has some unique options for leveraging solr capabilities at the sql level, but it might be overkill for this purpose. You definitely don't need postgres for the data import handler, which in itself is really just a way to keep a solr index in sync with a database. I will go through a setup on the shared file system; I wanted to sort out the data handler stuff on my laptop first because I tend to lean a lot on the web-based solr admin interface when setting something like this up. The data handler doesn't impact the index definition at all, it can be as faceted and as extensive as solr schemas allow.

ericleasemorgan commented 4 years ago

../The data handler doesn't impact the index definition at all, it can be as faceted and as extensive as solr schemas allow.

Cool. --Eric M.

artunit commented 4 years ago

../let's use Solr to index the content of a database I've already filled. It is an SQLite database, and it is located on our shared file system

I ran through a simple example on the shared file system. I used an XML version of the records, but I didn't realize the full text of the content was now included in the database. SQLite could be the source of all of the content in the index(es) if that's where all of the content from the dataset ends up.

ericleasemorgan commented 4 years ago

I ran through a simple example on the shared file system. I used an XML version of the records, but I didn't realize the full text of the content was now included in the database. SQLite could be the source of all of the content in the index(es) if that's where all of the content from the dataset ends up.

Interesting. Please point me to the directory on the shared file system where I can see the Solr instance. Similarly, what port is Solr using? --Eric

artunit commented 4 years ago

Sorry, it's a long weekend here in Canada and I am just catching up on today's email. The solr instance is in /export/scratch/solr/solr-8.5.1 and I ran the import again with the PlainTextEntityProcessor. I misread the source column in the database last night and thought it was the fulltext, but I think the fulltext for solr should come from the file system (since you already have it nicely processed), and everything else could come directly from the database. I am running this version of solr on port 8984 and you can see the results of the import with:

curl http://localhost:8984/solr/cord/dataimport?command=status

It took just over 11 minutes, but that's only the fulltext. The status report claims there are no skipped documents, but there are documents which did not resolve to a file. The current config for the data handler stuff is in /export/scratch/solr/solr-8.5.1/server/solr/cord/conf/DIHconfigfile.xml. The field layout still needs to be worked out, as well as the faceting and so on, but solr works well with a database for this kind of thing, and I don't think the indexing will be terribly onerous. I haven't worked on optimizing solr at all; I have never done much with SolrCloud or running multiple servers.

ericleasemorgan commented 4 years ago

Art, great work. 'More real soon....

ericleasemorgan commented 4 years ago

This is very good work, and we are making progress. Please use the attached file as a model for creating the next generation of our index. --ELM

Solr schema

Below is a pseudo-Solr schema. It is the beginnings of an indexing model for the CORD data set. All of the fields are expected to be stored. Only some are indexed, and there are only three data types. (KISS) None of the fields have multiple values (yet). Note the faceted fields, because they will be used to support faceted searching and browsing.

Please understand that the creation of our index is iterative. Below is the first outline of the schema. In the (near) future, the underlying database will have additional related tables, and consequently, our schema will include additional fields. Our schema will evolve. For example, there will be faceted authors, faceted sources, faceted entities, and faceted keywords. We will include a field for the full text. When the new related tables are created and filled with content, the index will be expected to support multiple values for some of the fields.

field          type          indexed

document_id    integer       true
authors        text_general  true
title          text_general  true
date           text_general  true
year           integer       true
journal        text_general  true
source         text_general  true
abstract       text_general  true
license        text_general  true
pdf_json       string        false
pmc_json       string        false
sha            string        false
url            string        false
doi            string        false
arxiv_id       string        true
cord_uid       string        true
mag_id         string        false
pmc_id         string        true
pubmed_id      string        true
who_id         string        false

facet_year     integer       true
facet_journal  string        true
facet_source   string        true
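
For what it is worth, fields like these could be declared against a managed schema through Solr's Schema API; the snippet below is only a sketch of a few of them, with the port and core name borrowed from Art's test instance:

curl -X POST -H 'Content-type:application/json' \
  http://localhost:8984/solr/cord/schema -d '{
    "add-field": [
      { "name": "title",         "type": "text_general", "stored": true, "indexed": true  },
      { "name": "sha",           "type": "string",       "stored": true, "indexed": false },
      { "name": "facet_journal", "type": "string",       "stored": true, "indexed": true  }
    ]
  }'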

-- Eric Morgan emorgan@nd.edu May 19, 2020

ericleasemorgan commented 4 years ago

I think we can consider this "done"; we are now successfully able to create study carrels against subsets of CORD.