Psychoanalytic-Electronic-Publishing / OpenPubArchive-Content-Server

A document server with an open API that you can build client apps on top of. It can serve journal, book, or video content; all content is stored in XML and can be served as XML, HTML, PDF, or EPUB, depending on the client.
Apache License 2.0

Push to Production runs OpasDataStat on citations as well as views #177

Open nrshapiro opened 1 year ago

nrshapiro commented 1 year ago

I just noticed that pushing to production takes a really long time because the CI process runs opasDataStat in order to update the citation counts and views in Solr.

Updating the view counts is necessary, since they differ between Stage and Production, but updating the citation counts is not. Those should stay the same as the data moves from Development to Stage to Production.

We need to separate the stat updates so you can run both on Development/Stage, but just views on the production run, which should be faster.
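As a rough sketch of that split (not the actual CI code), the build step could pick the opasDataStat arguments from the target environment. The `--everything` flag is the one documented in the tool's help; the function name and the environment names used here are assumptions for illustration only.

```python
# Hypothetical helper: maps a deployment environment to opasDataStat arguments.
# Only the --everything flag comes from the tool's help text; the environment
# names and this function are illustrative assumptions.
FULL_UPDATE_ENVIRONMENTS = {"development", "stage"}


def stat_update_args(environment: str) -> list[str]:
    """Return extra command-line arguments for the stat update step."""
    env = environment.strip().lower()
    if env in FULL_UPDATE_ENVIRONMENTS:
        # Citations and views: the slow, complete update.
        return ["--everything"]
    if env == "production":
        # Views only: the tool's default when --everything is omitted.
        return []
    raise ValueError(f"Unknown environment: {environment!r}")
```

With something like this, the production job would call the updater with no extra arguments, while Development and Stage keep passing `--everything`.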

@ocappello @jordanallen-dev @davidtuckett

jordanallen-dev commented 1 year ago

This sounds like a relatively easy win to me @nrshapiro !

How much time do you think this will save on each push?

nrshapiro commented 1 year ago

@jordanallen-dev @ocappello @davidtuckett

Well the stat component took 2 hours to complete today, and then the last step failed due to a resource timeout.

I think the change will get it down to < 30 minutes.

After that succeeded, on the last step of the CI the sitemapper failed due to a timeout.

requests.exceptions.ReadTimeout: HTTPConnectionPool(host='ec2-3-81-213-119.compute-1.amazonaws.com', port=80): Read timed out. (read timeout=60)
2023-03-15 14:57:30 opasSiteMapper/sitemapper(122): ERROR Sitemap Error: 404

opasSiteMapper - Open Publications-Archive Server (OPAS) - SiteMapper
nrshapiro commented 1 year ago

@jordanallen-dev

I remember you saying that running the stat updater on every nightly data load is wasteful. Under what circumstances should the stat updater be run?

It doesn't really have to be run every night. But the problem is this: if we don't want the citation part to be part of the push to production, then:

a) we have to make sure we run the citation part sometime after the last data add, and before the push to production. That's the hard part...when is the last data add before production?

or

b) alternatively, we could just run both during the production run, which would make that run take longer while the builds would take much less time. But then Stage wouldn't have the data during QA, so there could be no checks before a push.

Thus, it currently runs each time we add data (a). It might be nice though to be able to manually suppress it on a build, e.g., if we know there will be another chance to run it before a push, or if it's a very minor update (c!).
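To make the rule concrete, here is a minimal sketch of the decision: run the citation update whenever data was added, unless someone suppresses it for that build. The `BuildContext` fields and the function name are hypothetical, not existing build options.

```python
from dataclasses import dataclass


@dataclass
class BuildContext:
    """Hypothetical description of one build; both fields are assumptions."""
    data_added: bool              # did this build load new data?
    suppress_citations: bool = False  # manual override for this build


def should_update_citations(build: BuildContext) -> bool:
    """Current behavior (a) plus the proposed manual suppression."""
    return build.data_added and not build.suppress_citations
```

The catch is the one described above: if the update is suppressed, someone has to make sure it still runs once before the push to production.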

@jordanallen-dev

UPDATE:

I already covered a better scenario in the opasDataStat help:

   opasDataUpdateStat - Program to update the view and citation stat fields in the pepwebdocs Solr database.

      By default, it only updates records which have views data.

      Use -h or --help for complete options.  Below are key ones.

      Use command line option --everything to add all citation and views data to pepwebdocs.
      (This takes significantly longer.)

         - The first update after a load should be with option --everything (--all is deprecated)
         - Then, omit --everything to update views daily
            - only records with views will update Solr
            - views data needs to be updated again when moved to Production, since those are the REAL views for the DB.
              The citation data does not need to be reupdated, so you don't need all
         - Citations (include --everything) only need be updated after non PEPCurrent data updates

      To limit the records to an art_id pattern, use --key pattern, e.g., --key PSYCHE\..*

      To limit the views records to after a date, use --since date, e.g., --since 2023-03-01

         For complete details, see:
          https://github.com/Psychoanalytic-Electronic-Publishing/OpenPubArchive-Content-Server/wiki/Loading-Data-into-OpenPubArchive

      The records added are controlled by the database views:
         vw_stat_docviews_crosstab
         vw_stat_cited_crosstab2

         Bad article ids, e.g., ref_rx in these tables will cause "article not found" warnings (in Solr)
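For illustration, the options in that help text could be assembled by a small wrapper like the one below. The flags `--everything`, `--key`, and `--since` are the ones listed in the help; the script filename `opasDataUpdateStat.py` and the wrapper function are assumptions, so adjust them to the real entry point.

```python
from datetime import date
from typing import Optional


def build_stat_command(
    everything: bool = False,
    key: Optional[str] = None,
    since: Optional[date] = None,
    program: str = "opasDataUpdateStat.py",  # assumed script name
) -> list[str]:
    """Build an argument list for the stat updater (illustrative only)."""
    args = [program]
    if everything:
        # Citations and views; per the help, use this for the first update after a load.
        args.append("--everything")
    if key is not None:
        # Limit the update to article ids matching a pattern.
        args += ["--key", key]
    if since is not None:
        # Limit the views records to those after a date (YYYY-MM-DD).
        args += ["--since", since.isoformat()]
    return args


# First update after a load: citations and views.
after_load = build_stat_command(everything=True)
# Production push: views only, limited to recent records.
production = build_stat_command(since=date(2023, 3, 1))
```

The resulting list can be passed to `subprocess.run` as-is, which avoids shell quoting problems with patterns such as `PSYCHE\..*`.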