Open nrshapiro opened 1 year ago
This sounds like a relatively easy win to me @nrshapiro !
How much time do you think this will save on each push?
@jordanallen-dev @ocappello @davidtuckett
Well the stat component took 2 hours to complete today, and then the last step failed due to a resource timeout.
I think the change will get it down to < 30 minutes.
After that succeeded, on the last step of the CI the sitemapper failed due to a timeout.
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='ec2-3-81-213-119.compute-1.amazonaws.com', port=80): Read timed out. (read timeout=60)
2023-03-15 14:57:30 opasSiteMapper/sitemapper(122): ERROR Sitemap Error: 404
opasSiteMapper - Open Publications-Archive Server (OPAS) - SiteMapper
@jordanallen-dev
I remember you saying that running the stat updater on every nightly data load is wasteful. Under what circumstances should the stat updater be run?
It doesn't really have to be run every night. But the problem is this: if we don't want the citation part to be part of the push to production, then:
a) we have to make sure we run the citation part sometime after the last data add, and before the push to production. That's the hard part...when is the last data add before production?
or
Alternatively, we could just run both during the production run, which would make the production run take longer while the builds would take much less time. But then Stage wouldn't have the data during QA, so there could be no checks before a push.
Thus, it currently runs each time we add data (a). It might be nice, though, to be able to manually suppress it on a build, e.g., if we know there will be another chance to run it before a push, or if it's a very minor update (c!).
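A manual override like that could be as simple as an environment variable checked by the build script before it calls the stat updater. This is only a sketch: `SKIP_STAT_UPDATE` and `should_run_stat_update` are hypothetical names for illustration, not something that exists in OPAS or the current CI.

```python
import os


def should_run_stat_update(env=None):
    """Decide whether the build should run the stat updater.

    SKIP_STAT_UPDATE is a hypothetical variable name (an assumption,
    not an existing OPAS or CI setting). Setting it to 1/true/yes
    suppresses the stat update for that build.
    """
    # Allow a plain dict to be passed in so the logic is easy to test.
    env = os.environ if env is None else env
    value = env.get("SKIP_STAT_UPDATE", "").strip().lower()
    return value not in {"1", "true", "yes"}
```

The default stays "run it", so a build only skips the update when someone explicitly asks for that.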
@jordanallen-dev
UPDATE:
I already covered a better scenario in the opasDataStat help
opasDataUpdateStat - Program to update the view and citation stat fields in the pepwebdocs Solr database.
By default, it only updates records which have views data.
Use -h or --help for complete options. Below are key ones.
Use command line option --everything to add all citation and views data to pepwebdocs.
(This takes significantly longer.)
- The first update after a load should be with option --everything (--all is deprecated)
- Then, omit --everything to update views daily
- only records with views will update Solr
- views data needs to be updated again when moved to Production, since those are the REAL views for the DB.
The citation data does not need to be reupdated, so you don't need all
- Citations (include --everything) only need be updated after non PEPCurrent data updates
To limit the records to an art_id pattern, use --key pattern, e.g., --key PSYCHE\..*
To limit the views records to after a date, use --since date, e.g., --since 2023-03-01
For complete details, see:
https://github.com/Psychoanalytic-Electronic-Publishing/OpenPubArchive-Content-Server/wiki/Loading-Data-into-OpenPubArchive
The records added are controlled by the database views:
vw_stat_docviews_crosstab
vw_stat_cited_crosstab2
Bad article ids, e.g., ref_rx in these tables will cause "article not found" warnings (in Solr)
I just noticed that pushing to production takes a really long time because the CI process runs opasDataStat in order to update the citation counts and views in Solr.
While it's necessary to update the view counts, since they're different on Stage and Production, it's not necessary to update the citation counts. Those should be the same as it moves from Development to Stage to Production.
We need to separate the stat updates so you can run both on Development/Stage, but just views on the production run, which should be faster.
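Roughly, the CI could pick the options per environment along these lines. This is a sketch under assumptions: `build_stat_args` and the environment names are hypothetical, and only the `--everything` and `--since` options come from the opasDataUpdateStat help text; how the CI actually invokes the program is not shown here.

```python
def build_stat_args(environment, since=None):
    """Return the option list to pass to opasDataUpdateStat.

    Hypothetical helper: the environment names are assumptions.
    Development/Stage get --everything (citations and views);
    Production omits it, so only records with views are updated.
    """
    env = environment.strip().lower()
    if env not in {"development", "stage", "production"}:
        raise ValueError(f"unknown environment: {environment!r}")

    args = []
    if env != "production":
        # Citation counts are the same across environments, so they
        # only need to be computed before the push to Production.
        args.append("--everything")
    if since is not None:
        # --since limits the views records to those after a date.
        args += ["--since", since]
    return args
```

For example, `build_stat_args("production", since="2023-03-01")` yields `["--since", "2023-03-01"]`, while `build_stat_args("stage")` yields `["--everything"]`.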
@ocappello @jordanallen-dev @davidtuckett