Psychoanalytic-Electronic-Publishing / PEP-Web-User-Interface

Single Page App Graphical User Interface for PEP-Web

Loss of hyphen in header - <artiss> #750

Closed SophieMBennett closed 1 year ago

SophieMBennett commented 1 year ago

@nrshapiro @jordanallen-dev

Recently I've noticed that the hyphen is being dropped from the Issue header when an entry covers multiple issues.

eg. PSYCHE.071.0738A(bKBD3).xml https://pep-web.org/browse/PSYCHE/volumes/71 Issues 9-10

Previously: PSYCHE 071 0738A_2 (image)
Current: PSYCHE 071 0738A (image)
XML: PSYCHE 071 0738A_3 (image)

PCAS_27_2-3_Sep_2022

https://pep-web.org/browse/PCAS/volumes/27

PCAS 27 2-3

ocappello commented 1 year ago

this ranks high among things needing to be sorted out.

jordanallen-dev commented 1 year ago

Looking into this now

jordanallen-dev commented 1 year ago

Potentially an issue when loading into Solr

The client renders it directly from the issue property in the server response:

https://stage-api.pep-web.org/v2/Metadata/Contents/PSYCHE/71/?limit=1000&moreinfo=2

image

The server gets this data from Solr, where the value is also 910:

image

The source XML however has it as the correct 9-10 value:

image

Has anything changed recently in the loading that could cause this, @nrshapiro? Maybe some trimming or a regex replace?

nrshapiro commented 1 year ago

@jordanallen-dev @SophieMBennett

Yes, this happened when I integrated the old ArticleID class and code into the new system. The class includes an int value for the issue number, but before converting it cleaned up extra characters, which also removed the dash. Now it treats any non-integer as a non-starter and passes back None as the int, but, as needed here, it keeps the issue string exactly as it was input in the XML.

Watch: Because of the variety of ways art_issue is used by journals, there's no attempted cleanup of the art_issue value any more. If there's a bad input in the XML, it will appear in the TOC. It then needs to be fixed in the XML. There are still a few exceptions to that where "special handling" of known art_issue coding takes place.
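A minimal sketch of the behavior described above, for illustration only. The names `old_style_issue`, `parse_issue`, and `ParsedIssue` are hypothetical and are not taken from the ArticleID class; the sketch only assumes that the old code stripped non-digit characters and that the new code keeps the string and returns None for the int when the value is not a plain integer.

```python
import re
from typing import NamedTuple, Optional


class ParsedIssue(NamedTuple):
    text: str               # issue string exactly as it appears in the XML
    number: Optional[int]   # integer form, or None when text is not a plain integer


def old_style_issue(raw: str) -> str:
    """Illustrates the bug: removing every non-digit also removes the dash."""
    return re.sub(r"[^0-9]", "", raw)


def parse_issue(raw: str) -> ParsedIssue:
    """Keep the input string untouched; derive an int only for all-digit input."""
    number = int(raw) if re.fullmatch(r"[0-9]+", raw) else None
    return ParsedIssue(text=raw, number=number)


print(old_style_issue("9-10"))   # the dash is lost
print(parse_issue("9-10"))       # string preserved, no int value
print(parse_issue("3"))          # string preserved, int value available
```

Under this approach a malformed value in the XML is passed through unchanged, which matches the note above that bad input will show up in the TOC until the XML is corrected.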

In short, this is fixed. I am reprocessing Psyche (all volumes) and PCAS (27). If there are others, let me know. The online system has no way to reload selected issues (though the loader can do that). The best way I know to force specific issues to be reloaded is to run a rebuild of those issues here, sync the XML, and then rerun the online system, which will reload whatever was rebuilt.

jordanallen-dev commented 1 year ago

Great stuff, thanks Neil

We will actually soon have a mechanism to run all the opas tools online, with full command configuration and scheduling, triggered in response to AWS events, via API, etc. It will run on infrastructure that scales to large amounts of concurrency, so the processes won't take as long either!

It's a core part of the infra rework this quarter

Currently got it working with the data cleaner but needs a fair bit more work before it's ready for all the use cases

Out of curiosity, if you had the control online now, what commands would you run with the loader to solve it more directly?

nrshapiro commented 1 year ago

@jordanallen-dev

Since there was no need to rebuild, just reload:

--key PCAS.027. --reload
--key PSYCHE. --reload

or

--key (PSYCHE|PCAS).* --reload

to force a reload of all of the matching articles.

(also include any other standard parameters like -d path --nocheck --nohelp --verbose)
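To illustrate how a key pattern like the ones above could select articles, here is a rough sketch. It assumes the key is treated as a regular expression matched from the start of the article ID; that is an assumption for illustration, not a description of how opasDataLoader actually implements `--key`. The function `select_articles` and every ID other than PSYCHE.071.0738A are hypothetical.

```python
import re
from typing import Iterable, List


def select_articles(key_pattern: str, article_ids: Iterable[str]) -> List[str]:
    """Return the IDs whose beginning matches key_pattern (assumed regex semantics)."""
    rx = re.compile(key_pattern)
    return [art_id for art_id in article_ids if rx.match(art_id)]


# PSYCHE.071.0738A comes from the thread; the other IDs are made up for the example.
ids = [
    "PSYCHE.071.0738A",
    "PCAS.027.0001A",
    "PCAS.028.0001A",
    "IJP.100.0001A",
]

print(select_articles(r"(PSYCHE|PCAS).*", ids))
print(select_articles(r"PCAS.027.", ids))
```

Note that an unescaped `.` matches any character in a regular expression, so `PCAS.027.` is slightly looser than a literal prefix. If a pattern such as `(PSYCHE|PCAS).*` is typed into a POSIX shell, it would need quoting, since `(`, `|`, and `*` are shell metacharacters.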

jordanallen-dev commented 1 year ago

If the run-time for that would be under 15 mins, I can try putting it through the system I have right now. I haven't tried it with any data loader commands yet, as I haven't built the portion that manages the concurrency and batching for the longer processes.

nrshapiro commented 1 year ago

@jordanallen-dev - It wouldn't take long, but going through all of Psyche is going to take a while. Probably similar to a nightly current update or a bit longer--probably under 60 minutes for the opasDataLoader portion.

jordanallen-dev commented 1 year ago

Ran the commands through the system:

image

And same for PCAS

Both executed successfully:

https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Faws$252Flambda$252Fopas-data-utility-staging/log-events/2023$252F05$252F12$252F$255B$2524LATEST$255D6c8b9e89639d4228aff50125527c89ac

6 mins for PCAS, 9 minutes for PSYCHE

nrshapiro commented 1 year ago

There's a full rebuild (smart reload) underway, since currently that's the only way to do it and hit the following affected journals:

ZBPA, Psyche, ZPSAP, PSU

I've run the same loader as online on my local system, with a reload, and these are all fixed.

Running the following query in Solr confirms that all are fixed if it returns no records:

art_iss:/[0-9]{3,4}/

Will close after verifying.
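As a rough sketch of what that check looks for: the query flags any art_iss value made up of exactly three or four digits, which is what a collapsed range such as 9-10 turns into. The snippet below mimics that test locally and builds the standard Solr select parameters (`q`, `rows`, `wt`). It assumes the regex is matched against the whole field value, which depends on how art_iss is indexed; the helper names are hypothetical, and no Solr host or core name is given because the thread does not state one.

```python
import re
from urllib.parse import urlencode

ISSUE_QUERY = "art_iss:/[0-9]{3,4}/"  # query string taken from the thread


def build_select_params(rows: int = 0) -> str:
    """URL-encode the query for a Solr select request; rows=0 returns only the count."""
    return urlencode({"q": ISSUE_QUERY, "rows": rows, "wt": "json"})


def looks_collapsed(art_iss: str) -> bool:
    """Local stand-in for the query: True when the value is exactly 3 or 4 digits."""
    return re.fullmatch(r"[0-9]{3,4}", art_iss) is not None


print(build_select_params())
for value in ["9-10", "910", "2-3", "12"]:
    print(value, looks_collapsed(value))
```

A journal that legitimately uses three-digit issue numbers would also match, so a non-empty result would need a manual look before being treated as a failure.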

nrshapiro commented 1 year ago

Verified.