Closed SophieMBennett closed 1 year ago
This ranks high among the things that need to be sorted out.
Looking into this now
Potentially an issue when loading into Solr
The client renders it directly from the issue property in the server response:
https://stage-api.pep-web.org/v2/Metadata/Contents/PSYCHE/71/?limit=1000&moreinfo=2
The server gets this data from Solr, where it is also 910.
The source XML, however, has the correct value, 9-10.
Has anything changed recently in the loading that could cause this, @nrshapiro? Maybe some trimming or a regex replace?
@jordanallen-dev @SophieMBennett
Yes, this happened when I integrated the old ArticleID class and code into the new system. The class includes an int value for the issue number, but before converting, it cleaned up extra characters, which also removed the dash. Now it just considers any non-integer a non-starter and passes back None as the int, while keeping the issue string exactly as it was input in the XML, which is what is needed here.
Watch: Because of the variety of ways art_issue is used by journals, there's no attempted cleanup of the art_issue value any more. If there's a bad input in the XML, it will appear in the TOC. It then needs to be fixed in the XML. There are still a few exceptions to that where "special handling" of known art_issue coding takes place.
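To make the before and after concrete, here is a minimal sketch of the behavior described above. It is not the actual ArticleID code: the function names `old_issue_cleanup` and `new_issue_handling` are hypothetical, and the exact cleanup rule (stripping every non-digit) is an assumption based on the description.

```python
import re
from typing import Optional, Tuple


def old_issue_cleanup(art_issue: str) -> Tuple[Optional[int], str]:
    """Hypothetical reconstruction of the old behavior.

    Assumption: every non-digit character was stripped before the int
    conversion, and the stripped string was what got stored.
    """
    digits = re.sub(r"[^0-9]", "", art_issue)
    issue_int = int(digits) if digits else None
    return issue_int, digits


def new_issue_handling(art_issue: str) -> Tuple[Optional[int], str]:
    """Hypothetical sketch of the fixed behavior.

    Anything that is not a plain integer yields None for the int value,
    and the issue string is returned exactly as it appeared in the XML.
    """
    issue_int = int(art_issue) if re.fullmatch(r"[0-9]+", art_issue) else None
    return issue_int, art_issue


if __name__ == "__main__":
    print(old_issue_cleanup("9-10"))   # dash is lost
    print(new_issue_handling("9-10"))  # dash is preserved, int is None
```

With this shape, a combined issue such as 9-10 no longer has a usable int value, but the string shown in the TOC stays faithful to the XML.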
In short, this is fixed. I am reprocessing Psyche (all volumes) and PCAS (27). If there are others, let me know. We have no way in the online system of reloading selected issues...(though the loader can do that), and the best way I know to force specific issues to be reloaded is to run a rebuild of those issues here, sync the XML, and then rerun the online system which will reload whatever was rebuilt.
Great stuff, thanks Neil
We will actually soon have a mechanism to run all the opas tools online, with full command configuration and scheduling, in response to AWS events, via the API, etc. The infrastructure scales to large amounts of concurrency, so the processes won't take as long either!
It's a core part of the infra rework this quarter
Currently got it working with the data cleaner but needs a fair bit more work before it's ready for all the use cases
Out of curiosity, if you had the control online now, what commands would you run with the loader to solve it more directly?
@jordanallen-dev
Since there was no need to rebuild, just reload:
--key PCAS.027. --reload --key PSYCHE. --reload
or
--key (PSYCHE|PCAS).* --reload
to force a reload of all of the matching articles.
(also include any other standard parameters like -d path --nocheck --nohelp --verbose)
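As a rough illustration of which articles each key would select, the sketch below treats the --key value as a regular expression matched against the start of the article ID. That matching rule is an assumption; the thread does not say how the loader interprets the key. The helper `select_articles` and every ID except PSYCHE.071.0738A are hypothetical.

```python
import re
from typing import Iterable, List


def select_articles(article_ids: Iterable[str], key_pattern: str) -> List[str]:
    """Return the IDs whose beginning matches key_pattern.

    Assumption: the key is a regex anchored at the start of the ID.
    """
    pattern = re.compile(key_pattern)
    return [art_id for art_id in article_ids if pattern.match(art_id)]


# Only PSYCHE.071.0738A appears in the thread; the rest are made-up examples.
ARTICLE_IDS = [
    "PSYCHE.071.0738A",
    "PCAS.027.0001A",
    "PCAS.026.0001A",
    "IJP.100.0001A",
]

if __name__ == "__main__":
    print(select_articles(ARTICLE_IDS, r"PCAS.027."))
    print(select_articles(ARTICLE_IDS, r"(PSYCHE|PCAS).*"))
```

Note that under this reading the unescaped dots match any character, so the keys are slightly broader than they look, and the combined pattern selects every PCAS volume rather than just volume 27.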
If the run time for that would be under 15 mins, I can try putting it through the system I have right now. I haven't tried it with any data loader commands yet, as I haven't built the portion that manages the concurrency and batching for the longer processes.
@jordanallen-dev - It wouldn't take long, but going through all of Psyche is going to take a while. Probably similar to a nightly current update or a bit longer--probably under 60 minutes for the opasDataLoader portion.
Ran the commands through the system:
And same for PCAS
Both executed successfully:
6 mins for PCAS, 9 minutes for PSYCHE
There's a full rebuild (smart reload) underway, since currently that's the only way to do it and hit the following affected journals: ZBPA, Psyche, ZPSAP, PSU.
I've run the same loader as online on my local system, with a reload, and these are all fixed.
Running the following query in Solr confirms that all are fixed if it returns no records:
art_iss:/[0-9]{3,4}/
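A minimal sketch of what that query is looking for, assuming art_iss is indexed as a single untokenized string so that the regex has to match the whole value: an issue made up of exactly three or four digits, which is what a collapsed range such as 9-10 turns into. The names `looks_collapsed` and `solr_check_params` are hypothetical, and the Solr host and collection are deliberately left out.

```python
import re

# Same pattern as the Solr query; fullmatch mirrors whole-value matching.
COLLAPSED_ISSUE = re.compile(r"[0-9]{3,4}")


def looks_collapsed(art_iss: str) -> bool:
    """True if the issue value is exactly three or four digits."""
    return COLLAPSED_ISSUE.fullmatch(art_iss) is not None


def solr_check_params() -> dict:
    """Query parameters for a count-only check (rows=0 returns no documents)."""
    return {"q": "art_iss:/[0-9]{3,4}/", "rows": 0}


if __name__ == "__main__":
    for value in ["910", "9-10", "10", "23"]:
        print(value, looks_collapsed(value))
```

A journal that genuinely uses three-digit issue numbers would also match, so a non-empty result would still need a manual look before being treated as a regression.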
Will close after verifying.
Verified.
@nrshapiro @jordanallen-dev
Recently I've noticed that the hyphen has started to be lost in the Issue header when a single issue covers multiple issue numbers.
e.g. PSYCHE.071.0738A(bKBD3).xml, https://pep-web.org/browse/PSYCHE/volumes/71, Issues 9-10
[Screenshots: previous rendering, current rendering, and source XML]
PCAS_27_2-3_Sep_2022
https://pep-web.org/browse/PCAS/volumes/27