bodleian / ora_data_model

Documentation and crosswalks relating to the ORA data model

Unable to deposit kilo author paper #157

Closed mrdsaunders closed 4 years ago

mrdsaunders commented 4 years ago

We have been unable to deposit a 'kilo-authored' paper which has a long author list (particle physics ATLAS project). An error occurred whose cause was related to an external system. Request timed out.

https://oxris-qa.bsp.ox.ac.uk/viewobject.html?cid=1&id=245519

Papers for such author lists have been OK in RT1/ORA3.

Before I investigate further, are there any limitations on the number of contributors built into ORA4?

Attachments:

- Error handling a webpage request.txt
- Application-level error handling a webpage request.txt

tomwrobel commented 4 years ago

Tried to redeposit: https://oxris-qa.bsp.ox.ac.uk/rt2repository.html?pub=245519&sid=29

No deposit is being attempted in Sword2. Our logs show no deposit activity. The error logs above "at Symplectic.Elements.Repository2.XwalkOutParser.ParseToDestinationRecord(XElement, XmlDocument, String) in elements\Repository2\XwalkOutParser.cs:line 16, char 4" suggest very strongly that the timeout is entirely at the Oxris end.

In addition, testing the deposit crosswalk for this record yields a blank result.

I think there is a problem with the outbound crosswalk engine, not with ORA.

I won't say that we wouldn't have a problem ingesting this object! Just that this error is an Oxris one. Contact @AndrewBennet I'm afraid. This may or may not represent a serious problem.

mrdsaunders commented 4 years ago
Hi Dave,

There are some performance limitations we have noticed before when xwalking a document with a very large amount of data (typically high-energy physics papers), so I suspect that is the cause here. I don't think we've ever seen it cause a timeout of the web request, though. I will run some xwalk tests locally with large input documents to see if I can reproduce the problem with a Hyrax xwalk map file.
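As an illustration of the kind of local test described above, here is a minimal sketch that generates a synthetic kilo-author input document and times a serialisation pass. The element names are invented for illustration, not the real Elements or Hyrax schema:

```python
# Sketch: build a synthetic record with a very large author list to
# reproduce the crosswalk slowdown locally. Element names are illustrative.
import time
import xml.etree.ElementTree as ET

def make_record(n_authors: int) -> ET.Element:
    record = ET.Element("record")
    ET.SubElement(record, "title").text = "Synthetic ATLAS test paper"
    authors = ET.SubElement(record, "authors")
    for i in range(n_authors):
        author = ET.SubElement(authors, "author")
        ET.SubElement(author, "surname").text = f"Surname{i}"
        ET.SubElement(author, "initials").text = "A.B."
    return record

for n in (100, 1000, 5000):
    doc = make_record(n)
    start = time.perf_counter()
    xml_bytes = ET.tostring(doc)   # stand-in for the real xwalk step
    elapsed = time.perf_counter() - start
    print(f"{n} authors -> {len(xml_bytes)} bytes in {elapsed:.4f}s")
```

Scaling `n` up while timing the real crosswalk in place of `ET.tostring` would show whether processing time grows super-linearly with author count.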

I will also note we have a project underway to overhaul the crosswalk engine, which should substantially improve the speed of the crosswalk operation. This work isn't targeted for a specific release yet, but if it becomes clear that performance issues are the cause here, I'll try to push for it to be released as soon as possible.

All the best,
Andrew

Ticket: https://support.symplectic.co.uk/support/tickets/248705
tomwrobel commented 4 years ago

Dear All, but particularly @jjpartridge, @mrdsaunders and @eugeniobarrio

I'm going to call these trouble objects Many-Contributor-Objects, or MCOs.

This may not be resolved on SE before Go-Live. Therefore I'd like to propose a workaround: any MCOs that cannot be deposited into ORA will be deposited manually by the ORA review team. The new ORA object will then be fed back to ORA4-SYMP for ingest into Oxris via harvest (if necessary, some joining will have to happen on the Oxris end).

The MCO in ORA will not have the full list of contributors: just enough to build the citation, list the Oxford authors, and use 'et al' for the rest.

The pathway:

1) User tries to deposit a file attached to an MCO.
2) Deposit fails (we have to let it fail because we don't know what the failure point is).
3) User raises an issue with Oxris or ORA. After liaison, Oxris and ORA reviewers identify that this is an MCO.
4) The ORA review team helps the user create a record in ORA. The ORA object has a DOI, title, and pubs-id that match the Oxris record; this on its own should be enough to link the objects on harvest.
5) The ORA object is published (if necessary).
6) The ORA object is pushed to ORA4-SYMP.
7) The Oxris harvest collects the new object from ORA and links it dynamically (we can use the 'pid list' functionality to hard-set this link).

This should enable the normal editing, reporting, and publication workflow for MCOs, and meet REF compliance. There won't be many of these (only 30 or so in 10 years), so it's tedious but it should get us by.
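The matching in step 4 of the pathway could be checked mechanically; here is a rough sketch, assuming hypothetical field names (`pubs_id`, `doi`, `title`) rather than the real record schema:

```python
# Sketch: decide whether an ORA object matches an Oxris record using the
# three fields named in step 4. Field names and normalisation are assumptions.
def normalise(value):
    return (value or "").strip().lower()

def records_match(ora_obj: dict, oxris_rec: dict) -> bool:
    # pubs-id is the strongest signal; DOI next; title alone is a weak match.
    if normalise(ora_obj.get("pubs_id")) and \
       normalise(ora_obj.get("pubs_id")) == normalise(oxris_rec.get("pubs_id")):
        return True
    if normalise(ora_obj.get("doi")) and \
       normalise(ora_obj.get("doi")) == normalise(oxris_rec.get("doi")):
        return True
    return bool(ora_obj.get("title")) and \
        normalise(ora_obj.get("title")) == normalise(oxris_rec.get("title"))

print(records_match({"doi": "10.1000/xyz"}, {"doi": "10.1000/XYZ"}))  # True
```

In practice the 'pid list' hard-set in step 7 would override any fuzziness in this comparison.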

Thoughts?

mrdsaunders commented 4 years ago

OK in principle with me, but I'd estimate the number is greater than that. There are 50 deposits from particle physics staff with ATLAS in the title (the CERN project), and I would have thought there were others. I will investigate. The advantage of the narrow topic is that we can run comms via particle physics.

mrdsaunders commented 4 years ago

Fewer than I thought, probably because they cause us all sorts of reporting problems. OAM records Live or In review, by author-list length:

- 1,000 characters: n = 338
- 10,000 characters: n = 52
- 32,767 characters (the Excel cell limit): n = 17
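Counts like these could be reproduced from an export with a small bucketing pass; a sketch, assuming the figures are at-least thresholds over a list of author-list strings:

```python
# Sketch: bucket records by author-list length, as in the counts above.
# Input is assumed to be author-list strings taken from an OAM export.
THRESHOLDS = [1_000, 10_000, 32_767]   # 32,767 = Excel cell limit

def count_over_thresholds(author_lists):
    return {t: sum(1 for s in author_lists if len(s) >= t) for t in THRESHOLDS}

sample = ["x" * 500, "x" * 2_000, "x" * 40_000]
print(count_over_thresholds(sample))   # {1000: 2, 10000: 1, 32767: 1}
```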

tomwrobel commented 4 years ago

meeting w. @eugeniobarrio 19 December, 7 step process above can be adopted for go live

jjpartridge commented 4 years ago

@mrdsaunders @tomwrobel My suggested workaround would be that a manual record is created in SE for the title. The file is deposited against that record. The record is then merged in SE with the claimed metadata with the extensive author list.

tomwrobel commented 4 years ago

@jjpartridge that would also work

mrdsaunders commented 4 years ago
Hi Dave,

I've spent a little while today attempting to performance-profile the crosswalk operation, to work out why it does not scale well when the number of authors becomes large. I don't have a clear answer yet I'm afraid, and today is my last day before I leave for a Christmas break (back first week of Jan).

One option we could consider in the shorter term is to limit the number of authors processed to, say, the first 200. This would at least prevent the timeouts from occurring, but it would leave the repository with missing author data. How much of an issue would this be for you?
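A truncation like the one suggested above might look like this, adding the 'et al' marker proposed earlier in the thread for MCOs (the cap of 200 and the function itself are illustrative, not Elements code):

```python
# Sketch: cap the processed author list at the first N contributors and
# record that truncation happened. The cap of 200 follows the suggestion above.
AUTHOR_CAP = 200

def truncate_authors(authors, cap=AUTHOR_CAP):
    """Return (kept_authors, truncated_flag)."""
    if len(authors) <= cap:
        return list(authors), False
    return list(authors[:cap]) + ["et al."], True

kept, truncated = truncate_authors([f"Author {i}" for i in range(3000)])
print(len(kept), truncated)   # 201 True
```

The `truncated_flag` would let the repository record that the stored author list is incomplete rather than silently dropping contributors.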

All the best,
Andrew
tomwrobel commented 4 years ago

This issue is now closed and merged into #163. Part of the solution is also to increase the timeout.
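For reference, on an ASP.NET stack like the one in the stack trace above, a request execution timeout is typically raised in web.config; the value here is purely illustrative, not the setting actually applied:

```xml
<!-- Illustrative only: raise the request execution timeout (in seconds). -->
<system.web>
  <httpRuntime executionTimeout="300" />
</system.web>
```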