eprints / eprints3.4

EPrints 3.4 core and releases
http://www.eprints.org/uk/index.php/eprints-3-4/
GNU Lesser General Public License v3.0
31 stars 28 forks source link

Incorrect Whitespace in Emails #394

Closed wfyson closed 2 months ago

wfyson commented 3 months ago

Small problem I've come across is the addition of incorrect whitespace in emails. E.g. the default EPrints citation in a saved search email might list the authors as follows:

Fyson, Will , Newman, David (2024) Test Article

With the problem being the extra spaces around the comma separating the two authors.

It looks like this arises as a combination of two things...

1) Rendering of names with a span around each name as defined in Repository.pm at https://github.com/eprints/eprints3.4/blob/77e0584cda4c0896f02920afcd34860b769e1a4a/perl_lib/EPrints/Repository.pm#L3639-L3650

2) "sub build_email" using EPrints::XML::tidy to clean up the XML (https://github.com/eprints/eprints3.4/blob/77e0584cda4c0896f02920afcd34860b769e1a4a/perl_lib/EPrints/Email.pm#L310). This has the effect of putting each new element on a new line so the spans and commas get separated apart on new lines which when put together for the email by https://github.com/eprints/eprints3.4/blob/77e0584cda4c0896f02920afcd34860b769e1a4a/perl_lib/EPrints/Email.pm#L311 results in these extra spaces.

Interestingly the tidy function specifically says not to use in the context of XHTML where white space may matter: https://github.com/eprints/eprints3.4/blob/77e0584cda4c0896f02920afcd34860b769e1a4a/perl_lib/EPrints/XML.pm#L775-L795

So I wonder if we need a new XML::XHTML_Tidy or something similar that performs this tidying up but in a slightly more HTML whitespace-respecting way?

ajmetz commented 3 months ago

The text part uses tree_to_utf8 and the html part uses to_string. I'd need to do some dumps or see example emails, to see if the spaces around the comma are in both text and html parts - or only one part.

Do you know if the problem is in text parts, html parts, or both parts?

That will help you decide it you want a different approach to tidying, or a different approach at some part specific level.

I don't believe I had this problem when testing emails for the various Orcid in Email Context jobs I did. Of course I could be wrong - it was back in Jan/Feb so has been a while now! (Update: There still appears to be a person span around each person - so it probably still has the problem).

What that did, was use a different layout for rendering a name with orcid if the citation was being rendered in a context relevant to preparing an email. https://github.com/eprints/orcid_support/pull/33

Obviously, merging that pull request, for orcid support, and relying on then rendering citations with a specific email context in mind, would have no effect where orcid rendering was not in use, and certainly would have no effect if email context doesn't fix the problem due to a person span still added, so you may indeed still need a more broader solution as mentioned.

ajmetz commented 3 months ago

Just updated the above comment after spotting a person span is still added in the orcid_support, and that may translate to a space still. So all I can say is - I have no idea what I'm talking about until I test it on my own eprints and see what I see.

drn05r commented 2 months ago

@wfyson What does the ASCII citation export for this example look like? Does this have a problem with spacing? Assuming it does not, as we do not need an HTML format of the citation, which the use tree_to_utf8 suggests is the case, might we not be better off recoding to do whatever the ASCII citation does? If we want to generate an HTML format for a multi-part email this may be trickier.

drn05r commented 2 months ago

@wfyson I have done a test with a saved search email and I can see the issue does exist on my test repository and the white space issue is a problem on both the HTML and plaintext parts of the email. Each author is still encapsulated in a span of class person_name so it is not that that is removed resulting in an extra space being added.

After doing some testing if you comment out the EPrints::XML::tidy call in Email.pm it does not change anything visually apart from the incorrect white space. Also looking at the source there only seems to be minor whitespace differences in the source (some separate the the whitespace that can be seen in the email). It does not really tidying anything up AFAICS. Similar the plaintext version is not noticeable affected apart from fixing the errant whitespace.

Therefore, I propose permanently removing the tidy call. It would be useful to test other types of email to see if this has any impact. Also my saved search email only had one results. Maybe the tidy is more useful when there is lots of results.

ORCIDs also seems like something that might break this. I know you can set a different citation style to be used for saved search emails, I don't think there is currently aware of a way of doing this for request a copy emails. A little while ago I added https://wiki.eprints.org/w/Citation_default.pl but this does not cover citation for emails generically (i.e. saved search, editorial alerts, request copy, etc.), which clearly would be useful.