bzz / scholar-alert-digest

Aggregate unread emails from Google Scholar alerts
Apache License 2.0

Error on extracting papers from email with 'Showing less relevant results' #76

Open tombrainbox opened 2 years ago

tombrainbox commented 2 years ago

I've noticed that any Scholar alert emails that have been configured with 'all results' rather than 'most relevant' result in an error when processed by this tool. This might be because each such email starts with:

"Showing less relevant results because there are no great results

Update alert to receive fewer, more relevant results"

Am I correct in this, and if so, would this be an easy fix to implement? Here is my command and its output (note this happens with json/html output as well as with just minimal flags):

go run main.go -l 'GScholar' -read -authors
2022/04/11 10:04:41 searching and fetching messages from Gmail: "label:GScholar is:unread"
2022/04/11 10:04:41 searching messages from Gmail: "label:GScholar is:unread"
2022/04/11 10:04:41 14 messages found (took 0 sec)
14 / 14 [-----------------------------------------------------] 100.00% ? p/s 1s
2022/04/11 10:04:42 14 messages fetched (took 0 sec)
2022/04/11 10:04:42 14 messages found&fetched with (took 0 sec)
2022/04/11 10:04:42 searching and fetching messages from Gmail: "label:GScholar is:read"
2022/04/11 10:04:42 searching messages from Gmail: "label:GScholar is:read"
2022/04/11 10:04:42 1 messages found (took 0 sec)
1 / 1 [-------------------------------------------------------] 100.00% ? p/s 0s
2022/04/11 10:04:42 1 messages fetched (took 0 sec)
2022/04/11 10:04:42 1 messages found&fetched with (took 0 sec)
2022/04/11 10:04:42 rendering 2 papers
# Google Scholar Alert Digest

**Date**: 2022-04-11T10:04:42+01:00
**Unread emails**: 14
**Paper titles**: 2
**Uniq paper titles**: 2

## New papers

 - [Cerebellar Transcranial Magnetic Stimulation (TMS) Impairs Visual Working Memory](https://link.springer.com/article/10.1007/s12311-022-01396-2), <i>N Viñas</i> (1)
   <details>
     <summary>… As a precaution, the coil was positioned using the Brainsight navigator and the</summary>
     <div>experimenter monitored for potential deviation of the target, the “bullseye,” and maintained the coil position targeting the cerebellum targets if needed. Details of this …</div>
   </details>

 - [Short-term facilitation effects elicited by cortical priming through theta burst stimulation and functional electrical stimulation of upper-limb muscles](https://link.springer.com/article/10.1007/s00221-022-06353-3), <i>Update Alert To Receive Fewer, More Relevant Results</i> (1)
   <details>
     <summary>… The coil position and orientation were monitored throughout the experiment using a</summary>
     <div>neuronavigation system (Brainsight, Rogue Research, Montreal, Canada). Ten TMS stimuli, with approximately 5–7 s inter-stimulus intervals, were delivered for …</div>
   </details>

## Old papers

<details id="archive">
  <summary>Archive</summary>

</details>
2022/04/11 10:04:42 Errors: 13

bzz commented 2 years ago

That seems like a bug, thank you for catching it, @tombrainbox!

This bug is caused by a different email HTML template, used in specific cases, that includes "Showing less relevant results because there are no great results". I was able to find such emails (only 7 out of ~2k 'all results' emails in my case) and reproduce the failure.

Such a template seems to include an extra "hidden" paper 🤯, a duplicate of the first one, which for some obscure reason our XPath library is not able to match //h3/a/@href against :/ This leads to an error at https://github.com/bzz/scholar-alert-digest/blob/7d2e4de957edf2864360a95b579fc919e9fd561f/papers/papers.go#L137, which results in the whole email's content being skipped from the aggregation.

This is weird, since an XPath browser extension and the default search in Chromium both return the right number of titles and URLs for the same expressions! So, most probably, this has to do with the logic in https://github.com/antchfx/htmlquery 😕 and a fix would require us to introduce some unit tests that first reproduce it precisely without touching the Gmail API (example).

#79 has the instructions on localising this bug, and I will look more into this when time permits. Meanwhile, any attempt to take a stab at digging deeper and reporting the results here, sending a PR with a reproducing test, or sharing ideas on a possible heuristic for a workaround would be much appreciated!
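
One possible workaround heuristic, as a minimal sketch only. It assumes the failure comes from the extracted title count exceeding the URL count because of the duplicated "hidden" first paper, which has not been confirmed against the code. Under that assumption, duplicate titles could be dropped before pairing, instead of skipping the whole email. The `paper` type, `pairTitlesAndURLs`, `dedupe`, and `errMismatch` are hypothetical names used for illustration and are not taken from papers/papers.go.

```go
package main

import (
	"errors"
	"fmt"
)

// paper is a hypothetical stand-in for the tool's own paper type.
type paper struct {
	Title string
	URL   string
}

// errMismatch is returned when titles and URLs still do not line up.
var errMismatch = errors.New("titles and URLs do not line up")

// dedupe returns items with repeated values removed, keeping the first
// occurrence of each value and preserving the original order.
func dedupe(items []string) []string {
	seen := make(map[string]bool, len(items))
	out := make([]string, 0, len(items))
	for _, it := range items {
		if seen[it] {
			continue
		}
		seen[it] = true
		out = append(out, it)
	}
	return out
}

// pairTitlesAndURLs zips titles with URLs. Assumption: when there are more
// titles than URLs, the surplus is the "hidden" duplicate paper, so repeated
// titles are dropped before giving up on the email.
func pairTitlesAndURLs(titles, urls []string) ([]paper, error) {
	if len(titles) > len(urls) {
		titles = dedupe(titles)
	}
	if len(titles) != len(urls) {
		return nil, fmt.Errorf("%d titles but %d URLs: %w", len(titles), len(urls), errMismatch)
	}
	papers := make([]paper, 0, len(titles))
	for i, t := range titles {
		papers = append(papers, paper{Title: t, URL: urls[i]})
	}
	return papers, nil
}

func main() {
	// Made-up input mimicking the suspected shape: first title duplicated.
	titles := []string{"Paper A", "Paper A", "Paper B"}
	urls := []string{"https://example.org/a", "https://example.org/b"}

	papers, err := pairTitlesAndURLs(titles, urls)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, p := range papers {
		fmt.Printf("%s -> %s\n", p.Title, p.URL)
	}
}
```

The deduplication only kicks in when the counts disagree, so emails that already parse cleanly are untouched. The trade-off is that two genuinely distinct papers with identical titles in a mismatching email would be merged, so this is a stopgap rather than a substitute for finding out why the href is not matched.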