Xunius / Menotexport

Python solution to export annotations from your Mendeley library.
GNU General Public License v3.0
124 stars 20 forks source link

Wrong coordinate ordering in "Rectangle Highlight" #29

Closed matteosecli closed 6 years ago

matteosecli commented 6 years ago

Symptoms

The highlights made with the "Highlight Rectangle" tool in Mendeley reader looked quite odd in some readers after the export with Menotexport. In Adobe Reader, the left and right sides of the rectangle were bent inward; with Preview.app, the rectangle was just shifted upwards; with Poppler (Evince, Okular, etc) the rectangle had strangely shaped corners (look at screenshots below). The rectangles in other readers, instead, looked just fine.

Issue

From my understanding, Adobe Reader wants the /QuadPoints in a weird order: top-left, top-right, bottom-left, bottom-right. I see that the implementation in Menotexport follows this convention, that other readers also adopt (like Preview.app and Poppler). However, the highlight-making function was called with the assumption that the rectangle coordinates stored in Mendeley's database are always in the order [ x0, y0, x1, y1 ] where (x0, y0) are the lower-left corner coordinates and (x1, y1) are the top-right corner coordinates. Unfortunately, I've discovered that this is not always the case.

It seems that the storage ordering is correct if the highlights were made with the "Highlight Text" tool; the ones made with the "Highlight Rectangle" tool, instead, have a wrong ordering (at least in my case). I have to say that I always draw rectangle highlights by dragging from the top-left corner to the bottom-right, and this is indeed the ordering that I've found in the table; I don't know if this is a coincidence or not, but it doesn't matter at this point.

Solution

I've tested a small modification in the getHighlights() function; If you add after the line

        bbox = [r[2], r[3], r[4], r[5]] 

the following two lines

        if bbox[0] > bbox[2]: bbox[0], bbox[2] = bbox[2], bbox[0]
        if bbox[1] > bbox[3]: bbox[1], bbox[3] = bbox[3], bbox[1]

which restore the correct ordering if it's wrong, then the highlights looks just fine.

I've attached below some screenshots: on the left there is one of my test documents with the odd-looking highlights and on the right the result after the fix.

I've not submitted this fix as a PR because you've said you're reworking the code a little bit, so I've directly posted the snippet here.


Further Notes

The highlights still look a little bit odd; in Preview.app the biggest rectangle has somehow gained some extra height, while in Adobe Reader (and in Poppler) the highlights are styled in a way that tries to give a more "highlight made with a real marker" style, with rounded lateral sides, but fails when the highlight are tall rectangles.

By digging around, I've found that some readers style the highlight just as a plain rectangle, which reproduces much more closely the look that you have in Mendeley. I think it's worth listing them:

(*) The highlights look just the same, but the sticky notes don't work at all.

(**) The previous ones are not picky about the ordering of QuadPoints; Preview.app, instead, is quite picky and as you see some rectangles may still look weird even after the fix.

Upcoming fix

The solution is not to let the reader style the annotations on its own, i.e. to provide clear styling instructions for the annotations to the reader. This allows to unify the aspect of the annotations on readers that support this styling, like Adobe Reader and (partially) other readers. In some cases, like in the one of Firefox's PDF reader, the annotations even become finally visible (currently, if you open one of the PDFs made with Menotexport in Firefox, you won't see any annotation).

Just as an example, this is how the same test PDF as before looks like (in Adobe Acrobat Reader DC) after some styling of the annotations:

schermata 2018-07-23 alle 14 11 17

As you see, now all the highlights (text or rectangle) have straight sides and the appearance of the sticky note was slightly modified.

More on that coming soon!

Xunius commented 6 years ago

Hi matteosecli,

That's quite impressive work! I'll try to get my on-going fix done by the end of this week, there are multiple changes so I want to test it a bit more, and double check the docs and comments (now I'm now working alone). I'll drop you a notice when I'm done then we can start working on your fixes.

matteosecli commented 6 years ago

Sure, that sounds perfect! 😉

Xunius commented 6 years ago

Hi matteosecli, I pushed a new commit to the dev branch as I'm still not quite sure it doesn't break, in particular, I found the DoumentNotes table in the sqlite data is completely empty in my database file, yet Mendeley still displays those notes which is really weird. So menotexport can't fetch any side-bar notes and I can't do any tests on them at the moment, but they form a particular scenario that side-bar notes can be made on documents that have no attachments, then the handling of highlight/note paths (I've associated the file path to notes/highlights to address the multi-attachment issue) and related things needs to be test. So I wonder if you could pull a new copy from the dev branch and help double checking it?

matteosecli commented 6 years ago

Sure, I'll test asap in one of the coming days!

matteosecli commented 6 years ago

Hi @Xunius, I've finally found some time to test.

I've used the GUI to select my "Test Folder" with a few documents inside, and I've checked only the box "Extract Notes". The program, unfortunately, hangs at the first document and the terminal reports the following error:

Exception in thread work:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "menotexport-gui.py", line 73, in run
    menotexport.main(*self.args)
  File "/home/osboxes/Menotexport/menotexport.py", line 1300, in main
    separate,iszotero,verbose)
  File "/home/osboxes/Menotexport/menotexport.py", line 1044, in processDocs
    annotations,flist=extractAnnos(annotations,action,verbose)
  File "/home/osboxes/Menotexport/menotexport.py", line 893, in extractAnnos
    if extracthl2.checkPdftotext():
UnboundLocalError: local variable 'extracthl2' referenced before assignment

I also confirm that, although I have a test document with "General Notes" in the sidebar, my DocumentNotes table is still empty. Instead, I've found that the "General Notes" are stored in the column note of the Documents table, at least in the latest Mendeley version without the encrypted database (version 1.18.1).

The annotation extraction from multi-document entries (in my case, the ones that weren't working when I reported https://github.com/Xunius/Menotexport/issues/26) seems to work like a charm! 😄

Xunius commented 6 years ago

Hi matteosecli,

Many thanks for the help. The extracthl2 error was a stupid mistake but can be easily fixed.

Yes I found my "General Notes" in the "note" column instead of of the "DocumentNotes" table (I'm on 1.17.9), they must have made a change somewhere before that version. But, I also found in many documents the "note" column is filled up with doi numbers, some with the same value as the "doi" column (e.g. 10.1126/science.aab3050), some with a doi: prefix (e.g. note=doi: 10.1021/ed020p517.1, doi= 10.1021/ed020p517.1), and some with an empty "doi" column and has the doi in the "note" column (e.g. note=10.1126/science.1116448, doi = ''), hope I'm not confusing you, it's quite a mess.

Does your table has such a doi mess? I'm planning to do a regex matching of the "note" string and exclude those matching the doi pattern from "General Notes". I should be able to push the commit later today.

I probably have to revisit some old versions as we probably need to support both note extraction (from the "DocumentNotes" table and "Documents-note" column). btw, do you happen to know is it possible to fetch the Mendeley version from the sqlite data file? I didn't manage to find the relevant info.

matteosecli commented 6 years ago

Out of the 338 documents in my library, only 15 entries are not null. Out of these:

I'm not sure is these fields were compiled by Mendeley itself or they result from an import of a bib file created with another program, that maybe just put these info in the standard BibTeX note field instead of using a dedicated field. In general I've imported very few documents from bib files into Mendeley, so maybe that's the reason I have so few of these "weird" notes.

I don't know if you can fetch Mendeley's version from the database file. But maybe the info in the table SchemaVersion could be useful? My table, on Mendeley 1.18.1, is

key value
schemaVersion 84
compatibleSchemaVersion 75
lastRunSchemaVersion 84
onlineDataVersion 2
needsRepair 0

I should have an older Mendeley version on another computer; I can check the same table on that version and then report back here.

Xunius commented 6 years ago

You are probably right that these "weird" notes come from imported bib or ris files. I just did a few tests, by adding an annote = {{note1}} entry in a bib file, or a N1 - note1 entry in a ris file, and importing it in Mendeley, the "General Notes" field will contain "note1". So they very likely came from imports, and will be whatever the citation provider puts in.

So, should we just treat them as legit notes or filter them? Because clearly it's not just doi strings so no easy way to distinguish them from user's notes.

Regarding Mendeley version, in my 1.17.9 version, I have these:

key value
schemaVersion 83
compatibleSchemaVersion 75
lastRunSchemaVersion 83
onlineDataVersion 2

No needsRepair line but it's beside the point. Actually I might just as well ignore the version and fetch whichever (Documents.note or DocumentNotes) has anything.

matteosecli commented 6 years ago

So, should we just treat them as legit notes or filter them? Because clearly it's not just doi strings so no easy way to distinguish them from user's notes.

You're right, and that's the main reason I'd prefer not to do any filtering. It should be a user's responsibility to manually check the metadata in Mendeley itself and keep it as organized as possible; though a feature that would track down wrongly placed metadata (e.g. a doi in the notes and not in the doi field) and fix them would be useful, I'd say it's beyond the scope of Menotexport. And there's also the risk that this kind of filtering messes up with somebody's legit notes (maybe one wants to e.g. put alternative ISBN codes in the notes) and causes a user data loss.

Regarding Mendeley's version, I have a table with the same values as yours in Mendeley 1.17.11, which unfortunately doesn't help. Do you have some old databases that actually contain something in DocumentNotes?

An alternative would be to download old Mendeley versions in a virtual machine up to the point in which the notes in Documents.note are found in DocumentNotes.

Xunius commented 6 years ago

I didn't notice your comment until I've pushed a commit a few minutes ago, and in the commit I added a DOI filtering, but that could be easily removed should we decide to.

I'm pretty sure some old version saves notes in the DocumentNotes table because that's where I originally fetched something when writing this tool. A bit unfortunate that I haven't noticed this change for so long and nobody is complaining about it. My current strategy is query both (Documents.note and DocumentNotes) and combine them, as I assume only one should contain contents.

I feel this commit pretty much finalizes my fix for the multi-attachment issue and I'm preparing to do a merge back to the master. What do you think? UPDATE: I've merged.

matteosecli commented 6 years ago

I didn't notice your comment until I've pushed a commit a few minutes ago, and in the commit I added a DOI filtering, but that could be easily removed should we decide to.

I would still prefer not to have any filtering, but since you've already done the work this mechanism could be put behind an option and activated at user's will. It could even go further and "fix" the metadata by e.g. filling in the correct doi field taken from the note and then delete the note content.

My current strategy is query both (Documents.note and DocumentNotes) and combine them, as I assume only one should contain contents.

Sounds reasonable!

I feel this commit pretty much finalizes my fix for the multi-attachment issue and I'm preparing to do a merge back to the master. What do you think? UPDATE: I've merged.

Seems good to me; I'll do some more tests with the new master code and report here if there are any problems! 😉


In the meantime, I'm testing old Mendeley versions trying to figure out where the change happened. I couldn't manage to download (yet) versions prior 1.8, so I'm testing from 1.8 upwards. So far, I've discovered some interesting stuff (the numbers in parenthesis are in the form schemaVersion/compatibleSchemaVersion of a clean database created with that version of Mendeley).

Since the old custom tags are gone and now the layout is done with regular HTML tags, I'd say that we should also "normalize" the layout in such a way that the final result is the same, no matter if the same note was created in Mendeley 1.8 or in Mendeley 1.17. Newer notes are already formatted with HTML tags, so maybe a reasonable option would be to transform the old layout and alignment tags in regular HTML tags. What do you think?

Xunius commented 6 years ago

Hi matteosecli,

Many thanks for all the hard work, I now feel sorry for taking you so much time on such trivial things. I tend to suggest that lets not look beyond v1.8 which is nearly 8 years old now and probably very few people are still using. Regarding the saving location of notes, I think we could put each query (DocumentNotes and Documents.note in a try-exception block, so if either one fails it must be the other one. The formatting tags, I agree that it would be nicer to normalize to the HTML formats. I'm thinking doing that using regex, to replace <m:bold>Bold</m:bold> -> <bold>Bold</bold>, shouldn't be too tricky I think.

matteosecli commented 6 years ago

Regarding the saving location of notes, I think we could put each query (DocumentNotes and Documents.note in a try-exception block, so if either one fails it must be the other one.

I agree in principle; I'm still not sure if there could be notes saved in both places. Maybe you created some notes with an older version, which are stored in DocumentNotes, and then upgraded to a newer version of Mendeley and created other notes (which ended up in the "new" location Documents.note). So maybe this kind of database could have both kind of notes. I'm pretty sure that if instead you re-synchronize your database from scratch, then the new location is used during the creation of the database.

The formatting tags, I agree that it would be nicer to normalize to the HTML formats. I'm thinking doing that using regex, to replace Bold</m:bold> -> Bold, shouldn't be too tricky I think.

I think it would be perfect!


PS: I've updated my previous comment with the new info I've got.

Xunius commented 6 years ago

Hi matteosecli,

New push removes the m: from <m:tags> and addresses the highlight/note author issue (I didn't use the global dictionary method you suggested, as I found no notable speed difference querying 2 more sqlite columns), but you'll have to do a bit of test for me as I don't have a cooperative author. Many thanks!

matteosecli commented 6 years ago

Hi @Xunius,

I've tested the latest master version and that's great! Now it works correctly, as far as I've tested (multi-documents and multi-users).

With the latest version, I think I can close https://github.com/Xunius/Menotexport/issues/26 and https://github.com/Xunius/Menotexport/pull/27 as well and focus on this issue.

matteosecli commented 6 years ago

I've opened a PR that should close this issue when merged.

I'll open another issue to discuss the part about the styling of annotations, that should (at least partially) solve appearance inconsistencies between different readers (but that will require a bit of thinking and discussion).