Xunius / Menotexport

Python solution to export annotations from your Mendeley library.
GNU General Public License v3.0

Export author in highlights and sticky notes #27

Closed matteosecli closed 6 years ago

matteosecli commented 6 years ago

I noticed that when I opened an exported PDF file with a PDF reader, the author field was empty. So I looked into my Mendeley database and found that the author field was empty there, too; however, there was a profileUuid that links to profile information (and that doesn't seem to change even if you change your email address).

I've added this information into the query to the database; the author is by default the author field in the database; however, if this field is empty, the author's name is constructed by merging the firstName and lastName fields of the database.
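A minimal sketch of the lookup described above, run against an in-memory SQLite database rather than a real Mendeley one. The names `FileHighlights`, `Profiles`, `author`, `profileUuid`, `firstName` and `lastName` come from this thread; the `uuid` key column on `Profiles`, the `id` column, the function name and the sample rows are assumptions made for illustration. The sketch also assumes both name fields are populated.

```python
import sqlite3


def highlight_authors(conn):
    """Return {highlight id: author name}, falling back to the profile name."""
    query = """
        SELECT h.id, h.author, p.firstName, p.lastName
        FROM FileHighlights AS h
        LEFT JOIN Profiles AS p ON p.uuid = h.profileUuid
    """
    authors = {}
    for hid, author, first, last in conn.execute(query):
        if author:
            # Use the stored author when it is present and non-empty.
            authors[hid] = author
        else:
            # Otherwise merge the first and last name from the profile.
            authors[hid] = "{} {}".format(first, last)
    return authors


# Stand-in database: this schema is hypothetical, not Mendeley's actual one.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Profiles (uuid TEXT PRIMARY KEY, firstName TEXT, lastName TEXT);
    CREATE TABLE FileHighlights (id INTEGER PRIMARY KEY, author TEXT, profileUuid TEXT);
    INSERT INTO Profiles VALUES ('u-1', 'Ada', 'Lovelace');
    INSERT INTO FileHighlights VALUES (1, '', 'u-1');
    INSERT INTO FileHighlights VALUES (2, 'Someone Else', 'u-1');
""")
result = highlight_authors(conn)
print(result)
```

Highlight 1 has an empty author and so takes the profile name, while highlight 2 keeps the author already stored with it.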

I've also added an author field to the highlights dictionary because it was completely missing, although the relevant function pdfannotation.createHighlight() already had all the things in place.

I've tested the changes with my database and Mendeley 1.18, although I don't know whether these fields were present or not in much older versions of Mendeley's databases (imho I don't think so, because an account was required even for much older versions; I've checked on Mendeley 1.17.11 and they are there).

I'm not a Python expert, and even less an SQL expert (I've never used SQL in my life), so I apologize in advance for any mistakes!


PS: The ordering of the new fields looks a bit random, but it's just because I've tried to change the code as little as possible.

Xunius commented 6 years ago

Hi matteosecli, Many thanks for the work. Just to clarify, by 'author' I think you meant the author that created the highlights, not the article? In that case, I've already retrieved author in the getUserName() function in menotexport.py, and the author field is then passed onto the meta attribute associated with extracted annotations and can be accessed from there easily. So no need to do that again in the queries you modified.

I haven't tested, but I think the only change needed is something like this in exportPdf():

anno=pdfannotation.createHighlight(hjj['rect'], cdate=hjj['cdate'], color=hjj['color'], author=annotations.meta['user_name'])

and similar to the createNote() part.

Would you like to give it a try and see whether that does what you want?

matteosecli commented 6 years ago

Hi @Xunius, sorry for the (very) late reply; I was waiting for a friend of mine to help me carry out some tests.

I've set up a collaborative folder in Mendeley in order to test what happens when multiple people annotate the same document; I'll go step by step.

> Just to clarify, by 'author' I think you meant the author that created the highlights, not the article?

Yes, by 'author' I mean the Mendeley user that created the highlight, not the author of the article.

> In that case, I've already retrieved author in the getUserName() function in menotexport.py, and the author field is then passed onto the meta attribute associated with extracted annotations and can be accessed from there easily.

I have to say that I actually missed that! However, there are a couple of objections:

As a final note, I've realized that if firstName or lastName were NULL, the script was crashing on the lines that join them – the ones that I've added. So, I've slightly modified the joining procedure in order to skip these NULL fields – and so far in my tests, it seems to work.
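One way to skip NULL name fields when joining, as a sketch only: the helper name `join_name` is hypothetical and this is not the code from the PR. SQLite NULLs arrive in Python as `None`, and concatenating `None` with a string raises a `TypeError`, which matches the kind of crash described above.

```python
def join_name(first_name, last_name):
    """Join name parts with a single space, skipping None or blank parts."""
    # Drop None and empty strings first, then strip stray whitespace.
    parts = [part.strip() for part in (first_name, last_name) if part]
    # A part made only of whitespace is empty after stripping, so filter again.
    return " ".join(part for part in parts if part)


full = join_name("Ada", "Lovelace")
first_only = join_name("Ada", None)
last_only = join_name(None, "Lovelace")
nothing = join_name(None, None)
print([full, first_only, last_only, nothing])
```

When both fields are missing the helper returns an empty string, so the caller can still decide on a fallback author.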

Xunius commented 6 years ago

Hi matteosecli,

Many thanks for coming back and for all the work. I've merged your PR #28. I haven't used collaboration in Mendeley, so that's a use case that has been largely neglected, and I think you made a valid point that the highlight authors (and note authors as well, right?) would be different in that use case. But the thing is, I thought you wouldn't be coming back, so I've made quite a few changes to the code (to address the multiple-attachment issue), and now getHighlights(), getNotes() and some other functions look a bit different. To be honest, I'm not very experienced in handling conflicting merges; I believe your changes would be based on an old base if I commit mine. So, do you think it would be better to let me finish my multi-attachment fix and incorporate your changes myself (to save your time), or would you rather wait for my commit and redo your changes (so you retain the credit)?

Thanks again for the contribution.

matteosecli commented 6 years ago

By multiple attachment issue you are referring to https://github.com/Xunius/Menotexport/issues/26, right?

Anyway, for me it's OK either way! I don't know the extent of the new changes; I think this PR can wait until you commit all the new changes related to that issue, and then we can see whether it has conflicts at that point. If it has conflicts that are better resolved by rewriting these changes from scratch and you think it would be more efficient for you to incorporate them directly, that's totally fine with me! Or if you don't have time, I can make a new PR as well. 😉

So, I'd say to check back here once you commit those changes.

In the meantime, I'm looking into something else – which I hope I can report in a new thread for a discussion as soon as I have some minimal working examples/data.

Xunius commented 6 years ago

Hi matteosecli

I've implemented your suggested changes (with minor differences) and pushed. Here is what I did:

Would you like to give it a test and see whether it works as intended?

matteosecli commented 6 years ago

Hi @Xunius,

I've tested the latest master version. Now all the highlights have a non-empty author, but the problem is that the author is always myself even if the highlight was added by another person!

I'll try to explain better, maybe I was a bit messy last time. What I was doing was the following:

So, your lines https://github.com/Xunius/Menotexport/blob/6a8ba455744fb0c4217832980b7d87dd570cb610/menotexport.py#L453-L455 always set myself as the author of every highlight. Instead the change I was proposing, i.e. the lines https://github.com/Xunius/Menotexport/blob/fa6f6448cf3d02858de3b00f463d0c2c609773d5/menotexport.py#L411-L414 set the name linked to the highlight's profileUuid as the correct author of the highlight.
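The difference between the two approaches can be sketched as follows. This is an illustration only, not the code at either linked revision: the dict-based highlight records, the `profiles` mapping, the function name `assign_authors` and the sample names are all hypothetical. Only the idea of resolving the author through each highlight's own profileUuid comes from this thread.

```python
def assign_authors(highlights, profiles, fallback_name):
    """Set each highlight's author from its own profileUuid.

    profiles maps profileUuid -> display name; fallback_name is used only
    when a highlight has no known profile.
    """
    for highlight in highlights:
        uuid = highlight.get("profileUuid")
        highlight["author"] = profiles.get(uuid, fallback_name)
    return highlights


# Hypothetical data: two people annotated the same shared document.
profiles = {"u-1": "Me Myself", "u-2": "My Friend"}
highlights = [
    {"page": 1, "profileUuid": "u-1"},
    {"page": 2, "profileUuid": "u-2"},
]
assign_authors(highlights, profiles, fallback_name="Me Myself")
authors = [highlight["author"] for highlight in highlights]
print(authors)
```

Setting one global user name on every highlight would instead label both highlights with the same author, which is the behaviour reported above.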

So

What do you think would be better to do? If you don't want to rewrite these bits and you are not planning any incompatible changes, I can make these modifications, test them, and send them as a PR in the next few days. If instead you prefer to write the code directly, I'm always here to test anyway. 😉

Xunius commented 6 years ago

I get your point now. I had never done any co-authoring before, so I thought that was why my FileHighlights.author is always empty. It appears that it's indeed necessary to query the Profiles table, but I wasn't expecting that doing this would make it much slower. My profiling suggests the slowest parts are those relating to PDF processing via PyPDF2 and pdfminer; I tried to multi-thread some function calls but only got a negligible speed gain (maybe I'm doing something wrong). But your dictionary idea sounds good.

I have some free time tomorrow, so how about me re-doing these changes as we understand each other quite well already, and I'll let you continue on the "Wrong coordinate ordering in "Rectangle Highlight"" fix (https://github.com/Xunius/Menotexport/issues/29) as you have a much better understanding in that regard?

matteosecli commented 6 years ago

> I have some free time tomorrow, so how about me re-doing these changes as we understand each other quite well already, and I'll let you continue on the "Wrong coordinate ordering in "Rectangle Highlight"" fix (#29) as you have a much better understanding in that regard?

That sounds totally fine for me! 😃

matteosecli commented 6 years ago

Closing since these changes, with the discussion that followed in this thread, were reimplemented by @Xunius in https://github.com/Xunius/Menotexport/commit/6a8ba455744fb0c4217832980b7d87dd570cb610 and https://github.com/Xunius/Menotexport/commit/e6eebeba170356b9a0f31dc22bb7621d2dd93138.