Bringing Crossref, Semantic Scholar, Open Citations and Open Alex lookup + auto-import + redesign to Cita for Zotero 7

thebluepotato commented 2 months ago

Fixes #51, fixes #134, fixes #168, fixes #173

Hi! I adapted the (now stale) PR #139 to the new Zotero 7 branch so it has a chance to be swept up in the new release. The general logic is unchanged from the other PR, but I've made quite a few updates for efficiency, code clarity and type safety as well as fixed a few failing Promises here and there. I've tested quite a bit already, but it could definitely use more in-depth testing.

And I've also added a button to citations to auto-import that reference into Zotero with one click and then link it. It's similar to what https://github.com/MuiseDestiny/zotero-reference does, but I find that addon confusing at best and it doesn't help that all the info is in Mandarin Chinese...

All in all, probably still a WIP, but happy to receive code reviews and have some people test this!

Dominic-DallOsto commented 1 month ago

Thanks a lot for this!

It'll take me a little while to review this in detail sorry, but this is great!

thebluepotato commented 1 month ago

One thing that could/should be considered, is that while adding the references for which Crossref has a DOI or ISBN is quite robust, adding items as book or journal merely on the title that they have is unsatisfactory. For instance, with DOI:10.1145/2786451.2786465, some of the references are sections from the same book (a different author per section), yet they all appear in Crossref as Author + Book title (instead of section title). Maybe it should be up to the user to enable what is actually imported.

thebluepotato commented 1 month ago

The latest commit adds a new IndexerBase abstract class that abstracts the common logic between various "indexers" (couldn't think of a better name). This allows us to more simply add various such "indexers", which now includes Semantic Scholar and Open Alex as well. They all have their pros and cons, but this should give the user a lot of options to automatically fetch these citations.

Based on initial (limited) experimentation:

Crossref: citations seem more "official" than the other sources, but not all items with DOIs have references
Semantic Scholar: because it analyses the indexed papers, it includes many references, but also some random entries that are not actually cited
Open Alex: has usually fewer citations than the others

One issue that this "abstraction" brings is that the context menu when clicking on an item shows the translation keys instead of the corresponding strings.

Dominic-DallOsto commented 1 month ago

Hi, I just had a chance to quickly test this and so far things look nice, thanks so much! I haven't been able to fully review the code yet but here are some observations from testing.

Openalex build error

I get the following build error at the moment because of the openalex-sdk. Did you encounter this on your end?

    node_modules/openalex-sdk/dist/src/utils/works.js:7:37:
      7 │ const fs_1 = __importDefault(require("fs"));
        ╵                                      ~~~~

  The package "fs" wasn't found on the file system but is built into node. Are you trying to bundle
  for node? You can use "platform: 'node'" to do that, which will remove this error.

I removed the openalex SDK to test a bit further.

Auto import citations

Firstly, the auto import by identifier button is really nice! It would solve #40. One thing that might also be nice is, if the citation already has a QID attached, that this should be applied to the newly created item when it's imported?

Getting Crossref citations

Testing the auto import of citations from crossref I found some bugs, but they're mostly related to crossref's data so it was just unlucky I happened to pick a bad item haha

Add this item by DOI - 10.1007/BF01700692
Get citations from crossref
- newlines in text aren't rendered properly
- it says I will get 64 citations

Press OK
- actually I only get 2 citations, and they're both the same
  - Checking the API response, this is actually a crossref problem:
  - we get a response with 64 citations, but 62 are unstructured - maybe the message could be edited to exclude unstructured citations if we don't attempt to parse them, or a message after importing could say "imported 2/64 citations from crossref"
  - here crossref is just a bit strange in that 2 of the references have the same DOI. Could we check for duplicates within the crossref response and remove them?

Getting Semantic Scholar citations

I tested with using the item with DOI - 10.1109/ITW.2015.7133169. It got 11/14 citations because 3 had no identifiers in semantic scholar. The request was very slow though compared to getting citations from crossref. Here is an overview of the timing.

The slowdown is because the requests to arxiv are really slow. I tested the same request in the browser and it also took ~10 seconds to complete, so it doesn't seem that this is problem with Cita. Does arxiv have an alternative (faster) API? Maybe a workaround would be to update the progress message with the number of citations already downloaded, so users can see that things are progressing?

thebluepotato commented 1 month ago

Openalex build error

I get the following build error at the moment because of the openalex-sdk. Did you encounter this on your end?

Yes sorry, I'm actually entirely new to npm so I forgot to commit the patch to openalex-sdk, fixed in latest commit.

Auto import citations

Firstly, the auto import by identifier button is really nice! It would solve #40. One thing that might also be nice is, if the citation already has a QID attached, that this should be applied to the newly created item when it's imported?

I didn't really look into the Wikidata side of things, but will definitely look into ensuring the QID is imported as well. Is it usually stored in the Extra field?

[x] Import QID

Getting Crossref citations

Testing the auto import of citations from crossref I found some bugs, but they're mostly related to crossref's data so it was just unlucky I happened to pick a bad item haha

[x] Get newlines to show in the alert
[x] Rephrase alert to clarify (parsed does not mean the citations will be added in the end, rephrase)
[x] Apply duplicate filter to the citations to be added as well

Getting Semantic Scholar citations

I tested with using the item with DOI - 10.1109/ITW.2015.7133169. It got 11/14 citations because 3 had no identifiers in semantic scholar. The request was very slow though compared to getting citations from crossref. Here is an overview of the timing.

As it currently stands, the PR relies heavily on Zotero's own existing translators to avoid doing too much heavy lifting and to avoid code duplication. Therefore, if it's slow to import with Cita, it's also slow to import when using the "magic wand" tool that imports items based on their identifiers. Will look into alternatives, but it seems likely that Zotero's own translator is already quite optimized as it is.

thebluepotato commented 1 month ago

Regarding arXiv, I updated the translator locally (see: https://github.com/zotero/translators/pull/3366) to use another endpoint which, based on limited testing, should be faster than the one the translator currently uses. However, when testing within Cita, it's just as slow...

EDIT: rather, depending on luck I guess, it can be as "fast" as 1s per request, but still can sometimes be as slow as the other endpoint.

Dominic-DallOsto commented 1 month ago

That's great, thanks a lot! And thanks for addressing the issues with the arXiv translator, doing it upstream in Zotero is definitely the right way.

A couple of little things I noticed:

If I right click an item, in the Cita menu it says "Get citations from Semantic" instead of "Get citations from Semantic Scholar" like it says in the More... menu
If I have an item that only has as ISBN, in the right click menu all the options for getting citations are still enabled, whereas in the More... menu they're all rightfully disabled

Otherwise this all looks good

thebluepotato commented 1 month ago

Got a little crazy and added OpenCitations capabilities again. However, within all the confusion, I need your input on whether we could/should expand the definition of PIDType to include all "IDs" we're now using and that the various indexers support searching for, or at least OpenAlex identifier and Semantic Scholar Corpus ID. In particular, it would streamline the code by using getPID everywhere

thebluepotato commented 1 month ago

If I have an item that only has as ISBN, in the right click menu all the options for getting citations are still enabled, whereas in the More... menu they're all rightfully disabled

For this, I'd like to improve the logic so it is only disabled when no supported identifiers are present. While CrossRef requires a DOI, the other indexers often can search with more identifiers.

Dominic-DallOsto commented 1 month ago

Got a little crazy and added OpenCitations capabilities again. However, within all the confusion, I need your input on whether we could/should expand the definition of PIDType to include all "IDs" we're now using and that the various indexers support searching for, or at least OpenAlex identifier and Semantic Scholar Corpus ID. In particular, it would streamline the code by using getPID everywhere

Yeah, I think that's great to abstract this out like you have.

For this, I'd like to improve the logic so it is only disabled when no supported identifiers are present. While CrossRef requires a DOI, the other indexers often can search with more identifiers.

Yeah, that makes sense. I guess how you've set it up you could just check whether IndexerBase.extractSupportedUID returns null? Maybe it'd be nice to have a specific function that does this check.

Dominic-DallOsto commented 1 month ago

Do you get the same styling problem as me with the PID rows now?

If I make it so all the identifiers are visible, it looks a bit weird but I guess OK

Also, do you think it makes sense to grey out the fetch icons for identifiers that can't be fetched? I think it is more intuitive than clicking the button and then finding out that it isn't supported?

Additionally, could fetching the OMID and OPENALEX ids give a progress popup similar to fetching QIDs? I found that this took a few seconds to run so I wasn't sure whether anything until the identifier finally appeared.

thebluepotato commented 1 month ago

Do you get the same styling problem as me with the PID rows now?

Yes I do, I guess we should also no longer uppercase them all. Should we hide PMID and PMCID from this view? They're supported for searching and all, but I don't think you can get citations from them, so there's no need to highlight them as much.

Also, do you think it makes sense to grey out the fetch icons for identifiers that can't be fetched? I think it is more intuitive than clicking the button and then finding out that it isn't supported?

That'll probably encourage us to further abstract the checking logic, good idea!

Additionally, could fetching the OMID and OPENALEX ids give a progress popup similar to fetching QIDs? I found that this took a few seconds to run so I wasn't sure whether anything until the identifier finally appeared.

In my testing it was nearly instant, but we sure can have a progress indicator.

I'll be away for the week so I won't be able to look at this PR much, feel free to tweak it to your liking if you want!

Dominic-DallOsto commented 1 month ago

I played around with things quickly so now they look like this

I'll be away for the week so I won't be able to look at this PR much, feel free to tweak it to your liking if you want!

No worries - thanks a lot for your hard work! I'll try to fully review the code by then and make a roadmap for what we need before merging

Edit: hiding the PMID and PMCID rows makes sense I think, yeah

And the progress messages work great, thanks

[x] localise PID row fetch button "Fetch" text

thebluepotato commented 1 month ago

TODOs:

[x] Implement DOI fetcher inspired by https://github.com/bwiernik/zotero-shortdoi
[x] ~Fetch DOIs from Datacite?~ There's no clear API to do so
[x] ~Auto-generate DOIs from arXiv?~ EDIT: Semantic Scholar does not seem to find documents by arXiv-DOI. Therefore, we should just be happy with the arXiv ID and instead, give the "arXiv" PIDType priority over DOI for Semantic Scholar
[x] Adapt rate-limiting to the specific indexer ~and provide a meaningful error message when request was rate-limited~
[x] Implement Semantic Scholar's Corpus ID as PIDType
[x] When fetching PIDs such as OMID, also add other PIDs if contained in the response (such as DOI)

thebluepotato commented 1 month ago

I was thinking whether a better interface for users now that there are so many PIDs would be to only show the PIDs an item actually has, along with a "+" button that allows you to add a new PID row of a particular type so that you can then specify the PID.

Does that make sense?

It does! As I mentioned, my use case of Cita is quite specific (getting references and automatically importing them as a substitute for Ethereal Reference), so you know best what your users want. For most documents, having a DOI is more than enough anyway, the other identifiers are more for extra reassurance that this document exists on that indexer.

HughP commented 1 month ago

here are two things to note and keep distinct.

arXiv IDs are specific to that repository/archive and are not guaranteed to follow a digital objection if it moves URLs. DOIs are not specific to a digital repository or platform and when the home of a digital object moves (new URL) then it also can move to point to the new location.
the edition of a digital work on arXiv is often a pre-print and may include more than one edition. The final published version is often the version of record and is many times hosted on a different publishing platform, and the DOI points to this "other" edition. Therefore, technically (and very importantly) DOIs and arXiv IDs do not necessarily point to the same digital object, rather two separate, but often related, objects. When citing and referencing works in bibliographic contexts it is important to point to the right object that was actually referenced. Please keep the metadata from these types of objects distinct. It is a disservice to all (both in linked data and in Zotero) to merge these entities.

Kind Regards

thebluepotato commented 1 month ago

here are two things to note and keep distinct.

arXiv IDs are specific to that repository/archive and are not guaranteed to follow a digital objection if it moves URLs. DOIs are not specific to a digital repository or platform and when the home of a digital object moves (new URL) then it also can move to point to the new location.

the edition of a digital work on arXiv is often a pre-print and may include more than one edition. The final published version is often the version of record and is many times hosted on a different publishing platform, and the DOI points to this "other" edition. Therefore, technically (and very importantly) DOIs and arXiv IDs do not necessarily point to the same digital object, rather two separate, but often related, objects. When citing and referencing works in bibliographic contexts it is important to point to the right object that was actually referenced. Please keep the metadata from these types of objects distinct. It is a disservice to all (both in linked data and in Zotero) to merge these entities.

Kind Regards

Hi! Both very valid points, but I'm not sure what you're referring to. Do you mean this comment?

give the "arXiv" PIDType priority over DOI for Semantic Scholar

In that case, we're definitely not merging these identifiers at all, it's merely a concern that if (for some reason) the item is saved in Zotero with the "arXiv" DOI, which is basically just a DOI-ified arXiv ID, Semantic Scholar will not find references for it. My fix only addressed that, only for Semantic Scholar. I'll update the logic so if we have a DOI, we use that for Semantic Scholar as well, unless it's the "arXiv" DOI. As far as I'm aware, Semantic Scholar actually does not treat the preprint and published version differently, so you'd get the same references regardless of the identifier used for lookup. Note that the search for references is necessarily dependent on what these indexers have stored, and often does not match the ground truth of the PDF.

Dominic-DallOsto commented 1 month ago

When I fetch the QID for an item I get 2 progress windows appearing

Dominic-DallOsto commented 1 month ago

I was thinking whether a better interface for users now that there are so many PIDs would be to only show the PIDs an item actually has, along with a "+" button that allows you to add a new PID row of a particular type so that you can then specify the PID. Does that make sense?

It does! As I mentioned, my use case of Cita is quite specific (getting references and automatically importing them as a substitute for Ethereal Reference), so you know best what your users want. For most documents, having a DOI is more than enough anyway, the other identifiers are more for extra reassurance that this document exists on that indexer.

Cool, that sounds good!

thebluepotato commented 1 month ago

I created a PID type that (for now) handles cleanup and URL generation in a more centralized fashion. Switches are mostly unavoidable if we want some form of abstraction and avoid having only concrete types.

thebluepotato commented 1 month ago

Now that we can auto-import from OpenAlex as well, it might me interesting for the user to be able to "refresh" their citations based on a manually found identifier. Example: when you add citations from Crossref, we consider them reliable enough to also add the ones that don't have a DOI or ISBN. However, many of those probably exist on S2 or OpenAlex, for instance, and might benefit from the additional metadata. Since we can save a PID for the citations, it would be nice to also allow "refreshing" the entry's metadata as well.

Dominic-DallOsto commented 1 month ago

I created a PID type that (for now) handles cleanup and URL generation in a more centralized fashion. Switches are mostly unavoidable if we want some form of abstraction and avoid having only concrete types.

Thanks, that looks nice! What do you think about having eg. DOI as a class that extends PID that implements its own fetching/cleaning and so on?

Now that we can auto-import from OpenAlex as well, it might me interesting for the user to be able to "refresh" their citations based on a manually found identifier. Example: when you add citations from Crossref, we consider them reliable enough to also add the ones that don't have a DOI or ISBN. However, many of those probably exist on S2 or OpenAlex, for instance, and might benefit from the additional metadata. Since we can save a PID for the citations, it would be nice to also allow "refreshing" the entry's metadata as well.

Do you mean that we parse items from crossref that don't have an identifier, but we want to be able to add identifiers to them if we for example get citations from openalex and it has an item with the same title/year? So more in line with detecting and merging duplicate citations #37 #25 ? Or did I misunderstand?

Dominic-DallOsto commented 1 month ago

I also just hid PID rows (except for DOI and QID) by default, and added a button to add them back if you want.

The code is a bit horrible but hopefully the functionality is there. Also, if you add a PID row but don't put anything in it, then change to another item - that PID row will be hidden again. I think that's OK though.

Let me know if you have any suggestions / encounter any bugs!

thebluepotato commented 1 month ago

Thanks, that looks nice! What do you think about having eg. DOI as a class that extends PID that implements its own fetching/cleaning and so on?

It's tempting, but then we'd have to strongly type each possible return of functions such as ˋgetPid()`, which might lead to a lot of boilerplate. However, if we can get away with an interface or an abstract class, why not!

Do you mean that we parse items from crossref that don't have an identifier, but we want to be able to add identifiers to them if we for example get citations from openalex and it has an item with the same title/year? So more in line with detecting and merging duplicate citations #37 #25 ? Or did I misunderstand?

It might be related to duplicates but yes, it's for the situation where Crossref has a reference without an identifier. These references often barely have a title or author and while this PR allows for them to be imported from the JSON, they're often low quality. However, if the item in question exists on some other indexer (typically S2 or OA) that has more metadata, we can now load the metadata for that citation, and in turn the OA id allows us to auto-import it as well. For now refreshing and saving works, but I can't get the editor dialog to update the fields after fetching the metadata.

I'm also working on fixing the unimplemented "add as citation to item" functions.

thebluepotato commented 1 month ago

Wow, thank you so much for the detailed list! Mayn of those things I'd notice as well, but that way it's easier to track what's left. I've added some fixes to the design in the commit handling drag-and-drop. Here's some initial feedback on your feedback:

[x] hovering on a reference that spans multiple lines will cause it to changing wrapping which looks strange

Will change the CSS so the buttons appear on top of the citation

[x] the "Go to linked Zotero item" button appears without hovering over a citation, but can't be clicked because it moves when you hover

That was initially unintentional, but I liked knowing which citation were part of my collection already. In the previous design, the Zotero button was "lit up". If we keep it visible, it should indeed stay put when hovering.

[x] "Unlink Zotero item" button looks dark with dark theme

Will test a bit more, but I'll simplify the CSS overall too to avoid duplication. For now it's a big copy-paste from Zotero's own CSS tweaked with ChatGPT

[x] Fetching QID for item shows 2 progress bars

[x] I don't think we can use the plural feature of Fluent yet because of how translatewiki is setup (eg. 1 Citation/2 Citations)

Yes, and for now the citation number doesn't update dynamically. To be fixed.

[ ] PID row button tooltips don't show up in the citations editor

[x] citations editor is too narrow for all 3 buttons by default

[x] after opening the citations editor the buttons continue to be shown even when it isn't being hovered

[x] If you fetch an item's PID the buttons continue to be shown when it isn't hovered

There's a (still slightly buggy) hover reset after drag and drop, will implement something similar for citation editor

[x] The fetch button shows for PIDs that are already defined

That's intentional, since fetch also fetches other PIDs from the provider if they're available. So if the user adds an OpenAlex id manually, we can still fetch the rest. It's definitely counterintuitive though, so maybe it should auto-fetch/validate once the field loses focus and that way we can hide the fetch button?

[x] Add a preference for whether to show citations on one or more lines

[x] Maybe add a separate context menu for exporting

Sure!

[ ] "Export citations to CROCI" doesn't seem to do anything

CROCI is now OpenCitation Meta Index, so yes we can implement syncing back to OC

[x] "Add as citation to" should also show up in the Citations Box context menu

IMO, the option here should rather be "Add Zotero Item as citation here" in the Add menu. While it makes sense to select items in the main window and add them as citations to other items from there, within an Item, it would make more sense to to go the other way

thebluepotato commented 1 month ago

One rather important TODO before I forget:

ReactJS tries to print errors and warnings with console, which is undefined in Zotero, and therefore produces "reference is not defined" errors. I added an ErrorBoundary around the citation box to catch those, but it doesn't help because these errors and warnings still go through a console call. Therefore:

[x] Try to inject/redefine console to redirect ReactJS logs to Zotero

EDIT: I ended up redefining <toolbarbutton> to <div class="toolbarbutton"> because it was too much of a hack to get React to recognize a now obsolete XUL element that Zotero still uses. That was the main cause of these errors. I set up an ErrorBoundary and a logger for the rest.

Dominic-DallOsto commented 4 weeks ago

Amazing, thanks! This is looking really good now!

A couple of small things:

For this item, the plus button shows to add a new PID but nothing happens if I click it

If I have a book with a DOI (eg. 10.1002/047174882X) and I delete this DOI from the PID row, the PID row disappears but it doesn't show up as an option to add it back using the + button

Dragging and dropping behaves a bit strangely

Screencast from 2024-10-23 23-31-10.webm

Dominic-DallOsto commented 4 weeks ago

That's intentional, since fetch also fetches other PIDs from the provider if they're available. So if the user adds an OpenAlex id manually, we can still fetch the rest. It's definitely counterintuitive though, so maybe it should auto-fetch/validate once the field loses focus and that way we can hide the fetch button?

Hmmm, conceptually it seems like the fetching should be linked to the PIDs we want to have rather than the PIDs we do have? So if I want to add an OMID I click the + button, add OMID, then click fetch, and this could use the OpenAlex ID the item has (or fail if the item doesn't have any suitable identifiers). But since I already have the OpenAlex ID it doesn't make sense to fetch it anymore (even though fetching it can also get us other identifiers). Does that make sense do you think?

thebluepotato commented 4 weeks ago

Amazing, thanks! This is looking really good now!

A couple of small things: [...]

I was unable to reproduce any of the three issues. However, from my experience constantly relaunching the extension on save with VSCode, some code relying on events listeners sometimes (always?) remains between relaunches, leading to odd behavior. Can you try again from a freshly started Zotero instance?

EDIT: for the drag and drop issues, could it be that your item doesn't save in-between? I'm not sure its the same on Windows, but on my Mac the (source) item "blinks" once shortly after I drop a citation. Maybe React re-renders the pane before the change was saved to storage? Might be a hook issue.

Hmmm, conceptually it seems like the fetching should be linked to the PIDs we want to have rather than the PIDs we do have? So if I want to add an OMID I click the + button, add OMID, then click fetch, and this could use the OpenAlex ID the item has (or fail if the item doesn't have any suitable identifiers). But since I already have the OpenAlex ID it doesn't make sense to fetch it anymore (even though fetching it can also get us other identifiers). Does that make sense do you think?

My meaning was unclear, sorry. Let's go with an example: https://www.semanticscholar.org/paper/CHIEF-JUSTICE-ROBOTS-Volokh/ef0c3b327a0c82507b764b3d338340ee60d8440d. This (very good) article has no DOI. If you were to add it to Zotero, it would not receive any PID. Had you added it from the paywalled original source, you'd be unable to fetch any references. If you'd then find it on S2, you'd add the CorpusID. Yet, you can't fetch any other PIDs based on a CorpusID. However, the S2 indexer itself of course can fetch items based on the CorpusID and also returns all known PIDs for that item (in our case, a MAG), which then opens the door to fetching references from OpenAlex (there aren't any, but you get my point). TL/DR; fetching a PID from an indexer that returns multiple different PID should set all of them, unless they're already set. I believe this is the most convenient for the user (to have more identifiers, potentially opening up more indexers) and avoids having to fetch from multiple indexers. In the narrow use case I described (items without any DOI that you find on S2), there is the side effect that one can "set and fetch others". But you are entirely right that this is not at all the behavior expect from a button called "fetch". I'll think on it and come back with a better proposal.

Dominic-DallOsto commented 4 weeks ago

Hmm yeah, the first 2 of these issues seem to be affected by restarting Zotero. These are all from a fresh start. It works once but not the second time:

Screencast from 2024-10-24 23-33-48.webm

The same here, also from a fresh start. Maybe the logic I used to check whether there are any hidden rows to be added isn't correct?

Screencast from 2024-10-24 23-37-40.webm

Also from a fresh start - you can see it kind of revert the first drag and drop so I switched items to check. But initiating the first drag already caused the other items to reorder somehow?

Screencast from 2024-10-24 23-41-35.webm

This is on Ubuntu

Dominic-DallOsto commented 4 weeks ago

My meaning was unclear, sorry. Let's go with an example: https://www.semanticscholar.org/paper/CHIEF-JUSTICE-ROBOTS-Volokh/ef0c3b327a0c82507b764b3d338340ee60d8440d. This (very good) article has no DOI. If you were to add it to Zotero, it would not receive any PID. Had you added it from the paywalled original source, you'd be unable to fetch any references. If you'd then find it on S2, you'd add the CorpusID. Yet, you can't fetch any other PIDs based on a CorpusID. However, the S2 indexer itself of course can fetch items based on the CorpusID and also returns all known PIDs for that item (in our case, a MAG), which then opens the door to fetching references from OpenAlex (there aren't any, but you get my point).

That makes sense, I just think that because the item already has a CorpusID, the button to fetch the CorpusID shouldn't be shown anymore. But the user could then add the row for the OpenAlex ID and click fetch (because it's possible to fetch an OpenAlex ID by its CorpusID)? The only thing I find a bit counterintuitive at the moment is that the fetch button is still shown for a PID that is already there, because it actually lets you fetch other PIDs. I hope that makes sense / I'm not confused about something.

Maybe also a "fetch all PIDs" option in the menu could be interesting (maybe also to run this automatically when an item is created), but if it's not easy to implement I don't think it's a big deal.

thebluepotato commented 3 weeks ago

Good! FYI I'm currently refactoring the lookup logic again when working with multiple items. Fetching hundreds of items at once (references for an entire collection for instance) is currently slow and error prone. I'm simplifying the code and try to batch requests as much as possible to avoid overloading the servers.

thebluepotato commented 3 weeks ago

Ok, I fixed the problems with the pid row + button! The drag and drop is still a bit buggy though

For drag and drop, I'm planning to use react-dnd probably. The issues probably come from the fact that I basically used Zotero's code and tried to adapt it to React. I just want the style to match as closely as possible.

thebluepotato commented 3 weeks ago

The latest commits allow for massive queries, such as getting all citations for tens or hundreds of items from a single indexer in one swoop. The reporting and fallback mechanisms are incomplete yet, but all in all, it should work. Most requests are batch requests, allowing for much faster processing, especially if fetching from OpenAlex or Semantic Scholar.

thebluepotato commented 3 weeks ago

Commit https://github.com/diegodlh/zotero-cita/pull/300/commits/2024c53db1a7bf765c301502bf180ea85ac5ee94 move the citation rows to their own component, paving the way for possibly using react-dnd, but initial testing was quite bad. The current intermediary solution might still address the issues you noted and haven't been able to reproduce.

Dominic-DallOsto commented 3 weeks ago

Amazing, that works a lot nicer now! The only thing is that if I drop an item when it isn't in a slot, it disappears. (ignore that the mouse cursor doesn't look like it's moving - that's just a bug in the screen recording)

Screencast from 2024-10-29 23-32-42.webm

thebluepotato commented 3 weeks ago

Amazing, that works a lot nicer now! The only thing is that if I drop an item when it isn't in a slot, it disappears. (ignore that the mouse cursor doesn't look like it's moving - that's just a bug in the screen recording)

Screencast.from.2024-10-29.23-32-42.webm

What I was able to reproduce was when dropping the item in its original place. The function handling drop events returned too early when there was no change in position, before removing the class that "hides" the row when moving. It should be fixed now.

The citation pane should now also correctly reset when reloading the extension, which might also address some issues that were popping up. I abandoned the idea of using react-dnd since you've confirmed that it works better now and more testing still showed that it was quite slow. Maybe this was due to lack of skill, but if what we have works, let's keep it that way.

Dominic-DallOsto commented 3 weeks ago

Thanks, that works really nicely now. But I found that if I drag the citations to the left then they still disappear:

Screencast from 2024-10-31 00-27-55.webm

Hopefully this is the last strange way I'll find to break it hahaha

thebluepotato commented 3 weeks ago

Thanks, that works really nicely now. But I found that if I drag the citations to the left then they still disappear:

Screencast.from.2024-10-31.00-27-55.webm

Hopefully this is the last strange way I'll find to break it hahaha

Should be fixed now, I had previously merged the onDrop and onDragEndhandlers, but onDrop is only fired when dropping on a valid target. Now, onDragEnd handles un-hiding the row.

thebluepotato commented 3 weeks ago

There's still a pesky bug I'm trying to investigate: when fetching citations for an entire library (or a few dozen items at least), the auto-linking callback at the end of the process sometimes, but not always, causes some, but not all, items to have two citation notes instead of one...

thebluepotato commented 3 weeks ago

With the latest improvements in batch requests, I've been able to fetch the citations for around 1000 items (approx 20'000 citations) in around 60s. There's still room for improvement. The fetching citations part took 20s in my test because semantic scholar does not like concurrent requests and the batch size is limited. Parsing those references took 40 seconds, with all of them (a mix of DOIs, PMIDs, arXiv IDs, and MAG) parsed through OpenAlex since alternatives are usually sequential. For example, 200 arXiv IDs took nearly a minute to complete before I found a way to parse them through OpenAlex, where they took only a couple of seconds. We can still tweak the limiter to get the most out of OpenAlex, and tweak some operations. Extracting the identifiers for lookup took 10s, when it's mostly just a map over an array...

So you don't have to sift through the code for the logic of choosing OpenAlex: my gut feeling is that while OpenAlex might not return perfect results in terms of quality, they're good enough to be stored as citations, especially compared to how slow good-quality alternatives (such as DOI content negotiation) would be. The "high-quality" option is preferred when importing those citations into the library.

I've ended up with new issues though: some items have lots of citations, so many that Zotero refuses to sync them because the note is too long. Maybe there's a third option for storage that would be better than notes or the extra field?

thebluepotato commented 3 weeks ago

To clarify the steps and timings:

For all selected items, fetch one identifier compatible with the selected indexer. Usually this is the indexer's own ID (like the OpenAlex ID when we've chosen to fetch citations from OpenAlex). This is very short.
Query citations for all selected identifiers, in batches. The speed depends on the indexer and rate limits. The times I gave above were with Semantic Scholar as the indexer, which does not allow concurrent requests (but has very generous batch sizes). This part took more than 20 seconds, and we can hardly speed this up, given the selected indexer's constraint. Note that this is with an API key for S2. You're still limited to 1 request per second, but without a key you share 1000 requests per second with the entire world...
For all citations fetched, parse look them up. There we also select the "best" identifier, which due to unoptimised code took 10s above, now only a second or less.
The lookup process does not depend on the selected indexer, but on the ID. In theory, we could use Zotero's own lookup, but it processes each indexer sequentially which would be horrendously slow. I've implemented batch fetching for Crossref (but it's limited to DOIs), not yet for Semantic Scholar (but I don't know how good the Zotero translator is for it), so all identifiers except ISBNs end up going through OpenAlex in batches and we send the resulting JSON through the corresponding translator. That part took 30-40s.

So there was already an improvement of 10s, and I'm sure there's some inefficient code hiding around.

As to sending the lookup requests early, it might be doable, but we ask for confirmation based on the total number of citations fetched. Like, do you really want to add 20k citations at once? I'm concerned tracking everything might be a concern too. I'll keep the structure for now and optimise what I can, and then I'll do a comparison between the various indexers. I have a hunch that OpenAlex is definitely the fastest overall simply because we can do requests in parallel (and they're quite fast anyway).

Dominic-DallOsto commented 3 weeks ago

Ok cool, that all makes sense!

And yeah like you say, I don't think people are likely to be adding 20K citations very often, so optimising this extensively probably isn't super important compared to having the process of adding citations for one or a few items being responsive. Worst case, adding information to the progress window like "Fetching citations for item X / Y" should be fine, then it's not really a problem to wait 60 s if you are sure that the operation is still running.

thebluepotato commented 3 weeks ago

Just as an example for an item that has way too many references: https://doi.org/10.1016/j.inffus.2019.12.012 has over 500 references on OpenAlex so we end up with nearly 650,000 characters in the note...

thebluepotato commented 3 weeks ago

A new TODO:

[ ] Detect shortDOIs because those really don't play nice with our indexers

thebluepotato commented 3 weeks ago

A little round of tests I ran:

Indexer	Total items	With ID	Found	With refs	IDs to parse	Fetch [s]	Per ID [ms]	Parse [s]	Per ID [ms]	Extra
Crossref	1,095	617	251	179	3,474	10.0	16.2	14.1	4.06	Also parsed 2216 "manually"
Semantic Scholar	1,095	646	585	451	12,942	53.0	82.0	47.5	3.67	API responses were unusually slow
OpenAlex	1,095	645	641	403	13,924	0.9	1.4	57.1	4.10	Blazing fast and second-highest number of matches
Open Citations	1,095	617	265	265	7,512	123.0	199.4	23.8	3.17	Slow and limited

Fetch is the first phase (getting citations from the indexer) and parse is the second phase (using OpenAlex for 99% of the identifiers in batches followed by Zotero.Translate).

thebluepotato commented 2 weeks ago

The latest commit further improves speed by no longer querying Crossref for items with missing data on OpenAlex. It turns out that that data was missing from Crossref anyway in most cases. The parsing stage now takes only around 25 seconds for 14k references! This also means, in turn, that fetching references for a single item is nearly instant (in terms of parsing, fetching is still somehow sloed down by the choice of indexer).

thebluepotato commented 2 weeks ago

Since we can now look up PIDs en masse, maybe it would make sense to group the "Get citations" and "Get identifiers" into separate submenus?

thebluepotato commented 2 weeks ago

Ahh nice OK. Maybe the quickest solution would just be to detect the maximum note length that works with syncing and split up the data across multipl enotes since each is small enough. Is it possible to sync multiple notes per item though?

I assume you can because I have items with notes (from arXiv for instance) that now have a separate Citations note.

Like under the menu right click item?

Yes, it's currently unlocalised. Within the Cita pane, you can still add a PID and then search it (new icon, same function, haven't fixed the appearance conditions yet).

Dominic-DallOsto commented 2 weeks ago

Yeah, I think we could definitely have submenus under the "Cita" right click menu to make things easier to navigate

diegodlh / zotero-cita