Support automatic citation extraction from PDF attachments

diegodlh commented 3 years ago

Include Grobid and Scholarcy Reference Extraction API. See corresponding section in grant proposal.

diegodlh commented 2 years ago

An additional gateway to a locally running instance of Excite may be of interest to @cboulanger.

cboulanger commented 2 years ago

Running local servers that provide PDF extraction and citation matching has become much easier with Docker. See, for example, the instructions for running Grobid and reference-matching servers: https://github.com/kermitt2/biblio-glutton#running-with-docker

cboulanger commented 2 years ago

It might be enough to define a registration API that backend connectors can hook into via separate plugins: one for PDF extraction (e.g., given a PDF, return Zotero JSON), and one for matching (e.g., given an array of Zotero JSON items, return an array of arrays of Zotero JSON items containing the best matches). I would leave the implementation completely outside of CITA.

cboulanger commented 2 years ago

Continuing this thought, what about this: why not implement an experimental internal API in https://github.com/diegodlh/zotero-cita/blob/master/src/extract.js without a GUI yet (so that no false expectations are created)? This API is then made accessible in the "Run JavaScript" console (I have just asked about how to "require()" plugin classes there) so one can run tests with it. Then we create a plugin with a reference mock implementation which simply returns static content. On the basis of this, extraction service add-ons can be developed without CITA core having to do anything. Once at least one such add-on has matured and delivers reliable results, the internal API can be frozen (and maybe versioned) and made accessible via the GUI. I can envision that this will enable the implementation of a GROBID service based on its Docker image pretty quickly, and I would be very much interested in providing an "EXcite" add-on.

Here's a first idea for a minimalistic API (type-scripty pseudo-code):


interface Item {
    // Zotero item data
}

interface AbstractConnector {
    readonly id: string // unique identifier
    readonly label: string // label to be translated, I don't know how this works in Zotero
    connect(): Promise<void> // throws if no connection can be established
}

interface ExtractionConnector extends AbstractConnector {
    extract(pdfFile: File): Promise<Item[]> 
}

interface MatchConnector extends AbstractConnector {
    match(item: Item): Promise<Item[]>
}

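// a class rather than an interface, so that the two registries below can extend it
class AbstractRegistry {
    connectors: AbstractConnector[] = []
    register(connector: AbstractConnector): void {  // stores the connector
        this.connectors.push(connector)
    }
}

class ExtractorRegistry extends AbstractRegistry {}
class MatcherRegistry extends AbstractRegistry {}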

export default class Extraction{
    /**
     * registers connector depending on its type to ExtractorRegistry or MatcherRegistry
     */
    static register(connector: AbstractConnector): void

    /**
     * Extracts references from a given attachment item, using the given extractor. Will use registered matcher
     * connectors to look up unique ids (DOI, ISBN, etc.) and complete/correct metadata for the extracted
     * references 
     * @param extractorId the id of the extractor, from some UI setting 
     *      or manually passed to the method in the console
     * @param item a Zotero.Item object of type attachment from which a 
     *      PDF file can be retrieved (either stored locally or by way of download)
     */
    static extract(extractorId: string, item: Zotero.Item): Promise<Item[]> 

    /**
     * @param matcherId the id of the matcher, from some UI setting 
     * or manually passed to the method in the console
     * @param item Zotero JSON item from reference extraction (or somewhere else)
     */
    static match(matcherId: string, item: Item): Promise<Item[]> 

}

It might make sense to separate matching from extraction, since matching could also be called outside of extraction with items that already exist in Zotero.
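
To make the idea above more concrete, here is a minimal runnable sketch of a registry plus a mock extractor that returns static content. Everything in it is an assumption made for illustration: `ConnectorRegistry`, `MockExtractor`, `extractWith`, and the fields of `Item` do not exist in Cita, and a `Uint8Array` stands in for the PDF file so that the example runs outside Zotero.

```typescript
// Illustrative sketch only; none of these names exist in Cita.
interface Item {
  title: string; // stand-in for Zotero item JSON
  DOI?: string;
}

interface Connector {
  readonly id: string;
  readonly label: string;
  connect(): Promise<void>; // rejects if no connection can be established
}

interface ExtractionConnector extends Connector {
  extract(pdf: Uint8Array): Promise<Item[]>;
}

// Generic registry that stores connectors by their id.
class ConnectorRegistry<T extends Connector> {
  private readonly connectors = new Map<string, T>();

  register(connector: T): void {
    if (this.connectors.has(connector.id)) {
      throw new Error(`Connector "${connector.id}" is already registered`);
    }
    this.connectors.set(connector.id, connector);
  }

  get(id: string): T {
    const connector = this.connectors.get(id);
    if (!connector) throw new Error(`Unknown connector "${id}"`);
    return connector;
  }
}

// Mock implementation: ignores its input and returns static content.
class MockExtractor implements ExtractionConnector {
  readonly id = "mock";
  readonly label = "Mock extractor";

  async connect(): Promise<void> {
    // nothing to connect to
  }

  async extract(_pdf: Uint8Array): Promise<Item[]> {
    return [{ title: "A static reference", DOI: "10.1000/example" }]; // placeholder data
  }
}

const extractors = new ConnectorRegistry<ExtractionConnector>();
extractors.register(new MockExtractor());

// Looks up an extractor by id, connects, and runs the extraction.
async function extractWith(id: string, pdf: Uint8Array): Promise<Item[]> {
  const connector = extractors.get(id);
  await connector.connect();
  return connector.extract(pdf);
}
```

A matcher registry could reuse the same `ConnectorRegistry` class with a different connector type, which would keep matching and extraction separate.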

Any thoughts?

cboulanger commented 2 years ago

An implementation using https://ref.scholarcy.com/api/ could be done pretty quickly, I think.

Dominic-DallOsto commented 1 year ago

A rough outline of the required steps to get this up and running. The main issue is finding a service to do the reference extraction from full-text documents. After that, wiring into the Cita workflow should be pretty straightforward, maybe with a validation step.

- [ ] Test different citation extraction services (e.g., Grobid, Scholarcy, but I guess there are many more) - how well do they perform on a selection of PDFs?
  - [ ] Easiest for us would be an online API (like Crossref or Wikidata) - is this available?
  - [ ] Otherwise, we could potentially set up a server using Wikimedia's infrastructure that runs the service of choice
  - [ ] As a last resort, we could run a service locally, but that would require more setup for users. This could even be a separate add-on.
- [ ] When selected, get an attachment item and send it to this service (do we only support PDFs? Are other formats necessary?)
- [ ] Parse the returned references
- [ ] Potentially have a validation step - showing PDF text and parsed output for each reference
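
The last three steps above can be sketched as a small pipeline. This is only a rough illustration under stated assumptions: `Attachment`, `ParsedReference`, `ExtractionService`, `Validator`, and `extractReferences` are hypothetical names, the service and the validation step are passed in as functions, and only PDFs are handled.

```typescript
// Illustrative sketch; all names and shapes here are assumptions.
interface Attachment {
  contentType: string;
  read(): Promise<Uint8Array>;
}

interface ParsedReference {
  rawText: string; // the reference as it appears in the PDF
  fields: { title?: string; year?: string }; // the parsed output
}

type ExtractionService = (pdf: Uint8Array) => Promise<ParsedReference[]>;
type Validator = (ref: ParsedReference) => Promise<boolean>;

async function extractReferences(
  attachment: Attachment,
  service: ExtractionService,
  validate: Validator,
): Promise<ParsedReference[]> {
  // Step 1: get the attachment; only PDFs are supported in this sketch.
  if (attachment.contentType !== "application/pdf") {
    throw new Error(`Unsupported attachment type: ${attachment.contentType}`);
  }
  // Step 2: send the file to the service and parse what comes back.
  const parsed = await service(await attachment.read());
  // Step 3: validation; keep only the references that were accepted.
  const accepted: ParsedReference[] = [];
  for (const ref of parsed) {
    if (await validate(ref)) accepted.push(ref);
  }
  return accepted;
}
```

In a real UI the `validate` callback would show `rawText` next to `fields` and wait for the user's decision; here it is just a function so the flow can be tested without any dialog.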

HughP commented 1 year ago

In this workflow, an initial step might be to check whether the document has already had its citations extracted. These citations may be stored locally or in a larger graph somewhere, like Crossref.
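
A minimal sketch of that initial check, assuming a simple local store keyed by some document identifier. `CitationStore`, `Extractor`, and `citationsFor` are hypothetical names, and a lookup in an external graph such as Crossref is not shown.

```typescript
// Illustrative sketch; the store and extractor shapes are assumptions.
interface CitationStore {
  get(documentKey: string): string[] | undefined;
  set(documentKey: string, citations: string[]): void;
}

type Extractor = (documentKey: string) => Promise<string[]>;

async function citationsFor(
  documentKey: string,
  store: CitationStore,
  extract: Extractor,
): Promise<{ citations: string[]; fromStore: boolean }> {
  // Check first whether citations were already extracted for this document.
  const known = store.get(documentKey);
  if (known !== undefined) {
    return { citations: known, fromStore: true };
  }
  // Otherwise run the extraction and remember the result.
  const citations = await extract(documentKey);
  store.set(documentKey, citations);
  return { citations, fromStore: false };
}
```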

-- All the best, -Hugh

cboulanger commented 1 year ago

I have been working on an extraction workflow based on https://github.com/inukshuk/anystyle, which is a lightweight alternative to GROBID.[1]

My suggestion would be to define a minimal REST API that can be used with either a hosted endpoint or a port on localhost, so that people running their own extraction servers with custom models can integrate them into CITA.
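
As a rough sketch of what such a client could look like: the same class works against a hosted endpoint or a port on localhost, because only the base URL changes. The `/extract` path, the JSON response shape, and the names `Reference`, `PostBinary`, and `ExtractionClient` are all assumptions for illustration, not an agreed API. The HTTP call is passed in as a function so the example runs without a server.

```typescript
// Hypothetical contract: POST <baseUrl>/extract with the PDF bytes;
// the response body is a JSON array of reference objects.
interface Reference {
  title?: string;
  author?: string[];
  date?: string;
}

// Transport function supplied by the caller (e.g. a thin wrapper around fetch).
type PostBinary = (url: string, body: Uint8Array) => Promise<string>;

class ExtractionClient {
  private readonly baseUrl: string;
  private readonly post: PostBinary;

  constructor(baseUrl: string, post: PostBinary) {
    this.baseUrl = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
    this.post = post;
  }

  async extract(pdf: Uint8Array): Promise<Reference[]> {
    const text = await this.post(`${this.baseUrl}/extract`, pdf);
    const data: unknown = JSON.parse(text);
    if (!Array.isArray(data)) {
      throw new Error("Expected a JSON array of references");
    }
    return data as Reference[];
  }
}
```

With this shape, switching from a hosted service to a local Docker container would only mean constructing the client with a different base URL.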

[1] There is a general problem with extracting citations from humanities scholarship that has no bibliography and puts all citation information in the footnotes. All of the existing solutions perform very badly on such literature, but I have had promising results with AnyStyle, using a dataset of annotated documents to train a custom model. GROBID requires more complex training material because of the importance it places on document layout.

Dominic-DallOsto commented 1 year ago

Ahh nice, and is it lightweight enough to run locally in a Zotero add-on?

Yeah, I guess we could define some sort of minimal API that an extraction service should provide (i.e., given a PDF, return the list of references in BibTeX or some other format). Then, ideally, some sort of discovery process where Cita can find a list of available/running extraction services would improve the experience.
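
One simple way such a discovery process might work is to probe a list of candidate base URLs and keep the ones that answer. This is only a sketch: `Ping` and `discoverServices` are hypothetical names, and how a service signals that it is alive (a health endpoint, a version route, etc.) is left to the `ping` function supplied by the caller.

```typescript
// Illustrative sketch; how a service is pinged is an open question.
type Ping = (baseUrl: string) => Promise<boolean>;

// Probes all candidates in parallel and returns the reachable ones,
// in the same order as they were given.
async function discoverServices(
  candidates: string[],
  ping: Ping,
): Promise<string[]> {
  const results = await Promise.all(
    candidates.map(async (url) => {
      try {
        return (await ping(url)) ? url : null;
      } catch {
        return null; // an unreachable service counts as unavailable
      }
    }),
  );
  return results.filter((url): url is string => url !== null);
}
```

The candidate list could come from a preference (user-configured servers) plus a few well-known localhost ports.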

I guess it would be nice if we could have a lightweight extraction service living either in Cita itself or a Zotero addon for ease of installation / getting things up and running? Then users who desire can run more advanced citation extraction services and those will still play nice with Cita?

cboulanger commented 1 year ago

AnyStyle is written in Ruby and depends on a C extension (which makes it VERY fast), so unless we can transpile both of those to JavaScript, I fear we are out of luck as far as a Zotero plugin is concerned. However, it is really trivial (at least once you have learned Ruby, like I had to) to expose the desired functionality in a Docker container, which can be deployed very easily locally or on a server.

retorquere commented 1 year ago

Technically it should be possible to compile the C to WebAssembly and call that, but I've found this kind of thing non-trivial.

cboulanger commented 1 year ago

It would require compiling https://wapiti.limsi.fr/ to JavaScript (a library that is unfortunately no longer developed, but which performs really well compared to a Python CRF implementation that I had been working with before), plus translating the Ruby bridge https://github.com/inukshuk/wapiti-ruby and the citation extraction library https://github.com/inukshuk/anystyle from Ruby to JavaScript: quite an endeavour. What is really nice about AnyStyle is that it is very well and cleanly written (even though in-code documentation is largely missing), so a translation into JavaScript should be quite straightforward.

cboulanger commented 1 year ago

Pinging @inukshuk - maybe he has an opinion on this.

cboulanger commented 1 year ago

Just saw this: https://www.ruby2js.com/

Dominic-DallOsto commented 1 year ago

Ok, that seems like an endeavour, but feasible at least.

It'd be nice to have at least one extraction service run out of the box in Zotero (for the average user who doesn't want to set up Docker and so on), whether that be something we get to run locally in JS or something we can host on a server somewhere.

I guess a PDF->reference list service could be of interest to Wikipedia/Wikidata people more generally, so maybe we could organise some way to host it if that's the best option.

retorquere commented 1 year ago

With Heroku free gone, I don't readily know a service where you can host something like this with a controlled cost.

Transpiling between language idioms frequently yields convoluted results for all but the most trivial examples. It's good for cases where the source stays Ruby, but for migration, my experiences weren't great.

lightgivener commented 1 year ago

Hey all, I am a humanities researcher / student who is using Cita and have been in contact with Diego before. I do know a few people in lead university library IT and at this point might just reach out to them with something specific like citation extraction. I am also presenting at the International Art Libraries conference in the Fall and will reference Cita and discuss citation tracing for earlier 19th and 20th century sources, which are often digitized but where citation extraction is even harder due to script and other variations. Some library heads and IT people will be there. Just wanted to let you know: citation extraction from PDFs (or scans) is specific and important enough to possibly get some experienced people interested, if you are interested in that.

Cheers

Dominic-DallOsto commented 1 year ago

Thanks a lot! Yeah, you're right - citation extraction is a problem in and of itself, not only for Cita. If you know more people working on this problem (who are nice enough to host a free service online 😛) that'd be great. Or in general, any efforts in this direction to produce a solution are something we should align with. Then Cita's focus should just be to make this accessible to the average Zotero user.

Any experience you have with this landscape would be super valuable. And if there's anything we can do to help - let us know.

baaaadegg commented 4 months ago

Hello,

I am a researcher working on a project that seeks bibliometric data on a large set of journal articles. The number of references per article is one of the measurements we focus on.

I am unacquainted with GROBID, but from the documentation it seems it may accomplish what we need. (See the reference segmenter feature): https://grobid.readthedocs.io/en/latest/training/Bibliographical-references/
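
For the reference count specifically, a rough sketch of how it could be derived from GROBID-style output is shown below. It assumes the service returns TEI XML in which each parsed reference is a `<biblStruct>` element inside a `<listBibl>` element; `countReferences` is a hypothetical helper, and the regular expressions are a crude stand-in for a proper XML parser.

```typescript
// Assumption: references appear as <biblStruct> elements inside <listBibl>.
// Only the first <listBibl> is considered, so that a <biblStruct> describing
// the article itself in the TEI header is not counted.
function countReferences(teiXml: string): number {
  const list = teiXml.match(/<listBibl[\s>][\s\S]*?<\/listBibl>/);
  if (!list) return 0;
  const entries = list[0].match(/<biblStruct[\s>]/g);
  return entries ? entries.length : 0;
}
```

A count obtained this way could then be stored per item and shown as a column, which is the part that would need support in the plugin itself.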

Equally promising, I have discovered the 'zotero-reference' plugin, which parses and counts the references from a PDF in a Zotero library. The key feature this plugin lacks, though, is the option to view 'reference count' as a column in Zotero's main library view. This is crucial given that our study will compare a multitude of bibliographic measurements at once.

Please see the zotero-reference project here (and Google/Chrome translate if necessary): https://github.com/MuiseDestiny/zotero-reference

Fwiw, they seem to be open collaborators and are responsive to requests (see user 'polygon'): https://forums.zotero.org/discussion/comment/429031#Comment_429031

I am writing here to notify the Cita developers of the need, and to ask whether there are still plans to incorporate this feature in the Zotero plugin via GROBID, Scholarcy, or another service.

Thank you!

diegodlh / zotero-cita

Support automatic citation extraction from PDF attachments #61