Building a comprehensive dataset of patent citations
Exploring the universe of patent citations has never been easier. No more complicated data setup, memory issues, or queries that run forever: we host patCit on BigQuery for you.
patCit is community driven and benefits from the support of a reactive team that is happy to help and tackle your next request. This is where academics and industry practitioners meet.
patCit is based on state-of-the-art open source projects and libraries such as grobid/biblio-glutton and spaCy. Even better, patCit continuously improves along with the rest of its ecosystem.
Want to know more? Read the patCit academic presentation or dive into the usage and technical guides on the patCit documentation website.
Receive project updates in your inbox or GitHub feed: join the patCit newsletter and star the repository on GitHub.
Patents are at the crossroads of many innovation nodes: science, open knowledge, products, competition, etc. At patCit, we are building a comprehensive dataset of patent citations to help the community explore this terra incognita.
How do we do it? We use recent progress in Natural Language Processing (NLP) to extract and structure citations into actionable pieces of information.
patCit builds on DOCDB, the largest database of Non Patent Literature (NPL) citations. First, we deduplicate this corpus and organize it into 10 categories. Then, we design and apply category-specific information extraction models using spaCy. Finally, when possible, we enrich the data using external, domain-specific, high-quality databases.
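The deduplication step can be illustrated with a minimal sketch. It assumes that duplicates differ only in case, accents, punctuation and spacing, which is a simplification and not necessarily how patCit deduplicates DOCDB. The function names `normalize` and `deduplicate` are hypothetical and exist only for this example.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(citation):
    """Build a crude matching key: drop accents, case, punctuation and extra spaces."""
    text = unicodedata.normalize("NFKD", citation)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def deduplicate(citations):
    """Group raw citation strings that share the same key (key -> list of variants)."""
    groups = defaultdict(list)
    for raw in citations:
        groups[normalize(raw)].append(raw)
    return dict(groups)


raw_citations = [
    "Smith J., Nature, 2001, pp. 1-10.",
    "SMITH J, Nature 2001 pp 1-10",
    "Doe A., Science, 1999",
]
groups = deduplicate(raw_citations)
```

Here the first two strings collapse into one group, so `groups` holds two entries. A real corpus would likely need fuzzier matching, and the categorization and spaCy extraction steps are not covered by this sketch.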
patCit builds on the Google Patents corpus of USPTO full-text patents. First, we extract patent and bibliographical reference citations. Then, we parse the detected in-text citations into a series of category-dependent attributes using grobid. Patent citations are matched with a standard publication number using the Google Patents matching API, and bibliographical references are matched with a DOI using biblio-glutton. Finally, when possible, we enrich the data using external, domain-specific, high-quality databases.
Category | Citation extraction | Information extraction | Enrichment | BigQuery table | Colab notebook |
---|---|---|---|---|---|
Bibliographical reference | ✓ | ✓ | ✓ | available | |
Patents | ✓ | ✓ | ✓ | available | |
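To give a feel for what detecting in-text patent citations involves, here is a deliberately naive sketch based on a regular expression. It is a stand-in for illustration only: patCit relies on grobid and the Google Patents matching API, not on this pattern. The pattern, the function name `extract_us_patents` and the `US-<digits>` output format are all assumptions made for this example.

```python
import re

# Illustrative pattern only: matches forms such as "U.S. Pat. No. 5,123,456"
# or "US Patent No. 6,000,001". Real patent text has many more variants.
US_PATENT = re.compile(
    r"U\.?\s?S\.?\s+Pat(?:ent)?\.?\s+Nos?\.?\s+(\d{1,2}(?:,\d{3}){2})"
)


def extract_us_patents(text):
    """Return detected US patent numbers as 'US-<digits>' strings, in reading order."""
    return ["US-" + m.group(1).replace(",", "") for m in US_PATENT.finditer(text)]


sample = (
    "The method builds on U.S. Pat. No. 5,123,456 and on the device "
    "disclosed in US Patent No. 6,000,001."
)
found = extract_us_patents(sample)
```

On this sample, `found` contains `US-5123456` and `US-6000001`. Matching such strings to standard publication numbers (with country code and kind code) is the part that the matching API handles in the actual pipeline.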
Find - The patCit dataset is available on BigQuery in an interactive environment. For those who have a smattering of SQL, this is the perfect place to explore the data. It can also be downloaded from Zenodo.
If you are new to Google BigQuery (GBQ) and want to learn the basics, you can take the GBQ Quickstart. It should not take more than 2 minutes and might help a lot!
Access - We maintain detailed documentation on how to access the data once you have found it on BigQuery or Zenodo. See the usage notes on the patCit documentation website.
Interoperate - Interoperability is at the core of patCit's ambition. Whenever possible, we extract unique identifiers to enable data enrichment from domain-specific, high-quality databases. This includes the DOI, PMID and PMCID for bibliographical references, the Technical Doc Number for standards, the Accession Number for genetic databases, the publication number for PATSTAT and Claims, etc. See the specific tables for more details.
Reproduce - You are in the right place. This GitHub repository is the project factory. You can learn more about data recipes and models on the patCit documentation website.
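As a rough illustration of how the identifiers above enable lookups, the sketch below normalizes a list of DOIs and builds a parameterized BigQuery query. The table ID and the `DOI` column name are hypothetical placeholders; check the BigQuery console and the patCit documentation for the real names. Running `run_doi_lookup` assumes that the `google-cloud-bigquery` package is installed and that Google Cloud credentials are configured.

```python
import re

# Hypothetical placeholder: replace with the table ID shown in the BigQuery console.
TABLE_ID = "your-project.your_dataset.bibliographical_reference"

_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def normalize_doi(raw):
    """Strip resolver prefixes and lowercase (DOIs are case-insensitive)."""
    return _DOI_PREFIX.sub("", raw.strip()).lower()


def build_doi_query(table_id, doi_column="DOI"):
    """Return a parameterized query; @dois is bound when the query is executed."""
    return (
        f"SELECT * FROM `{table_id}` "
        f"WHERE LOWER({doi_column}) IN UNNEST(@dois)"
    )


def run_doi_lookup(dois, table_id=TABLE_ID):
    """Run the lookup on BigQuery and return the matching rows as a list."""
    # Imported lazily so the helpers above work without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ArrayQueryParameter(
                "dois", "STRING", [normalize_doi(d) for d in dois]
            )
        ]
    )
    query = build_doi_query(table_id)
    return list(client.query(query, job_config=job_config).result())
```

Binding the DOIs as a query parameter rather than pasting them into the SQL string avoids quoting problems. The same pattern would apply to other identifiers such as PMID or publication numbers, given the appropriate column.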
There are many ways to contribute to patCit, and many of them do not involve coding.
Give feedback - We want to make patCit truly useful to the community, so we are always happy to receive feedback.
Share your thoughts - We believe that discussions are much more valuable when they are shared publicly, so that everyone can benefit from them. Hence, we strongly encourage you to share your issues and requests in the issue section of the patCit GitHub repository.
Feel like coding today? - We will be more than happy to receive contributions from you and the community. We have already started to tag some issues with dedicated labels.
This project was initiated by Gaétan de Rassenfosse (EPFL) and Cyril Verluise (Collège de France) in 2019.
Since then, it has benefited from the contributions of Gabriele Cristelli (EPFL), Francesco Gerotto (Sciences Po), Kyle Higham (Hitotsubashi University) and Lucas Violon (HEC Paris).
We are also thankful to Domenico Golzio for constant support and to @leflix311, @kermitt2, Tim Simcoe (Boston University), @SuperMayo and @wetherbeei for helpful comments.
Contribution details are available in CRediT.