USPTO / PatentPublicData

Utility tools to help download and parse patent data made available to the public

Returning no / empty Citations #54

Closed by kesslmar 7 years ago

kesslmar commented 7 years ago

I am working on a Spark job that parses the patent documents into an RDD of simple elements (the document ID, its citations, the description, etc.), with the goal of building a graph from it. However, I can't get any information about the citations: getCitations() returns an empty list. My code is:

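import java.io.StringReader
import scala.collection.JavaConverters._
import scala.util.{Failure, Success, Try}
// PatentReader, Patent, CitationType, PatCitation and NplCitation come from the PatentPublicData library.

val documentsRDD = sc.parallelize(documentsList)
print("\nConverting finished. Length of RDD: " + documentsRDD.count())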

val LambdaWrapperForSerialization = () => new PatentReader(bulkFormat)

print("\nParsing documents with patent reader")
val documentsRDDparsed = documentsRDD.flatMap(p => {

  val sReader: StringReader = new StringReader(p)
  val patent: Try[Patent] = Try(LambdaWrapperForSerialization.apply().read(sReader))

  val out = patent match {
    case Success(s) => {
      val patentId: String = s.getDocumentId().toText
      val patentDesc: String = s.getDescription().toString
      val patentTitle: String = s.getTitle()

      // Flatten all citations into one ";"-separated string.
      val patentCits: String = s.getCitations().asScala.toList.map(c => {
        c.getCitType match {
          case CitationType.PATCIT => c.asInstanceOf[PatCitation].getDocumentId.toString()
          case CitationType.NPLCIT => c.asInstanceOf[NplCitation].getNum()
          case _ => "No citation type matching"
        }
      }).mkString(";")

      Some((patentId, patentTitle, patentCits, patentDesc))
    }
    case Failure(f) => {
      print("Failure: " + f)
      None
    }
  }

  out
})

All the other fields work fine. The bulk file I'm importing consists of around 6000 documents from the first week of this year (ca. 700 MB). Am I missing something here?

bgfeldm commented 7 years ago

Are you indexing Patent Applications ("pgpubs") or Patent Grants? Only Grants have citations. Also, Design Patents often don't have citations.
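
One quick way to check which kind of documents a bulk file contains is to look at the root element of each raw XML string before parsing. The sketch below assumes the current USPTO XML bulk format, where grants use a `us-patent-grant` root element and applications use `us-patent-application`. `DocKind`, `classify` and `countByKind` are hypothetical helper names for illustration, not part of PatentPublicData, and the substring check is deliberately naive.

```scala
// Hypothetical helper (not part of PatentPublicData): guesses the document
// kind from the root element found in the raw XML string.
object DocKind {
  sealed trait Kind
  case object Grant extends Kind
  case object Application extends Kind
  case object Unknown extends Kind // e.g. older, non-XML bulk formats

  // Naive substring check on the opening root tag; assumes the
  // us-patent-grant / us-patent-application root element names.
  def classify(rawXml: String): Kind =
    if (rawXml.contains("<us-patent-grant")) Grant
    else if (rawXml.contains("<us-patent-application")) Application
    else Unknown

  // Counts how many documents fall into each kind.
  def countByKind(docs: Seq[String]): Map[Kind, Int] =
    docs.groupBy(classify).map { case (kind, group) => kind -> group.size }
}
```

With the RDD from the question, something like `documentsRDD.map(DocKind.classify).countByValue()` should give the same breakdown. If every document comes back as `Application`, empty citation lists would be expected.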