I am working on a Spark job that parses patent documents into an RDD of simple elements such as the document ID, its citations, and its description, and then tries to build a graph from it. However, I can't get any information about the citations: getCitations() returns an empty list. My code is:
val documentsRDD = sc.parallelize(documentsList)
print("\nConverting finished. Length of RDD: " + documentsRDD.count())
// Create the reader inside a function so it is instantiated on the executors
// instead of being serialized from the driver.
val LambdaWrapperForSerialization = () => new PatentReader(bulkFormat)
print("\nParsing documents with patent reader")
val documentsRDDparsed = documentsRDD.flatMap(p => {
  val sReader: StringReader = new StringReader(p)
  val patent: Try[Patent] = Try(LambdaWrapperForSerialization.apply().read(sReader))
  val out = patent match {
    case Success(s) => {
      val patentId: String = s.getDocumentId().toText
      val patentDesc: String = s.getDescription().toString
      val patentTitle: String = s.getTitle()
      // Join all citations into one ";"-separated string.
      val patentCits: String = s.getCitations().asScala.toList.map(c => {
        c.getCitType match {
          case CitationType.PATCIT => c.asInstanceOf[PatCitation].getDocumentId.toString()
          case CitationType.NPLCIT => c.asInstanceOf[NplCitation].getNum()
          case _ => "No citation type matching"
        }
      }).mkString(";")
      Some((patentId, patentTitle, patentCits, patentDesc))
    }
    case Failure(f) => {
      print("Failure: " + f)
      None
    }
  }
  out
})
All the other fields work fine. The bulk file I'm importing contains around 6000 documents from the first week of this year (ca. 700 MB). Am I missing something here?
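To tell "the parser returned no citations" apart from "my mapping produced an empty string", the citation step could be isolated roughly as in the sketch below. The `Citation`, `PatCitation` and `NplCitation` types here are hypothetical stand-ins that I define myself so the snippet runs on its own; they are not the library's classes, and `CitationCheck` is just an illustrative name.

```scala
// Hypothetical stand-ins for the library's citation classes (assumption:
// simplified to a single field each so the sketch is self-contained).
sealed trait Citation
final case class PatCitation(documentId: String) extends Citation
final case class NplCitation(num: String) extends Citation

object CitationCheck {
  // Type patterns replace the getCitType + asInstanceOf combination.
  def describe(c: Citation): String = c match {
    case PatCitation(id)  => id
    case NplCitation(num) => num
  }

  // Joins citations with ";" and makes the empty case visible
  // instead of silently returning an empty string.
  def summarize(cits: Seq[Citation]): String =
    if (cits.isEmpty) "<no citations>"
    else cits.map(describe).mkString(";")
}
```

With this, `CitationCheck.summarize(Seq.empty)` yields a visible marker, so an empty list coming back from the parser would show up in the output rather than looking like a blank field.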