Closed: AggelosMargkas closed this 1 year ago
Something I've found useful when developing a large PR or issue is to create and maintain a list of corresponding TODO items. It helps me keep track of required actions.
Regarding the population problem, enable SQL debug logging and compare the commands issued for a working data source with those issued for USPTO.
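One way to capture the SQL commands for comparison, sketched here with Python's standard sqlite3 tracing hook (the plugin may use a different driver, in which case its equivalent trace facility applies):

```python
import sqlite3

# Attach a trace callback that records every SQL statement as it executes.
conn = sqlite3.connect(":memory:")

statements = []
conn.set_trace_callback(statements.append)

conn.execute("CREATE TABLE t (id INTEGER)")
conn.execute("INSERT INTO t VALUES (1)")
conn.execute("SELECT * FROM t")

# The collected statements can be diffed between a working data source
# and the USPTO run to spot where population diverges.
print(statements)
```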
Please merge when you're done.
Add processing of USPTO data files.
Add a USPTO class containing the metadata, consistent with the other plugins.
Add a VTable class for instantiating virtual tables (for now only one, "us_patents").
Add a ZipFiles class for processing the directory of the USPTO data source.
After opening the directory and iterating over the zip files, it appends each weekly USPTO zip file to a list "file_path". Afterwards, for every weekly zip it counts the concatenated XML files inside, each of which contains multiple small XML files; each of these small XML files represents one US patent grant.
A generator then yields on demand a container id variable representing each US patent; no parsing takes place at this stage.
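The directory walk and on-demand container ids can be sketched as follows (the function name, the file_path list, and the use of the XML declaration as a chunk delimiter are illustrative assumptions, not the plugin's actual code):

```python
import os
import zipfile

def patent_container_ids(data_directory):
    """Yield a (zip_path, chunk_index) container id for each US patent
    grant, without parsing any XML."""
    file_path = []  # list of weekly USPTO zip files
    for name in sorted(os.listdir(data_directory)):
        if name.endswith(".zip"):
            file_path.append(os.path.join(data_directory, name))

    for path in file_path:
        with zipfile.ZipFile(path) as zf:
            # Assume each weekly zip holds one concatenated XML file;
            # count the chunks by their XML declarations.
            inner = zf.namelist()[0]
            content = zf.read(inner).decode("utf-8", errors="replace")
            chunk_count = content.count("<?xml")
        for index in range(chunk_count):
            yield (path, index)
```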
Add an ItemsCursor class in data_sources.py. Modify Crossref's FileCursor and USPTO's ChunkCursor accordingly.
Add a ChunkCursor class that creates a cursor for iterating through the XML chunks.
This class uses caching: when moving to the next available chunk, caching takes place based on the content and the container id. Hit or miss, it returns an element tree, and this element tree object is the current row value.
Create a PatentsElementsCursor subclass of ElementsCursor. This class provides the next function that the rest of the USPTO cursors share.
Add a PatentsCursor for populating the table 'us_patents'. It points to each row and uses an extraction function to pull the fields needed for the table out of the element tree.
Add a PatentsIpcrCursor for populating the table 'icpr_classifications', similar to the child cursors of Crossref. There is no abstract element_name function to extract specific items as in Crossref; in the case of USPTO, the extraction is a getter function that returns the value_extractor of each ColumnMeta object.
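The getter-based extraction can be sketched like this (this ColumnMeta, its attributes, and the XPath expressions are simplified assumptions about the interface, not the real definitions):

```python
import xml.etree.ElementTree as ET

class ColumnMeta:
    """Simplified column description: a name plus a callable that
    extracts the column's value from a patent element tree."""
    def __init__(self, name, value_extractor):
        self.name = name
        self.value_extractor = value_extractor

    def get_value_extractor(self):
        # The getter the USPTO cursors call; there is no per-element
        # element_name abstraction as in Crossref.
        return self.value_extractor

columns = [
    ColumnMeta("doc_number", lambda tree: tree.findtext(".//doc-number")),
    ColumnMeta("country", lambda tree: tree.findtext(".//country")),
]

tree = ET.fromstring(
    "<us-patent-grant><publication-reference><document-id>"
    "<country>US</country><doc-number>11000000</doc-number>"
    "</document-id></publication-reference></us-patent-grant>"
)
row = [c.get_value_extractor()(tree) for c in columns]
print(row)  # ['11000000', 'US']
```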
Add a data file containing one compressed XML file into the directory "path/to/tests/data/April 2023 Patent Grant Bibliographic Data".
Add complete testing for USPTO, including test cases similar to those of Crossref. (+ Add test cases for the zip decompression caching.)
Add a small dataset (50 kB) containing 10 patents, including patents with different application types; the first two have the same file name. (+ Update this by adding one smaller file that contains three chunks, one of which contains icpr_classifications. This way, testing of zip file opening and testing of detail tables with conditions can both take place.)
Add caching in file_xml_cache.py. It checks the container_id and, on a hit, returns the cached data; otherwise it parses the XML chunk.
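The container_id check might look roughly like this (the class and attribute names here are assumptions for illustration, not the contents of file_xml_cache.py):

```python
import xml.etree.ElementTree as ET

class FileXmlCache:
    """Cache the parsed element tree of the most recently read chunk,
    keyed by its container id."""
    def __init__(self):
        self.cached_id = None
        self.cached_tree = None

    def read(self, container_id, chunk_text):
        # Hit: the requested chunk is the one already parsed.
        if container_id == self.cached_id:
            return self.cached_tree
        # Miss: parse the XML chunk and remember it.
        self.cached_tree = ET.fromstring(chunk_text)
        self.cached_id = container_id
        return self.cached_tree

cache = FileXmlCache()
first = cache.read(0, "<patent><id>1</id></patent>")
second = cache.read(0, "<patent><id>ignored</id></patent>")
print(first is second)  # True: the second call is served from the cache
```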
Add caching of zip files. Every zip file is read once, and the concatenated XML chunks it contains are passed into a list, so that after the first read the contents are accessed instantly. Add test cases.
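The read-once zip caching can be sketched as follows (the class name and the XML-declaration split are illustrative assumptions):

```python
import zipfile

class ZipCache:
    """Read each zip file once, splitting its concatenated XML content
    into a list of chunks served from memory on later accesses."""
    def __init__(self):
        self.chunks_by_path = {}

    def chunks(self, zip_path):
        if zip_path not in self.chunks_by_path:
            with zipfile.ZipFile(zip_path) as zf:
                inner = zf.namelist()[0]
                content = zf.read(inner).decode("utf-8", errors="replace")
            # Split on the XML declaration; drop the empty leading piece.
            parts = content.split("<?xml")
            self.chunks_by_path[zip_path] = [
                "<?xml" + p for p in parts if p.strip()
            ]
        return self.chunks_by_path[zip_path]
```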
Add a helper file xml.py. This file contains helper functions for extracting elements and attributes from parsed XML files.
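Such helpers typically wrap the repeated find/attribute patterns; a minimal sketch (the function names are assumptions, not necessarily those in xml.py):

```python
import xml.etree.ElementTree as ET

def get_element(tree, path):
    """Return the text of the element at path, or None if absent."""
    element = tree.find(path)
    return element.text if element is not None else None

def get_attribute(tree, path, attribute):
    """Return the named attribute of the element at path, or None."""
    element = tree.find(path)
    return element.get(attribute) if element is not None else None

tree = ET.fromstring('<grant><date kind="issue">20230425</date></grant>')
print(get_element(tree, "date"))            # 20230425
print(get_attribute(tree, "date", "kind"))  # issue
print(get_element(tree, "missing"))         # None, instead of raising
```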
Fix orcid.py and uspto.py to use these helper functions instead of each defining similar processing functions.
Add documentation for xml.py on the plugin API page, and document the US patent grant dataset.