dspinellis / alexandria3k

Local relational access to openly-available publication data sets
GNU General Public License v3.0

Add USPTO Schema #24

Closed by AggelosMargkas 1 year ago

AggelosMargkas commented 1 year ago

Give Alexandria3k full access to all published US patent bibliographic data from 2005 to the present (September 2023).

Add new cursors for the tables whose columns could not be filled with a simple getter function, because they have to cope with different versions of the DTD.

Add a new TableMeta object, USPartiesTableMeta, that holds the columns shared by the usp_inventors, usp_applicants, and usp_agents tables, to avoid duplication. These three tables appear in the DTD under the us-parties element and share many properties, hence the name USPartiesTableMeta.
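To illustrate the idea of sharing column definitions between the three party tables, here is a minimal sketch. It does not use Alexandria3k's actual TableMeta class: `SketchTableMeta`, `SketchPartiesTableMeta`, and every column name below are hypothetical stand-ins chosen for illustration only.

```python
class SketchTableMeta:
    """Hypothetical stand-in for a table description: a name and its columns."""

    def __init__(self, name, columns):
        self.name = name
        self.columns = list(columns)


# Made-up columns assumed to be common to inventors, applicants, and agents.
SHARED_PARTY_COLUMNS = ["patent_id", "sequence", "first_name", "last_name", "country"]


class SketchPartiesTableMeta(SketchTableMeta):
    """Table description that always starts with the shared party columns."""

    def __init__(self, name, extra_columns=()):
        super().__init__(name, [*SHARED_PARTY_COLUMNS, *extra_columns])


# Each table lists only what is specific to it; the rest is defined once.
inventors = SketchPartiesTableMeta("usp_inventors")
applicants = SketchPartiesTableMeta("usp_applicants", ["applicant_type"])
agents = SketchPartiesTableMeta("usp_agents", ["representative_type"])
```

With this layout, a change to a shared column is made in one place and is picked up by all three tables.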

Change the file reading in uspto.py to match how the bulk data are provided. It now reads a folder that represents a whole year and contains all the patents published weekly by the USPTO.

Change the test dataset, and the way the test files read it, accordingly.

Rename PatentsIcprCursor to PatentsDetailsCursor, since it applies to various tables and not only to the icpr_classifications table, which is now named usp_icpr_classifications.
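As a rough illustration of walking a per-year folder of weekly files, the sketch below iterates over every archive in a directory in name order. It is not the code in uspto.py: it assumes the weekly files are ZIP archives, and the function name `iter_weekly_members` and the file names in the demo are made up.

```python
import tempfile
import zipfile
from pathlib import Path


def iter_weekly_members(year_dir):
    """Yield (archive name, member name, content) for each file inside
    every ZIP archive of a year folder, visiting archives in name order."""
    for archive in sorted(Path(year_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            for member in zf.namelist():
                yield archive.name, member, zf.read(member)


def demo():
    """Build a throwaway year folder with two weekly archives and read it."""
    with tempfile.TemporaryDirectory() as year_dir:
        for week, doc in (("week02.zip", b"<b/>"), ("week01.zip", b"<a/>")):
            with zipfile.ZipFile(Path(year_dir) / week, "w") as zf:
                zf.writestr("patents.xml", doc)
        # Consume the generator before the temporary folder is removed.
        return list(iter_weekly_members(year_dir))
```

Sorting the archive names keeps the processing order deterministic, which helps when test results are compared against a fixed sample dataset.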

Add a helper function, alternative_path_getter.
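The PR text does not show the helper's body, so the following is only a guess at the general technique: a getter that tries several element paths in turn, so that one column can be filled from documents written against different DTD versions. The name `first_matching_path_getter`, its signature, and the simplified XML snippets are assumptions, not the project's code or the real DTD structure.

```python
import xml.etree.ElementTree as ET


def first_matching_path_getter(paths):
    """Return a getter that yields the text found at the first path
    present in the given element, or None if no path matches."""

    def getter(element):
        for path in paths:
            found = element.find(path)
            if found is not None and found.text is not None:
                return found.text.strip()
        return None

    return getter


# Illustrative documents: the same value lives under different paths.
OLDER_DOC = ET.fromstring(
    "<patent><parties><applicant><last-name>Doe</last-name></applicant></parties></patent>"
)
NEWER_DOC = ET.fromstring(
    "<patent><us-parties><us-applicant><last-name>Roe</last-name></us-applicant></us-parties></patent>"
)

get_last_name = first_matching_path_getter(
    ["us-parties/us-applicant/last-name", "parties/applicant/last-name"]
)
older_name = get_last_name(OLDER_DOC)
newer_name = get_last_name(NEWER_DOC)
```

Listing the paths in order of preference keeps the per-column logic declarative: supporting another DTD version only means adding one more path to the list.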

Remove some columns of the us_patents table, after I ran a COUNT query over the whole dataset and it returned 0. Removed columns: microform_number, hague_filing_date, hague_reg_pub_date, hague_reg_date, sir_flag

Update the relationships of the tables under us_patents in the uspto.dot file.

Add tests for the new tables, checking that the record counts obtained both with and without partitioning match the number of entries in the sample dataset.

Fix a double space in orcid.py.
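A check of this kind can be sketched with SQLite's `COUNT(column)`, which counts only non-NULL values. The exact query used for the PR is not shown, so the helper name `non_null_counts`, the tiny table layout, and the `title` column below are assumptions made for the example.

```python
import sqlite3


def non_null_counts(connection, table, columns):
    """Return a dict mapping each column to its number of non-NULL values."""
    counts = {}
    for column in columns:
        query = f'SELECT COUNT("{column}") FROM "{table}"'
        (count,) = connection.execute(query).fetchone()
        counts[column] = count
    return counts


# Tiny in-memory stand-in for the populated database (made-up schema).
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE us_patents (id INTEGER PRIMARY KEY, title TEXT, sir_flag TEXT)"
)
connection.executemany(
    "INSERT INTO us_patents (title, sir_flag) VALUES (?, ?)",
    [("Widget", None), ("Gadget", None)],
)

counts = non_null_counts(connection, "us_patents", ["title", "sir_flag"])
# Columns that never hold a value are candidates for removal.
empty_columns = [column for column, count in counts.items() if count == 0]
```

Note that `COUNT(column)` still counts empty strings, so a column filled only with empty strings would not be reported as empty by this sketch.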

AggelosMargkas commented 1 year ago

A thumbs up for done is fine

Thank you, I am on it!

dspinellis commented 1 year ago

Well done!