dspinellis / alexandria3k

Local relational access to openly-available publication data sets
GNU General Public License v3.0

Add USPTO data processing #15

Closed AggelosMargkas closed 1 year ago

AggelosMargkas commented 1 year ago

Add processing of USPTO data files.

After opening the directory and iterating over the zip files, the code appends each weekly USPTO zip file to a list, "file_path". Then, for every weekly zip, it counts the XML documents in the concatenated XML file inside, which holds many small XML documents. Each of these small XML documents represents one US patent grant.
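The listing and counting step could look roughly like the sketch below. It is an illustration only, not the code in this PR: the function names `list_weekly_zips` and `count_patent_documents` are hypothetical, and it assumes that every patent document in a concatenated file starts with an XML declaration at the beginning of a line.

```python
import os
import zipfile

# Assumption: each concatenated patent document begins with this marker.
XML_DECLARATION = b"<?xml"


def list_weekly_zips(directory):
    """Return the sorted paths of the zip files found in directory."""
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.lower().endswith(".zip")
    )


def count_patent_documents(zip_path):
    """Count the XML declarations across all members of one weekly zip."""
    count = 0
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            # Stream the member line by line to avoid loading it whole.
            with archive.open(member) as stream:
                for line in stream:
                    if line.startswith(XML_DECLARATION):
                        count += 1
    return count
```

Streaming the member line by line keeps memory use flat, which matters because a weekly file can be large.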

A generator yields, on demand, a container ID that identifies each US patent. No parsing takes place at this stage.
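A minimal sketch of such a generator follows. It assumes the container ID is simply the ordinal position of the document within the concatenated file, and the name `patent_chunks` is hypothetical; the PR may derive the ID differently.

```python
def patent_chunks(lines):
    """Yield (container_id, raw_xml) pairs from concatenated XML lines.

    The text is only split at XML declarations; it is never parsed here.
    """
    container_id = 0
    buffer = []
    for line in lines:
        # A new declaration closes the document collected so far.
        if line.startswith("<?xml") and buffer:
            yield container_id, "".join(buffer)
            container_id += 1
            buffer = []
        buffer.append(line)
    if buffer:
        # Emit the last document, which has no following declaration.
        yield container_id, "".join(buffer)
```

Because it is a generator, a consumer that stops early never pays for splitting the rest of the file.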

This class uses caching. When moving to the next available chunk, the content is cached under its container ID. Whether the lookup hits the cache or not, an element tree is returned. This element tree object is the current row value.
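The caching idea can be sketched as a one-entry cache keyed by the container ID. The class name `ElementTreeCache` and the `parse_count` counter are hypothetical and exist only for illustration; the sketch uses the standard library's `xml.etree.ElementTree`, which may differ from what the PR uses.

```python
import xml.etree.ElementTree as ET


class ElementTreeCache:
    """Keep the parsed tree of the most recently requested container."""

    def __init__(self):
        self.container_id = None
        self.tree = None
        self.parse_count = 0  # Illustration only: number of real parses

    def get(self, container_id, content):
        """Return the element tree for content, parsing only on a miss."""
        if container_id != self.container_id:
            # Cache miss: parse the chunk and remember it.
            self.tree = ET.ElementTree(ET.fromstring(content))
            self.container_id = container_id
            self.parse_count += 1
        return self.tree
```

Repeated column reads for the same row then reuse one parsed tree, and a new parse happens only when the cursor moves to the next chunk.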

dspinellis commented 1 year ago

Something I've found useful when developing a large PR or issue is to create and maintain a list of corresponding TODO items.

It helps me keep track of required actions.

dspinellis commented 1 year ago

Regarding the population problem, enable SQL debug logging and compare the commands issued for a working data source with those issued for USPTO.
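As a generic illustration of capturing SQL statements for such a comparison, the sketch below uses the trace callback of Python's standard `sqlite3` module. It does not show alexandria3k's own debug option; the helper `traced_connection` and the `patents` table are hypothetical.

```python
import sqlite3


def traced_connection(database, log):
    """Open a connection that appends every executed statement to log."""
    connection = sqlite3.connect(database)
    connection.set_trace_callback(log.append)
    return connection


statements = []
connection = traced_connection(":memory:", statements)
connection.execute("CREATE TABLE patents (id INTEGER PRIMARY KEY, title TEXT)")
connection.execute("INSERT INTO patents (title) VALUES ('example')")
connection.commit()

# Print the captured statements, ready to diff against another run.
for statement in statements:
    print(statement)
```

Collecting one such log per data source and diffing the two makes a missing or malformed statement easy to spot.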

dspinellis commented 1 year ago

Please merge when you're done.