dccabs / pair

Pair project

Can we get rid of any of this data? #15

Open dccabs opened 7 years ago

dccabs commented 7 years ago

Updated with some 2017 info.

Here's an example of our data set from 2017. All years look the same; I just chose this one because it's small. If you see something we don't need, let me know. Give me the key and value.

For example, say "we don't need applicationNumberText.value" and I'll know what you mean.

{
        "applicationDataOrProsecutionHistoryDataOrPatentTermData": [{
            "applicationNumberText": {
                "value": "14434661",
                "electronicText": "14434661"
            },
            "filingDate": "2017-01-18",
            "applicationTypeCategory": "UTILITY",
            "partyBag": {
                "applicantBagOrInventorBagOrOwnerBag": [{
                    "partyIdentifierOrContact": [{
                        "value": "1009"
                    }]
                }, {
                    "inventorOrDeceasedInventor": [{
                        "contactOrPublicationContact": [{
                            "name": {
                                "personNameOrOrganizationNameOrEntityName": [{
                                    "personStructuredName": {
                                        "firstName": "Jinming",
                                        "lastName": "Cui"
                                    }
                                }]
                            },
                            "cityName": "Guangzhou City, Guangdong",
                            "countryCode": "CN"
                        }],
                        "sequenceNumber": "1"
                    }, {
                        "contactOrPublicationContact": [{
                            "name": {
                                "personNameOrOrganizationNameOrEntityName": [{
                                    "personStructuredName": {
                                        "firstName": "Shijie",
                                        "lastName": "Zeng"
                                    }
                                }]
                            },
                            "cityName": "Guangzhou City, Guangdong",
                            "countryCode": "CN"
                        }],
                        "sequenceNumber": "2"
                    }, {
                        "contactOrPublicationContact": [{
                            "name": {
                                "personNameOrOrganizationNameOrEntityName": [{
                                    "personStructuredName": {
                                        "firstName": "Olaf",
                                        "lastName": "Eichstaedt"
                                    }
                                }]
                            },
                            "cityName": "Guangzhou City, Guangdong",
                            "countryCode": "CN"
                        }],
                        "sequenceNumber": "3"
                    }, {
                        "contactOrPublicationContact": [{
                            "name": {
                                "personNameOrOrganizationNameOrEntityName": [{
                                    "personStructuredName": {
                                        "firstName": "Jiandong",
                                        "lastName": "Huang"
                                    }
                                }]
                            },
                            "cityName": "Guangzhou City, Guangdong",
                            "countryCode": "CN"
                        }],
                        "sequenceNumber": "4"
                    }, {
                        "contactOrPublicationContact": [{
                            "name": {
                                "personNameOrOrganizationNameOrEntityName": [{
                                    "personStructuredName": {
                                        "firstName": "Ruxu",
                                        "lastName": "Du"
                                    }
                                }]
                            },
                            "cityName": "Guangzhou City, Guangdong",
                            "countryCode": "CN"
                        }],
                        "sequenceNumber": "5"
                    }]
                }, {
                    "primaryExaminerOrAssistantExaminerOrAuthorizedOfficer": [{
                        "name": {
                            "personNameOrOrganizationNameOrEntityName": [{
                                "personStructuredName": {
                                    "lastName": "-"
                                }
                            }]
                        }
                    }]
                }]
            },
            "groupArtUnitNumber": {
                "value": "1799",
                "electronicText": "1799"
            },
            "applicationConfirmationNumber": "1320",
            "applicantFileReference": "1971-006",
            "patentClassificationBag": {
                "cpcClassificationBagOrIPCClassificationOrECLAClassificationBag": [{
                    "ipcrClassification": [{
                        "patentClassificationText": "435"
                    }, {
                        "patentClassificationText": "288.700"
                    }]
                }]
            },
            "businessEntityStatusCategory": "SMALL",
            "firstInventorToFileIndicator": true,
            "inventionTitle": {
                "content": ["Device for Cell Culturing and Processing"]
            },
            "applicationStatusCategory": "Application Dispatched from Preexam, Not Yet Docketed",
            "applicationStatusDate": "2017-02-02",
            "officialFileLocationCategory": "ELECTRONIC",
            "patentPublicationIdentification": {
                "publicationNumber": "0"
            },
            "patentGrantIdentification": {
                "patentNumber": "0"
            }
        }, null, {
            "applicationPublication": {
                "patentPublicationIdentification": {
                    "publicationNumber": " 0  ",
                    "publicationDate": "0001-01-01"
                },
                "webURI": "http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p=1&u=/netahtml/PTO/srchnum.html&r=1&f=G&l=50&s1=0 .PGNR.&OS=DN/0 &RS=DN/0 "
            },
            "grantPublication": {
                "patentGrantIdentification": {
                    "patentNumber": "0"
                },
                "webURI": " "
            }
        }],
        "st96Version": "V2_0",
        "ipoVersion": "US_V6_0"
    }
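
A minimal sketch of how a dotted key like the one above could be turned into a deletion inside a batch script. The helper name `drop_path` and the choice of `applicationNumberText.electronicText` as the field to drop are illustrative assumptions; nothing in this thread has decided which fields go. The sketch descends through arrays automatically, since most of the nesting in the sample is arrays of objects (and it skips `null` entries like the one in the sample).

```python
import copy

def drop_path(node, dotted_path):
    """Remove the key named by dotted_path, descending through lists on the way."""
    head, _, rest = dotted_path.partition(".")
    if isinstance(node, list):
        # Lists have no keys of their own, so apply the same path to each item.
        for item in node:
            drop_path(item, dotted_path)
    elif isinstance(node, dict) and head in node:
        if rest:
            drop_path(node[head], rest)
        else:
            del node[head]
    return node

# A cut-down record using keys from the sample above.
record = {
    "applicationNumberText": {"value": "14434661", "electronicText": "14434661"},
    "groupArtUnitNumber": {"value": "1799", "electronicText": "1799"},
    "filingDate": "2017-01-18",
}

# Work on a copy so the original record stays available for comparison.
pruned = drop_path(copy.deepcopy(record), "applicationNumberText.electronicText")
print(pruned["applicationNumberText"])
```

Working on a deep copy keeps the change reversible while the list of removable keys is still under discussion.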
absoluke commented 7 years ago

Hmmm... this is a great question. I'm going to tag Chris on this, too. I'm wondering whether there is any information that we wouldn't want in this API, just in case. I'm adding Chris and will look closer and think it over. This is like a foreign language to me; I have no programming background except Fortran in my first year of college.

dccabs commented 7 years ago

Hey Luke, I removed most of the data; now it's just one single patent record.

This is just the format that the data is stored in. If you go to the PAIR bulk data website and search by application number for "14434661", you will see the visual representation of this same data on their website.

Totally not a big deal if we want to keep it all. I'm just running some batch scripts to edit all these files at once and figured if we didn't need any of this shit, I'd remove it.

absoluke commented 7 years ago

Some fields are less needed than others, and many important fields are blank in this example because it is such a newly filed application that it hasn't published yet in the US. All in all, though, we may need all of these fields at some point, depending on the info we need to serve to website requests or to clients through alerts. Now, based on what I am seeing, I think something I feared is going on with this data. If this is all the data that can be gotten for any record, we are limited to certain functionality, like being able to serve up "Status" and "Status Date", which is really important and the first feature we'd like to provide in bulk and/or behind the "soft wall". The data I'm seeing in the code above is the basic application data for this record. Go to the traditional Public PAIR interface at http://portal.uspto.gov/pair/PublicPair, enter the captcha, enter that application number, and see what I mean. The data in the code above represents the "Application Data" tab. Notice that there are several other tabs of data, some metadata and some with images, namely the following tabs:

Transaction History; Image File Wrapper; Continuity Data; Foreign Priority; Address & Attorney/Agent; Assignments; and Display References

I am almost certain that to provide a full-fledged competing PAIR monitoring service, we will eventually have to bulk up the API to use Reed Tech's scraped data and image data from these other tabs. As an important example, many customers will want to know when, and only when, a specific "type" of event happens in the "Transaction History" or "Prosecution History" tabs, because these events trigger things they need to look up or tell their clients. The best way to see this is to click on the "Image File Wrapper" tab and look at all the document images that are available for download. To the left of each one is a "Document Code", and somewhere in my past I've seen a document that identifies what each Document Code means. Our competitors, like Cardinal (our former employer) and Reed Tech, scrape the images/PDFs and document codes daily and send email alerts to customers when a specific code appears, including a link to, or an attached PDF of, the document. So the paying customer gets an alert the day after something happens, then simply clicks and reviews the document. It is intelligence on demand without having to go into the main Public PAIR website.

So that leads me to one simple question: are you seeing any OTHER data in the JSON, XML, or elsewhere that reflects at least the text/metadata present on any of those other tabs? If not, right now the max capability of our tool (albeit still very useful) is to give the current status for a big list of applications all at once, along with the status date, the title of the invention, or any of the other fields in that JSON above. However, to eventually offer a fully competing alert service, bury the competition, and pay for our vacation homes in the Caribbean, we will ultimately need to expose all of the PAIR tabs and images through the API and provide full-featured alerts at a better price, with a better, more user-friendly interface. Just wanted to make sure we are on the same page. However, if there is more data in these JSONs besides just the "Application Data" tab data, we can probably provide some great features even without fucking with the images or the "Document" codes at this time. Hell, if we have the document codes somewhere in there, we can tell customers exactly what's happening when something hits an important code state, just without the images, which they can go and download themselves. Also, companies take the image data and sell full file histories at several bucks per application. Make sense?

That will be the difference between them and us, at least at the outset. But the funny thing is that, hundreds of thousands of times per day, web users go to Public PAIR just to retrieve "status", which is definitely in your JSON data sample above. No one that I am aware of is serving it for free, other than the USPTO, and no one is letting you enter a full list and quickly grab statuses all at once. So if the answer to my question above is "shit Luke, it's only the Application Data tab", then we still have a valuable service, and the question is how to monetize just that with soft walls, verified email addresses, credit card accounts, etc. On a positive note, the data we may not have from this JSON has to be distributed by Reed Tech, and as we grow, we can use that data plus a better service and a better brand to fully bury the five or so competitors doing this in the industry. And I mean they are not good at it.
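
A rough sketch of the "enter a full list and grab statuses all at once" idea, using only fields that appear in the JSON sample above. The function names and the in-memory `index` keyed by application number are assumptions made for illustration; how the mirrored data is actually stored has not been decided in this thread.

```python
def status_summary(record):
    """Pick out the fields named in the discussion from one application record."""
    title = record.get("inventionTitle", {}).get("content", [])
    return {
        "applicationNumber": record.get("applicationNumberText", {}).get("value"),
        "status": record.get("applicationStatusCategory"),
        "statusDate": record.get("applicationStatusDate"),
        "title": title[0] if title else None,
    }

def batch_status(application_numbers, index):
    """Look up every requested number; unknown numbers map to None."""
    return {
        number: status_summary(index[number]) if number in index else None
        for number in application_numbers
    }

# Hypothetical index built from the mirrored data, keyed by application number.
index = {
    "14434661": {
        "applicationNumberText": {"value": "14434661"},
        "inventionTitle": {"content": ["Device for Cell Culturing and Processing"]},
        "applicationStatusCategory": "Application Dispatched from Preexam, Not Yet Docketed",
        "applicationStatusDate": "2017-02-02",
    }
}

# "00000000" is a made-up number that is not in the index.
results = batch_status(["14434661", "00000000"], index)
for number, summary in results.items():
    print(number, summary)
```

Returning `None` for unknown numbers, rather than raising, lets one bad entry in a long list pass through without failing the whole batch.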

Let me know if I have confused you and, more importantly, let me know if there are any data fields other than these from the Application Data tab. Here's a screenshot of the tabs on the traditional Public PAIR website: [image: screen shot 2017-02-21 at 9 45 11 pm] https://cloud.githubusercontent.com/assets/25540660/23196961/8dca155c-f882-11e6-95ed-d881fc07ad5b.png

clmulk commented 7 years ago

Shit. I hadn't really thought of that.

However, the good news, IMO, is that the major milestones for an application DO show up in the status data. For instance, when it first hits the system, it says something like: "Docketed case, Ready for Examination"

Next up, I believe, is "Non-Final Rejection Mailed", and after that, "Final Rejection Mailed".

Then it can go in a few directions, namely "Notice of Allowance", "Issue Fee Paid", "On Appeal", and I'm sure a handful more.

Said "good news" is that most people who want PAIR alerts only care about one of these major milestones. They don't really care when an applicant files an amended spec, or a list of inventors, or an IDS - it's fairly meaningless from a "monitoring" perspective.

Our biggest client, for instance, explicitly said that he wouldn't want to see every event, just a Notice of Allowance, which again IS reflected in the "status" data.

So, at least there's that. Our system can just be tailored to the major events. I don't think there's a ton of value add to the other shit. Besides, I expect us to undercut pricing on EVERYONE, because we're automating. I think that combined with not having to talk to anyone to set up an alert will outweigh the need to actually go to PAIR after the notification of a milestone is received.
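
To make the "tailored to the major events" idea concrete, here is a small sketch that compares two daily snapshots of status text and reports only the changes that match a watched milestone. The function name, the snapshot shape (application number mapped to status string), the made-up number "00000001", and the exact status wording are all assumptions for illustration; the real strings in the data may differ.

```python
# Milestone phrases to alert on. The thread names "Notice of Allowance" as the
# event its biggest client cares about; the exact wording in the data is assumed.
WATCHED = ("Notice of Allowance",)

def milestone_alerts(previous, current, watched=WATCHED):
    """Return (application number, new status) pairs for watched status changes."""
    alerts = []
    for number, status in sorted(current.items()):
        if status == previous.get(number):
            continue  # Unchanged since the last snapshot, nothing to report.
        # Case-insensitive substring match, so longer status text still matches.
        if any(phrase.lower() in status.lower() for phrase in watched):
            alerts.append((number, status))
    return alerts

yesterday = {
    "14434661": "Docketed case, Ready for Examination",
    "00000001": "Non-Final Rejection Mailed",
}
today = {
    "14434661": "Notice of Allowance",
    "00000001": "Final Rejection Mailed",
}

alerts = milestone_alerts(yesterday, today)
print(alerts)
```

Here the second application also changed status, but it produces no alert because "Final Rejection Mailed" is not in the watched list.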

Just my two cents. I truly believe if this is as far as we can take it (or want to for now), it will still be very popular.


dccabs commented 7 years ago

To answer your question Luke, I don't think so. I'm pretty sure that's all the data the PAIR public API is giving us.

Basically, if you can't do it on the PAIR bulk data site, you can't do it on ours. And vice versa: anything you can do on theirs you can do on ours, but with unlimited requests.

My thinking is along the lines of Chris's. We set out to build a service that is a batch request tool for several types of numbers and to get statuses, right? Let's accomplish that and make it awesome. Then we'll start iterating on top of that. We'll do as much as we can with our API.

Getting into the business of scraping html pages isn't really my cup of tea. It takes way too much time to set that up, and to maintain it. But let's cross that bridge when we get to it.

Right now I see the plan as this.

  1. Get the API mirror site up and running.
  2. Set up the mechanism for it to update on a daily basis.
  3. Build the batch request tool (already have a semi-working prototype for this part).
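
For step 3, one small piece of the batch request tool could be cleaning up whatever list a user pastes in. This is only a sketch under assumptions: it handles application numbers only (not the other number types mentioned above), it assumes entries are separated by whitespace, newlines, or semicolons, and the function name is made up. It is not the existing prototype.

```python
import re

def parse_application_numbers(text):
    """Split pasted input into unique, digits-only application numbers."""
    numbers = []
    # Commas are not treated as separators because they can appear inside a
    # formatted number such as 14/434,661.
    for token in re.split(r"[\s;]+", text.strip()):
        digits = re.sub(r"\D", "", token)  # Drop slashes, commas, and so on.
        if digits and digits not in numbers:
            numbers.append(digits)
    return numbers

pasted = "14/434,661\n14434661 15/000,001;"
parsed = parse_application_numbers(pasted)
print(parsed)
```

Keeping the first-seen order means the results can be shown back to the user in the same order they entered the list.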