freelawproject / juriscraper

An API to scrape American court websites for metadata.
https://free.law/juriscraper/
BSD 2-Clause "Simplified" License

NY App Div CaseNames lack proper spacing #36

Closed brianwc closed 8 years ago

brianwc commented 9 years ago

See cases from Oct. 23 such as MaxonAlcoHoldings,LLCvSTSSteel,Inc. (N.Y. App. Div. 2014) https://www.courtlistener.com/?q=&stat_Precedential=on&order_by=dateFiled+desc&court=nyappdiv

mlissner commented 9 years ago

That's sad, but there is some good news. We already have code to handle this kind of heinous camelcasing of case names:

https://github.com/freelawproject/juriscraper/blob/master/lib/string_utils.py#L171

So, I think we just need to throw this into the NY App Div scraper somewhere and we should be off and running again.

Depending on how many badly spaced case names we've already collected, we may want to write a fix-script, or alternatively just fix them manually.

There are also tests for fix_camel_case:

https://github.com/freelawproject/juriscraper/blob/master/tests/tests.py#L388

So if it doesn't quite work for NY App Div at first, we can tweak it without too much worry. Just add more test cases, run them, tweak the code till they pass, etc.
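For anyone picking this up, here is a rough sketch of the kind of re-spacing involved. It is not the fix_camel_case code linked above; `space_run_together_name` is a hypothetical name and the four regex rules are assumptions chosen to handle the example from this issue.

```python
import re


def space_run_together_name(name: str) -> str:
    """Hypothetical helper: re-insert spaces into a run-together case name."""
    # 1. A "v" wedged between a letter (or period) and a capital letter is
    #    treated as the "versus" separator. Do this first, before other splits.
    name = re.sub(r"(?<=[A-Za-z.])v(?=[A-Z])", " v. ", name)
    # 2. Add the missing space after any comma that is glued to the next token.
    name = re.sub(r",(?=\S)", ", ", name)
    # 3. Split where a lowercase letter runs straight into an uppercase one.
    name = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name)
    # 4. Split an all-caps run from a following capitalized word (STSSteel).
    name = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", " ", name)
    return name


print(space_run_together_name("MaxonAlcoHoldings,LLCvSTSSteel,Inc."))
# → Maxon Alco Holdings, LLC v. STS Steel, Inc.
```

Rules this blunt will mangle names like "McDonald" (rule 3 splits it), which is exactly the sort of case worth adding to the tests before tweaking.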

mlissner commented 9 years ago

Anybody can take this on. Removing my assignment on this one.

arderyp commented 8 years ago

Is this still an issue? I found the page with @brianwc's example above, pointed our current nyappdiv_3rd scraper at it, scraped it, dumped the JSON, and I see what appears to be a properly formatted case name:

{
    "case_names": "Maxon Alco Holdings, LLC v. STS Steel, Inc.",
    "case_dates": "2016-03-03",
    "blocked_statuses": false,
    "download_urls": "http://decisions.courts.state.ny.us/ad3/Decisions/2014/517378.pdf",
    "precedential_statuses": "Published",
    "case_name_shorts": "",
    "docket_numbers": "517378"
},

That being said, there is an issue with the parsing of dual docket numbers. Some entries legitimately show dual docket numbers with a slash delimiter, like xxxxxx/yyyyyy, but some entries have a slash delimiter and no second docket number (maybe human error?). On the same page linked above, we see "515342/Matter of Neroni v Granis", which parses to:

{
    "case_names": "of Neroni v. Granis",
    "case_dates": "2016-03-03",
    "blocked_statuses": false,
    "download_urls": "http://decisions.courts.state.ny.us/ad3/Decisions/2014/515342-515341.pdf",
    "precedential_statuses": "Published",
    "case_name_shorts": "Granis",
    "docket_numbers": "515342/Matter"
},

I can fix this later in the week.
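Roughly what I have in mind, as a sketch only and not the scraper's actual code: treat the leading digits (plus any further slash-joined digit groups) as the docket number, and swallow a stray trailing slash before the case name starts. `DOCKET_PREFIX` and `split_docket_and_name` are hypothetical names, and the second sample string below is made up for illustration.

```python
import re

# Leading docket number(s): digits, optionally followed by more digit groups
# joined by "/", then an optional stray "/" before the case name begins.
DOCKET_PREFIX = re.compile(r"^\s*(\d+(?:/\d+)*)/?\s*(.*)$")


def split_docket_and_name(text):
    """Return (docket_numbers, case_name) for one listing entry."""
    match = DOCKET_PREFIX.match(text)
    if match is None:
        # No leading digits at all: treat the whole entry as the case name.
        return "", text.strip()
    return match.group(1), match.group(2).strip()


# The entry from the page above, with a slash but no second number.
print(split_docket_and_name("515342/Matter of Neroni v Granis"))
# → ('515342', 'Matter of Neroni v Granis')

# A made-up entry with a genuine dual docket number.
print(split_docket_and_name("515342/515341 Matter of Neroni v Granis"))
# → ('515342/515341', 'Matter of Neroni v Granis')
```

Because the slash only counts as part of the docket when digits follow it, "Matter" stays with the case name instead of being pulled into docket_numbers.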

mlissner commented 8 years ago

Yeah, let's close this one since the spacing problem is fixed, and I'll leave the docket number issue in your hands? Or we can open another issue for that, if you wish. It'd be great to get that fixed.

arderyp commented 8 years ago

@mlissner yup, I'll submit a PR to fix the docket number issue shortly