allenai / s2orc

S2ORC: The Semantic Scholar Open Research Corpus: https://www.aclweb.org/anthology/2020.acl-main.447/

Parse PDF to json #1

Closed: Mayar2009 closed this issue 4 years ago

Mayar2009 commented 4 years ago

In your article you mention: "PDF to JSON. We use GROBID (Lopez, 2009) to process each PDF: (i) extract the paper’s Title, Authors, Year, Venue and Abstract, (ii) extract paragraphs from the Body text organized under extracted Section headings, (iii) extract Figure and Table captions, (iv) remove equations, table content, headers, and footers from the body text, (v) extract in-line citations from the abstract and body text, (vi) extract and parse each bibliography entry, identifying its Title, Authors, Year, and Venue, and (vii) link the in-line citation mentions to their corresponding bibliography entries. The resulting parses are expressed in JSON format as described in Appendix E."

Is your parser open source? Can we use it?

kyleclo commented 4 years ago

Hey @Mayar2009 thanks for your interest in the project! We'd be happy to share the parser code once we finish making some more changes / cleaning up the code.

Mayar2009 commented 4 years ago

Thanks! I hope it will be soon. Please keep me posted on the progress!

viveksck commented 4 years ago

Hello @kyleclo, I also noticed that the value of the section field is empty in most of the GROBID parses (except, of course, for the abstract). It would be very useful to know whether a citation context comes from the Introduction, Related Work or Methods section.
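To put a number on how often the section field is empty, a rough check like the one below may help. It is only a sketch: it assumes each parse is one JSON object per line and that paragraphs live under a `body_text` list with a `section` string on each entry. Those key names and the helper `section_coverage` are assumptions for illustration, so adjust them to the release you are using.

```python
import json
from collections import Counter


def section_coverage(jsonl_lines):
    """Count body paragraphs with and without a section label.

    `jsonl_lines` is any iterable of JSON strings, one parse per line.
    The keys "body_text" and "section" are assumed, not confirmed here.
    """
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        paper = json.loads(line)
        for para in paper.get("body_text") or []:
            has_section = bool((para.get("section") or "").strip())
            counts["with_section" if has_section else "empty_section"] += 1
    return counts


# Tiny in-memory example (made-up records, not real corpus data).
SAMPLE_LINES = [
    json.dumps({"body_text": [{"text": "a", "section": "Introduction"},
                              {"text": "b", "section": ""}]}),
    json.dumps({"body_text": [{"text": "c"}]}),
]

if __name__ == "__main__":
    print(dict(section_coverage(SAMPLE_LINES)))
```

Passing an open file handle for a decompressed batch instead of `SAMPLE_LINES` would give the same counts for real data.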

I am also wondering if author affiliations could be added to the JSON.

kyleclo commented 4 years ago

Hey @viveksck Thanks for catching that! I agree that the section information is super important, and we've had several users express that they want that information. Unfortunately I think this is an issue with GROBID parsing which isn't super accurate when it comes to section information. I tried out ScienceParse, which is also similarly iffy with sections.

I'm currently in the process of looking into whether there's something that can be easily fixed within GROBID (or modified within our parser) that will capture Sections better. In the meantime, if you happen to know of a PDF parsing tool that handles Sections better than GROBID/ScienceParse, I'd love to try running it through the pipeline.

Regarding author affiliations, they're in there, as you can see from paper_id 140112606 in batch 7290.jsonl.gz:

{
  "title": "Development of a Cassava Starch Extraction Machine",
  "authors": [
    {
      "first": "L",
      "middle": [],
      "last": "Olutayo",
      "suffix": "",
      "affiliation": {
        "laboratory": "",
        "institution": "Rufus Giwa Polytechnic",
        "location": {
          "settlement": "Owo, Ondo-state",
          "country": "Nigeria"
        }
      },
      "email": ""
    },

But I also agree that they're sparse. We didn't focus much on making sure author affiliations were accurate (just author names), and it seems unlikely that we will have time to get to it, since the focus is more on the quality of the body text (e.g. sections, citations, etc.)
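For anyone who wants to see which affiliations are actually populated, here is a minimal sketch that streams a gzipped JSONL batch and yields the non-empty institutions. It follows the `authors` / `affiliation` / `institution` nesting shown in the example above; the top-level `paper_id` key and the helper name `iter_institutions` are assumptions for illustration.

```python
import gzip
import json


def iter_institutions(path):
    """Yield (paper_id, author last name, institution) for non-empty affiliations.

    Assumes one JSON object per line in a gzip-compressed file, with the
    nesting shown in the example above; "paper_id" is an assumed key.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            paper = json.loads(line)
            for author in paper.get("authors") or []:
                affiliation = author.get("affiliation") or {}
                institution = (affiliation.get("institution") or "").strip()
                if institution:
                    yield paper.get("paper_id"), author.get("last", ""), institution
```

Usage would look like `for paper_id, last, inst in iter_institutions("7290.jsonl.gz"): ...`, with the path pointing at a downloaded batch.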

Mayar2009 commented 4 years ago

Besides using GROBID, we could not parse references. If you could help with that, I would be grateful.

kyleclo commented 4 years ago

@Mayar2009 could you explain what you mean by "could not parse references"? I'm pretty sure GROBID outputs references, when it manages to detect them?

Mayar2009 commented 4 years ago

@kyleclo yes, I can see them in the TEI file, but it seems I need a parser to convert the TEI file to a JSON file and get the references in a separate section. That is what I am asking about.
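Until an official converter is available, a rough way to pull the bibliography out of a TEI file into JSON is sketched below using only the Python standard library. This is not the S2ORC parser. It assumes the TEI layout GROBID typically emits (`listBibl` containing `biblStruct` entries with `analytic`, `monogr` and `imprint` children), and the output keys (`ref_id`, `title`, `authors`, `year`, `venue`) and the function name `parse_references` are made up for illustration.

```python
import json
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # how ElementTree names xml:id


def _text(element):
    """Return the flattened, stripped text of an element, or '' if missing."""
    return "".join(element.itertext()).strip() if element is not None else ""


def parse_references(tei_xml):
    """Convert the bibliography of a TEI document into a list of dicts."""
    root = ET.fromstring(tei_xml)
    references = []
    for bibl in root.iterfind(".//tei:listBibl/tei:biblStruct", TEI_NS):
        analytic_title = bibl.find("tei:analytic/tei:title", TEI_NS)
        monogr_title = bibl.find("tei:monogr/tei:title", TEI_NS)
        date = bibl.find(".//tei:imprint/tei:date", TEI_NS)
        authors = [
            {
                "first": pers.findtext("tei:forename", default="", namespaces=TEI_NS),
                "last": pers.findtext("tei:surname", default="", namespaces=TEI_NS),
            }
            for pers in bibl.iterfind(".//tei:author/tei:persName", TEI_NS)
        ]
        has_analytic = analytic_title is not None
        references.append({
            "ref_id": bibl.get(XML_ID),
            # Article title if present, otherwise fall back to the monograph title.
            "title": _text(analytic_title if has_analytic else monogr_title),
            "authors": authors,
            "year": (date.get("when", "") if date is not None else "")[:4],
            # Treat the monograph title as the venue only when an article title exists.
            "venue": _text(monogr_title) if has_analytic else "",
        })
    return references


# Hand-written sample in the assumed layout (not real GROBID output).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><back><div type="references"><listBibl>
    <biblStruct xml:id="b0">
      <analytic>
        <title level="a" type="main">An example title</title>
        <author><persName><forename type="first">A</forename><surname>Author</surname></persName></author>
      </analytic>
      <monogr>
        <title level="j">Example Journal</title>
        <imprint><date type="published" when="2009" /></imprint>
      </monogr>
    </biblStruct>
  </listBibl></div></back></text>
</TEI>"""

if __name__ == "__main__":
    print(json.dumps(parse_references(SAMPLE_TEI), indent=2))
```

Entries that GROBID failed to segment will simply come back with empty fields, so downstream code should tolerate blanks.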

kyleclo commented 4 years ago

Hey @Mayar2009, thanks for this request. We've heard from others who also want access to our TeiXML to JSON parser, so we're planning on releasing it.

Mayar2009 commented 4 years ago

@kyleclo thanks!