some questions about 0.jsonl for example

Mayar2009 commented 4 years ago

1. Why is the doi field sometimes a string ["10.1029/2002JB001919"] and sometimes just a list of strings?
2. Why do you have two abstract fields in the jsonl files, one in metadata and the other in grobid_parse or latex_parse? What is the difference between them? I did not understand.
3. In grobid_parse, why is abstract sometimes [] and sometimes null?
4. What do you mean by other_ids? Is it just the DOI?
5. In bib_entries (I do not remember which paper, but as an example):

for b10

                              'BIBREF10': {'authors': [{'first': 'J',
                                                        'last': 'Lyons',
                                                        'middle': ['J'],
                                                        'suffix': ''},
                                                       {'first': 'G',
                                                        'last': 'Waite',
                                                        'middle': ['P'],
                                                        'suffix': ''},
                                                       {'first': 'M',
                                                        'last': 'Ichihara',
                                                        'middle': [],
                                                        'suffix': ''},
                                                       {'first': 'J',
                                                        'last': 'Lees',
                                                        'middle': ['M'],
                                                        'suffix': ''}],
                                           'issn': '',
                                           'links': None,
                                           'other_ids': {},
                                           'pages': '',
                                           'ref_id': 'b10',
                                           'title': 'Tilt prior to '
                                                    'explosions and the '
                                                    '494 effect of '
                                                    'topography on '
                                                    'ultra-long-period '
                                                    'seismic records at '
                                                    'Fuego volcano',
                                           'venue': '',
                                           'volume': '',
                                           'year': 2012}

But the reference in the paper was already keyed as 'BIBREF10', so why do we need ref_id inside 'BIBREF10'?

'cite_spans': [{'end': 1698, 'latex': None, 'ref_id': 'BIBREF10', 'start': 1678, 'text': '[Lyons et al., ' '2012]'}, {'end': 1863, 'latex': None, 'ref_id': None, 'start': 1857, 'text': '[2011]'}],

Mayar2009 commented 4 years ago

what do you mean by latex in ref_spans, eq_spans for example?

Mayar2009 commented 4 years ago

Hi again! What is the difference between

grobid_parse = {'abstract': None, 'body_text': None, 'ref_entries': None, 'bib_entries': {}} for paper_id = 10022478 and grobid_parse = None for paper_id = 100017287, for example, in your jsonl files?

==============================

You have introduced the schema "authors": [ { "first": "first_name", "middle": ["middle_name"], "last": "last_name", "suffix": "suffix_name" } ]

but

paper_id = 199502503 has 'first', 'middle', 'last', 'suffix', 'affiliation', 'email', so I am confused now.

kyleclo commented 4 years ago

Hey @Mayar2009, can you edit your comment to shorten the large dump of data? It's a bit hard to address your questions because I have to scroll through a lot of text to find them. Thanks!

Mayar2009 commented 4 years ago

yes I did @kyleclo

kyleclo commented 4 years ago

1. Regarding DOI, that seems to be a bug. Can you point me to which paper_id has that? I'll fix it for the next release. Thanks for catching it.
2. Ah, that's intentional. There are 2 ways we can get abstracts: from publisher-provided metadata (e.g. PubMed), or parsed from the PDF itself. We decided to include both because internally, there are times we want to use the publisher-provided abstracts, and times we want the abstract parsed from PDFs. The publisher-provided abstracts are kept under "metadata" while the PDF-parsed abstracts are nested under "*_parse".
4. In bibliographies, sometimes people will include other identifiers (e.g. some people include DOIs or arXiv IDs or PubMed IDs). That would normally be contained within the "other_ids" field. It's rare because most people don't include such identifiers in their bibliography entries.
5. Ah yes, that's an artifact of the previous processing that we're currently thinking about whether to keep or remove. It doesn't serve any tangible purpose currently.
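
Until the DOI inconsistency is fixed, a reader of the files could normalize the field defensively. The following is a minimal sketch, assuming the value is either None, a single string, or a list of strings. normalize_doi is a hypothetical helper name, not part of the dataset tooling.

```python
from typing import List, Optional, Union


def normalize_doi(value: Union[None, str, List[str]]) -> List[str]:
    """Return DOIs as a list of strings, whatever shape the field had.

    Assumption: the field is None, a single string, or a list of strings.
    """
    if value is None:
        return []
    if isinstance(value, str):
        # A bare string becomes a one-element list; empty strings are dropped.
        return [value] if value else []
    # Already a list: keep only non-empty entries.
    return [v for v in value if v]


print(normalize_doi("10.1029/2002JB001919"))
print(normalize_doi(["10.1029/2002JB001919"]))
```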

kyleclo commented 4 years ago

what do you mean by latex in ref_spans, eq_spans for example?

Are you looking at the grobid parse or the latex parse?

kyleclo commented 4 years ago

grobid_parse = {'abstract': None, 'body_text': None, 'ref_entries': None, 'bib_entries': {}} for paper_id = 10022478 and grobid_parse = None for paper_id = 100017287, for example, in your jsonl files?

If a paper did not come with an accompanying PDF, there was nothing to parse. Hence, we leave the parse as None.

If the paper came with an accompanying PDF, and the PDF parsing executed successfully, then there will be a Dictionary of fields.

For transparency, we wanted to keep these two cases separate (that is, "We got a PDF but didn't process it correctly" vs "We never got a PDF"), but I can see how it's confusing.

We'll consider removing these for a future release.
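
The distinction described above can be checked explicitly when reading a record. This is a rough sketch under stated assumptions: parse_state and its three labels are hypothetical names, and treating a parse dictionary without body_text as an "empty" parse is an interpretation of the two examples in this thread, not documented behavior.

```python
def parse_state(paper: dict, key: str = "grobid_parse") -> str:
    """Classify a record by what its parse field contains.

    Labels are illustrative only (assumption, not dataset terminology).
    """
    parse = paper.get(key)
    if parse is None:
        # Per the reply above: no accompanying PDF, so nothing was parsed.
        return "no_parse"
    if not parse.get("body_text"):
        # A parse dictionary exists but carries no body text.
        return "empty_parse"
    return "has_body_text"


empty = {"abstract": None, "body_text": None, "ref_entries": None, "bib_entries": {}}
print(parse_state({"grobid_parse": None}))
print(parse_state({"grobid_parse": empty}))
```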

kyleclo commented 4 years ago

You have introduced the schema "authors": [ { "first": "first_name", "middle": ["middle_name"], "last": "last_name", "suffix": "suffix_name" } ]

but

paper_id = 199502503 has 'first', 'middle', 'last', 'suffix', 'affiliation', 'email', so I am confused now.

Nice catch! That's a bug, thanks for identifying it. I'll look into fixing it.

Mayar2009 commented 4 years ago

All these issues are in 0.jsonl, and I have not finished exploring everything I want to (what do you mean by latex in ref_spans, eq_spans for example?). So far I am concentrating on the grobid parse; I have not decided yet whether to look at the latex parse.

kyleclo commented 4 years ago

(what do you mean by latex in ref_spans, eq_spans for example?). So far I am concentrating on the grobid parse; I have not decided yet whether to look at the latex parse.

latex does not exist for grobid parses, only for latex parses. We'll remove it in the future.

All these issues are in 0.jsonl, and I have not finished exploring everything I want to

Yea, as with all large data releases, there's going to be things we didn't catch; thanks for identifying these. We'll make adjustments in subsequent releases

Mayar2009 commented 4 years ago

It does exist in the grobid parse. Please look again at the schema you introduced:

"grobid_parse": {
    "abstract": [
      {
        "text": "abstract_string",
        "cite_spans": [],
        "ref_spans": [],
        "eq_spans": null,
        "section": "Abstract"
      }
    ],
    "body_text": [
      {
        "text": "paragraph_string",
        "cite_spans": [
          {
            "start": 15,
            "end": 18,
            "text": "cite_span_text",
            "latex": null,
            "ref_id": "bibkey (index for bib_entries)"
          }
        ],
        "ref_spans": [
          {
            "start": 0,
            "end": 8,
            "text": "ref_span_text",
            "latex": null,
            "ref_id": "refkey (index for ref_entries)"
          }
        ],
        "eq_spans": [],
        "section": null
      }
    ],
    "ref_entries": {
      "refkey": {
        "text": "ref_string",
        "latex": null,
        "type": "ref_type (figure, table, equation, etc)"
      }
    },
    "bib_entries": {
      "bibkey": {
        "ref_id": "string",
        "title": "title_string",
        "authors": [
          {
            "first": "first_name",
            "middle": [],
            "last": "last_name",
            "suffix": "suffix_name"
          }
        ],
        "year": 2019,
        "venue": "venue_string",
        "volume": "volume_string",
        "issn": "issue_number_string",
        "pages": "pages_string",
        "other_ids": {
          "doi": ["doi_string"]
        },
        "links": "linked_paper_id"
      }
    }
  }

Thanks for your response!

kyleclo commented 4 years ago

Yes, the keys exist in all Span json objects. I mean that they're always null valued for the grobid_parse and only exist for the latex_parse.
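
To make the span fields in the schema concrete, here is a small sketch that resolves each cite span to its bibliography entry. It assumes, as in the example earlier in this thread, that start and end are character offsets into the paragraph text with an exclusive end, and that ref_id is either a key of bib_entries or None. resolve_cite_spans is a hypothetical helper name.

```python
def resolve_cite_spans(parse: dict) -> list:
    """Pair every citation mention with its bib entry title, if linked."""
    bib = parse.get("bib_entries") or {}
    resolved = []
    for para in parse.get("body_text") or []:
        text = para["text"]
        for span in para.get("cite_spans") or []:
            # ref_id may be None when the mention could not be linked.
            entry = bib.get(span.get("ref_id"))
            resolved.append({
                "mention": text[span["start"]:span["end"]],
                "ref_id": span.get("ref_id"),
                "title": entry["title"] if entry else None,
                "latex": span.get("latex"),  # null in grobid parses per the reply above
            })
    return resolved


# Tiny made-up record shaped like the schema above.
text = "As shown by [Lyons et al., 2012] and [2011]."
m1, m2 = "[Lyons et al., 2012]", "[2011]"
s1, s2 = text.index(m1), text.index(m2)
example_parse = {
    "body_text": [{
        "text": text,
        "cite_spans": [
            {"start": s1, "end": s1 + len(m1), "text": m1, "latex": None, "ref_id": "BIBREF10"},
            {"start": s2, "end": s2 + len(m2), "text": m2, "latex": None, "ref_id": None},
        ],
    }],
    "bib_entries": {"BIBREF10": {"ref_id": "b10", "title": "Tilt prior to explosions"}},
}
for row in resolve_cite_spans(example_parse):
    print(row)
```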

Mayar2009 commented 4 years ago

OK. If I may ask, when will the future release be ready?

kyleclo commented 4 years ago

Likely sometime in May

Mayar2009 commented 4 years ago

thanks! we are waiting)

Mayar2009 commented 4 years ago

@kyleclo Hi! I got your point about the abstract, and I have another question about it, please. First, does your example_papers.jsonl file have the same structure as the 0.jsonl file? Second, after looking at the abstracts of the first 20 papers in, for example, the papers.jsonl file, I got this result:

paper[0]["metadata"]["abstract"] which has number (104172): no abstract in metadata
paper[0]["grobid_parse"]["abstract"] which has number (104172): []

paper[1]["metadata"]["abstract"] which has number (1003291): no abstract in metadata
paper[1]["grobid_parse"]["abstract"] which has number (1003291): None

*paper[2]["abstract"] which has number (1009792): By systematic examination of common tag single-nucleotide polymorphisms (SNPs) across the genome, the genome-wide association study (GWAS) has proven to be a successful approach to identify genetic variants that are associated with complex diseases and traits. Although the per base pair cost of sequencing has dropped dramatically with the advent of the next-generation technologies, it may still only be feasible to obtain DNA sequence data for a portion of available study subjects due to financial constraints. Two-phase sampling designs have been used frequently in large-scale surveys and epidemiological studies where certain variables are too costly to be measured on all subjects. We consider two-phase stratified sampling designs for genetic association, in which tag SNPs for candidate genes or regions are genotyped on all subjects in phase 1, and a proportion of subjects are selected into phase 2 based on genotypes at one or more tag SNPs. Deep sequencing in the region is then applied to genotype phase 2 subjects at sequence SNPs. We investigate alternative sampling designs for selection of phase 2 subjects within strata defined by tag SNP genotypes and develop methods of inference for sequence SNP variant associations using data from both phases. In comparison to methods that use data from phase 2 alone, the combined analysis improves efficiency. paper[2]["grobid_parse"]["abstract"] which has number (1009792): None

paper[3]["metadata"]["abstract"] which has number (10006097): no abstract in metadata
paper[3]["grobid_parse"]["abstract"] which has number (10006097): None

paper[4]["metadata"]["abstract"] which has number (10022478): no abstract in metadata
paper[4]["grobid_parse"]["abstract"] which has number (10022478): None

*paper[5]["abstract"] which has number (100203934): Nickel oxyhydroxide (NiOOH) is considered to be one of the best-known catalysts for the water oxidation reaction. Recently, progress has been made in pushing the limits of water splitting efficiency by incorporating NiOOH in photo-electrochemical cell architectures. Despite these cutting-edge advances, some basic questions have yet been fully answered. This perspective highlights the three most critical questions that are considered to be the very first step for any theoretical investigation. We suggest possible ways to answer these questions from a theoretician’s perspective. Progress toward this direction is expected to shed light on the origin of NiOOH’s success. paper[5]["grobid_parse"]["abstract"] which has number (100203934): None

*paper[6]["abstract"] which has number (10052029): The objective of this research is to develop a reliable non invasive method to measure fat content in beef and fish fillet using ultrasound A-Mode scan. The results of the fat measurement from this non invasive method were then compared to the results of fat measurement using a proven method. The Soxhlet method is a standard fat measurement procedure recommended by the Association of Official Analytical Chemist (AOAC). The samples used in this investigation are chicken, meat and fish fillets. The experimental results showed that there is correlation between fat content and the measured ultrasound velocity travelled in the sample. This indicates that fat measurement using the ultrasound A-Mode scan technique can be used to determine fat content in fillet with reasonable accuracy. paper[6]["grobid_parse"]["abstract"] which has number (10052029): None

*paper[7]["abstract"] which has number (10070197): Antipsychotic drugs are effective for the treatment of schizophrenia. However, the functional consequences and subcellular sites of their accumulation in nervous tissue have remained elusive. Here, we investigated the role of the weak-base antipsychotics haloperidol, chlorpromazine, clozapine, and risperidone in synaptic vesicle recycling. Using multiple live-cell microscopic approaches and electron microscopy of rat hippocampal neurons as well as in vivo microdialysis experiments in chronically treated rats, we demonstrate the accumulation of the antipsychotic drugs in synaptic vesicles and their release upon neuronal activity, leading to a significant increase in extracellular drug concentrations. The secreted drugs exerted an autoinhibitory effect on vesicular exocytosis, which was promoted by the inhibition of voltage-gated sodium channels and depended on the stimulation intensity. Taken together, these results indicate that accumulated antipsychotic drugs recycle with synaptic vesicles and have a use-dependent, autoinhibitory effect on synaptic transmission. paper[7]["grobid_parse"]["abstract"] which has number (10070197): [{'text': 'SUMMARYAntipsychotic drugs are effective for the treatment of schizophrenia. However, the functional consequences and subcellular sites of their accumulation in nervous tissue have remained elusive. Here, we investigated the role of the weak-base antipsychotics haloperidol, chlorpromazine, clozapine, and risperidone in synaptic vesicle recycling. Using multiple live-cell microscopic approaches and electron microscopy of rat hippocampal neurons as well as in vivo microdialysis experiments in chronically treated rats, we demonstrate the accumulation of the antipsychotic drugs in synaptic vesicles and their release upon neuronal activity, leading to a significant increase in extracellular drug concentrations. 
The secreted drugs exerted an autoinhibitory effect on vesicular exocytosis, which was promoted by the inhibition of voltage-gated sodium channels and depended on the stimulation intensity. Taken together, these results indicate that accumulated antipsychotic drugs recycle with synaptic vesicles and have a use-dependent, autoinhibitory effect on synaptic transmission.', 'cite_spans': [], 'ref_spans': [], 'eq_spans': [], 'section': 'Abstract'}]

paper[8]["metadata"]["abstract"] which has number (10103267): no abstract in metadata
paper[8]["grobid_parse"]["abstract"] which has number (10103267): None

*paper[9]["abstract"] which has number (10105476): Understanding the penetration dynamics of intruders in granular beds is relevant not only for fundamental physics, but also for geophysical processes and construction on sediments or granular soils in areas potentially affected by earthquakes. While the penetration of intruders in two dimensional (2D) laboratory granular beds can be followed using video recording, this is useless in three dimensional (3D) beds of non-transparent materials such as common sand. Here, we propose a method to quantify the sink dynamics of an intruder into laterally shaken granular beds based on the temporal correlations between the signals from a reference accelerometer fixed to the shaken granular bed, and a probe accelerometer deployed inside the intruder. Due to its analogy with the working principle of a lock-in amplifier, we call this technique lock-in accelerometry. paper[9]["grobid_parse"]["abstract"] which has number (10105476): []

*paper[10]["abstract"] which has number (101178602): Abstract From extraction experiments with 22Na as a tracer, the extraction constant corresponding to the equilibrium Na+(aq) + A−(aq) + L(nb) ⇔ NaL+(nb) + A−(nb) taking place in the two-phase water–nitrobenzene system (A− = picrate, L = dibenzo-21-crown-7; aq = aqueous phase, nb = nitrobenzene phase) was evaluated as logKex(NaL+,A−) = 1.9±0.1. Further, the stability constant of the dibenzo-21-crown-7-sodium complex in nitrobenzene saturated with water was calculated for a temperature of 25 °C log βnb(NaL+) = 7.1±0.1. paper[10]["grobid_parse"]["abstract"] which has number (101178602): None

*paper[11]["abstract"] which has number (10125224): As cattle mature, the dietary protein requirement, as a percentage of the diet, decreases. Thus, decreasing the dietary CP concentration during the latter part of the finishing period might decrease feed costs and N losses to the environment. Three hundred eighteen medium-framed crossbred steers (315 +/- 5 kg) fed 90% (DM basis) concentrate, steam-flaked, corn-based diets were used to evaluate the effect of phase-feeding of CP on performance and carcass characteristics, serum urea N concentrations, and manure characteristics. Steers were blocked by BW and assigned randomly to 36 feedlot pens (8 to 10 steers per pen). After a 21-d step-up period, the following dietary treatments (DM basis) were assigned randomly to pens within a weight block: 1) 11.5% CP diet fed throughout; 2) 13% CP diet fed throughout; 3) switched from an 11.5 to a 10% CP diet when approximately 56 d remained in the feeding period; 4) switched from a 13 to an 11.5% CP diet when 56 d remained; 5) switched from a 13 to a 10% CP diet when 56 d remained; and 6) switched from a 13 to an 11.5% CP diet when 28 d remained. Blocks of cattle were slaughtered when approximately 60% of the cattle within the weight block were visually estimated to grade USDA Choice (average days on feed = 182). Nitrogen volatilization losses were estimated by the change in the N:P ratio of the diet and pen surface manure. Cattle switched from 13 to 10% CP diets with 56 d remaining on feed or from 13 to 11.5% CP with only 28 d remaining on feed had lower (P < 0.05) ADG, DMI, and G:F than steers fed a 13% CP diet throughout. Steers on the phase-feeding regimens had lower (P = 0.05) ADG and DMI during the last 56 d on feed than steers fed 13.0% CP diet throughout. Carcass characteristics were not affected by dietary regimen. Performance by cattle fed a constant 11.5% CP diet did not differ from those fed a 13% CP diet. 
Serum urea N concentrations increased (P < 0.05) with increasing dietary CP concentrations. Phase-feeding decreased estimated N excretion by 1.5 to 3.8 kg/steer and nitrogen volatilization losses by 3 to 5 kg/steer. The results suggest that modest changes in dietary CP concentration in the latter portion of the feeding period may have relatively small effects on overall beef cattle performance, but that decreasing dietary CP to 10% of DM would adversely affect performance of cattle fed high-concentrate, steam-flaked, corn-based diets. paper[11]["grobid_parse"]["abstract"] which has number (10125224): None

*paper[12]["abstract"] which has number (10138504): A method to stabilize a fuzzy logic controller (FLC) is presented in which the sliding mode control (SMC) method and the adaptive control (ADC) scheme are properly incorporated to overcome the unstable characteristics of the conventional FLC. The SMC scheme is adopted to compensate for minimum approximation error (MAE) due to limited approximation capability of FLC which can make the FLC system unstable. The ADC scheme is used to tune the center values of membership functions in a direction of keeping stability. The suggested method can be considered as FLC or SMC, depending on a switching condition and each advantage is shown in terms of FLC or SMC. Finally, simulations are given to illustrate effectiveness of the results. paper[12]["grobid_parse"]["abstract"] which has number (10138504): None

*paper[13]["abstract"] which has number (10157395): The complete genome sequence of a human enterovirus 71 strain (SH12-276), isolated from a fatal case in Shanghai in 2012, was determined. Phylogenetic analysis based on the complete genome sequence classified this strain into subgenotype C4. paper[13]["grobid_parse"]["abstract"] which has number (10157395): [{'text': 'The complete genome sequence of a human enterovirus 71 strain (SH12-276), isolated from a fatal case in Shanghai in 2012, was determined. Phylogenetic analysis based on the complete genome sequence classified this strain into subgenotype C4.', 'cite_spans': [], 'ref_spans': [], 'eq_spans': [], 'section': 'Abstract'}]

*paper[14]["abstract"] which has number (10164018): We investigate the problem of reader-aware multi-document summarization (RA-MDS) and introduce a new dataset for this problem. To tackle RA-MDS, we extend a variational auto-encodes (VAEs) based MDS framework by jointly considering news documents and reader comments. To conduct evaluation for summarization performance, we prepare a new dataset. We describe the methods for data collection, aspect annotation, and summary writing as well as scrutinizing by experts. Experimental results show that reader comments can improve the summarization performance, which also demonstrates the usefulness of the proposed dataset. The annotated dataset for RA-MDS is available online. paper[14]["grobid_parse"]["abstract"] which has number (10164018): [{'text': 'AbstractWe investigate the problem of readeraware multi-document summarization (RA-MDS) and introduce a new dataset for this problem. To tackle RA-MDS, we extend a variational auto-encodes (VAEs) based MDS framework by jointly considering news documents and reader comments. To conduct evaluation for summarization performance, we prepare a new dataset. We describe the methods for data collection, aspect annotation, and summary writing as well as scrutinizing by experts. Experimental results show that reader comments can improve the summarization performance, which also demonstrates the usefulness of the proposed dataset. The annotated dataset for RA-MDS is available online 1 .', 'cite_spans': [], 'ref_spans': [], 'eq_spans': [], 'section': 'Abstract'}]

*paper[15]["abstract"] which has number (10172550): The next revision of the international standard for high-voltage measurement techniques, IEC 60060-1, has been planned to include a new method for evaluating the parameters associated with lightning impulse voltages. This would be a significant improvement on the loosely defined existing method which is, in part, reliant on operator judgment and would ensure that a single approach is adopted worldwide to determine peak voltage, front, and tail times, realizing standardization in measured parameters across all laboratories. Central to the proposed method is the use of a K-factor to attenuate oscillations and overshoots that can occur with practical generation of impulse voltages for testing on high-voltage equipment. It is proposed that a digital filter that matches the K-factor gain characteristic be implemented and used for this purpose. To date, causal filter designs have been implemented and assessed. This paper is concerned with the potential application of a noncausal digital filter design to emulate the K-factor. The approach has several advantages; the resulting design is only second order, it can be designed without using optimization algorithms, it is a zero-phase design and it matches the K-factor almost perfectly. Parameter estimation using waveforms from the IEC 61083-2 test data generator and experimental impulse voltages has been undertaken and obtained results show that the zero-phase filter is the ideal digital representation of the proposed K-factor. The effect of evaluating parameters by the proposed method is compared to mean-curve fitting and the challenge of effective front-time evaluation is discussed. paper[15]["grobid_parse"]["abstract"] which has number (10172550): None

*paper[16]["abstract"] which has number (10173004): The space mapping technique is intended for optimization of engineering models which involve very expensive function evaluations. It is assumed that two different models of the same physical system are available: Besides the expensive model of primary interest (denoted the fine model), access to a cheaper (coarse) model is assumed which may be less accurate.The main idea of the space mapping technique is to use the coarse model to gain information about the fine model, and to apply this in the search for an optimal solution of the latter. Thus the technique iteratively establishes a mapping between the parameters of the two models which relate similar model responses. Having this mapping, most of the model evaluations can be directed to the fast coarse model.In many cases this technique quickly provides an approximate optimal solution to the fine model that is sufficiently accurate for engineering purposes. Thus the space mapping technique may be considered a preprocessing technique that perhaps must be succeeded by use of classical optimization techniques. We present an automatic scheme which integrates the space mapping and classical techniques. paper[16]["grobid_parse"]["abstract"] which has number (10173004): None

*paper[17]["abstract"] which has number (10185888): An online tuning observer based adaptive fuzzy controller with modulated membership function (OAFCMMF) for uncertain nonlinear systems is proposed in this paper. By including micro-genetic algorithm (MGA), the width of the membership functions is modulated based on fuzzy orthogonal condition. The proposed fuzzy controller can online adjust not only weighting factors in the consequence part but also the membership functions in the antecedent part. Computation time is shortened to improve controller performance. Moreover, we use fitness function for online tuning the parameter vector of the fuzzy controller. The fitness function is based on stability criterion established by Lyapunov method. For meeting stability condition, a supervisory controller is implemented in a closed-loop nonlinear system to smoothen controller operation. paper[17]["grobid_parse"]["abstract"] which has number (10185888): None

*paper[18]["abstract"] which has number (10209620): BACKGROUND AND PURPOSE Exhaled carbon monoxide (CO) is associated with cardiometabolic traits, subclinical atherosclerosis, and cardiovascular disease, but its specific relations with stroke are unexplored. We related exhaled CO to magnetic resonance imaging measures of subclinical cerebrovascular disease cross-sectionally and to incident stroke/transient ischemic attack prospectively in the Framingham Offspring study. METHODS We measured exhaled CO in 3313 participants (age 59±10 years; 53% women), and brain magnetic resonance imaging was available in 1982 individuals (age 58±10 years; 54% women). Participants were analyzed according to tertiles of exhaled CO concentration. RESULTS In age- and sex-adjusted models, the highest tertile of exhaled CO was associated with lower total cerebral brain volumes, higher white-matter hyperintensity volumes, and greater prevalence of silent cerebral infarcts (P<0.05 for all). The results for total cerebral brain volume and white-matter hyperintensity volume were consistent after removing smokers from the sample, and the association with white-matter hyperintensity volume persisted after multivariable adjustment (P=0.04). In prospective analyses (mean follow-up 12.9 years), higher exhaled CO was associated with 67% (second tertile) and 97% (top tertile) increased incidence of stroke/transient ischemic attack relative to the first tertile that served as referent (P<0.01 for both). These results were consistent in nonsmokers and were partially attenuated upon adjustment for vascular risk factors. CONCLUSIONS In this large, community-based sample of individuals without clinical stroke/transient ischemic attack at baseline, higher exhaled CO was associated with a greater burden of subclinical cerebrovascular disease cross-sectionally and with increased risk of stroke/transient ischemic attack prospectively. 
Further investigation is necessary to explore the biological mechanisms linking elevated CO with stroke. paper[18]["grobid_parse"]["abstract"] which has number (10209620): None

*paper[19]["abstract"] which has number (10212622): A routing heuristic is presented thatroutes two-terminal nets one at a time, for each net choosing the path so as to avoid adversely impacting the nets not yet routed. An algorithm is presented and proved to correctly implement this heuristic; the computational complexity of that algorithm is shown to be polynomially bounded, but perhaps still too great to be of practical use. Another, speedier algorithm is presented that seems to approximate the heuristic rather closely. Strong evidence is given that the Lee routing algorithm is in some sense inadequate to implement this heuristic. The heuristic has been applied, with very encouraging results, to a specific routing problem: the routing of a channel in which all four sides of the channel may contain terminals. This problem arises in the layout of custom VLSI. paper[19]["grobid_parse"]["abstract"] which has number (10212622): None

My questions are: Why does the abstract have two different structures in metadata and grobid_parse? To my knowledge, citations are not allowed in the abstracts of scientific papers. Why is the abstract sometimes None and sometimes []?

kyleclo commented 4 years ago

Hey @Mayar2009, the abstract in "metadata" and the abstract in "grobid_parse" are different. The former is a gold abstract sourced directly from the publisher (or whichever source we got the paper from). This can have mistakes, but in general we trust these the most. The latter is any abstract that is parsed from the PDF directly. These may not exist because (1) we don't have the PDF, (2) we don't have permission to release text from the PDF, (3) our PDF parsing failed to find the abstract, (4) the PDF was distributed without an abstract [unlikely], or (5) the abstract was parsed but mis-detected as a body paragraph.

It's allowed to have citations in abstracts. Rare but it happens.

None vs [] is our way of documenting whether there was nothing to parse (None) or parsing failed (empty list). We're reconsidering whether that was a good decision since it seems to be confusing
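
Given the two sources described above, one possible way to pick an abstract is to prefer the publisher-provided one and fall back to the PDF-parsed paragraphs. This is only a sketch: the path paper["metadata"]["abstract"] is assumed from the examples in this thread, and pick_abstract is a hypothetical helper name.

```python
def pick_abstract(paper: dict, parse_key: str = "grobid_parse"):
    """Return (abstract_text, source) or (None, None) if neither exists.

    Assumption: the publisher abstract lives at paper["metadata"]["abstract"].
    """
    meta_abstract = (paper.get("metadata") or {}).get("abstract")
    if meta_abstract:
        return meta_abstract, "metadata"
    # Fall back to the parsed abstract; None and [] both yield no paragraphs.
    parse = paper.get(parse_key) or {}
    paragraphs = parse.get("abstract") or []
    text = " ".join(p["text"] for p in paragraphs if p.get("text"))
    if text:
        return text, parse_key
    return None, None


record = {
    "metadata": {"abstract": None},
    "grobid_parse": {"abstract": [{"text": "abstract_string", "cite_spans": []}]},
}
print(pick_abstract(record))
```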

Mayar2009 commented 4 years ago

@kyleclo thanks for the immediate response! So in your parsing logic it is always:
[]: parsing failed
None: there was nothing to parse

Another observation, which may be the same as issue #10:

the section field of any paper['grobid_parse']['body_text'] entry is always None

Mayar2009 commented 4 years ago

I could not understand why get_citation_contexts = [] for many papers, for example paper number 104212197 in the 0.jsonl file

even though the paper passed these conditions

if not paper:
    return []

# No grobid parse at all (grobid_parse is None or empty)
if not paper['grobid_parse']:
    print(f"paper[{paper['paper_id']}] has no grobid parse")
    return []

# A parse exists but contains no body text
if not paper['grobid_parse']['body_text']:
    print(f"paper[{paper['paper_id']}] has no body text")
    return []

The condition cite_ref in paper['grobid_parse']['bib_entries'] is not satisfied, which leads to the result get_citation_contexts = []

kyleclo commented 4 years ago

[] and None are what we're using internally to keep track of this. I guess our assumption was that people writing if statements in Python wouldn't run into any issues here, because empty lists and None both evaluate to False in a boolean context.
The null section fields is a bug. Will fix in the upcoming update.
Sorry I don't understand what you're referring to by get_citation_contexts = []. Is this code that you've written?

Mayar2009 commented 4 years ago

@kyleclo sorry, maybe I wrote it the wrong way... I mean that the get_citation_contexts function returns [].

Mayar2009 commented 4 years ago

Hi! Could you please explain the purpose of "s2_pdf_hash"? At first I understood it to be the paper's number in the Semantic Scholar dataset; did I misunderstand? What is the best way to get the PDF of a paper from the internet based on your dataset, i.e. which identifier should I rely on (s2_pdf_hash, doi, or another)? This point is not clear to me.

Mayar2009 commented 4 years ago

@kyleclo could you please tell me why a paper sometimes has grobid_parse and latex_parse?

kyleclo commented 4 years ago

The get_citation_contexts function returns [] when there is no full text, or there is no detected citation mention within that full text.
s2_pdf_hash is the SHA1 of the PDF used to produce the grobid_parse. We don't provide any way to get the PDF from the internet.
Papers can have grobid_parse when we had a PDF available to parse. Papers have latex_parse when we had a LaTeX file available to parse.

Mayar2009 commented 4 years ago

@kyleclo I mean there are papers that have both, why?

So s2_pdf_hash is the SHA1 of the PDF used to produce the grobid_parse, and it is not related to the paper ID in the Semantic Scholar database?

kyleclo commented 4 years ago

There are papers for which we have both a PDF and a LaTeX file, in which case, both parses are available.

s2_pdf_hash is the SHA1 of the PDF used to produce the grobid_parse. This often works with SemanticScholar in that semanticscholar/paper/<s2_pdf_hash> will likely return the paper in question, but it's not guaranteed.

Mayar2009 commented 4 years ago

@kyleclo "There are papers for which we have both a PDF and a LaTeX file, in which case, both parses are available." This is still somewhat confusing to me, so forgive me for asking again. For example, in the file 0.jsonl, the paper with paper_id '10164018' has both grobid_parse and latex_parse in the same JSON file, not in two separate JSON files. So I am confused: can a paper be a PDF document and a LaTeX document at the same time, or did I miss something?

kyleclo commented 4 years ago

No worries, let me try explaining a different way.

Most papers on arXiv have an uploaded PDF as well as a LaTeX source file dump. We wrote separate parsers for both the PDF as well as the LaTeX. We don't want to force people to use one text source versus the other, so we included both of these for that same arXiv paper. It's up to you whether you want to use the PDF-parse, the LaTeX-parse, or both, or neither.

We don't want them in separate JSON files because they're technically the same paper, and we want to ensure one-JSON-per-paper.

Think of it more as different representations of the same paper. For example, in the future, when we parse XML or HTML representations of papers, we might have 3 keys: grobid_parse, latex_parse, and xml_parse.
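Under the one-JSON-per-paper layout described above, checking which representations a record carries is a matter of looking at its keys. A minimal sketch, assuming an absent parse is stored as None or simply missing; the names PARSE_KEYS and available_parses are hypothetical:

```python
PARSE_KEYS = ("grobid_parse", "latex_parse")


def available_parses(paper: dict) -> list:
    """List the parse keys that are present and non-empty on this record."""
    # paper.get(key) is falsy when the key is missing, None, or an empty dict.
    return [key for key in PARSE_KEYS if paper.get(key)]
```

A paper sourced from arXiv may return both keys, in which case the caller decides which text source to use.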

Mayar2009 commented 4 years ago

Thanks for the helpful explanation.

Mayar2009 commented 4 years ago

@kyleclo Hi again! Is it possible for a paper to link to itself in bib_entries? In one or more entries of bib_entries, the value of link is sometimes the same as the paper_id. In each of the following cases the value of link equals the paper_id shown:

grobid: 5391048, 1031488, 1144073, 31479391, 5094703, 14991358 (listed twice), 5108903, 436023, 8051500, 17251243, 9004962, 14975634, 14171478, 152281, 15010792, 18878369, 2075553, 31703928, 52800576, 15750809 (listed twice), 16051527, 12623074, 17996972, 9564539, 6422949, 2345236, 2493136, 7275077, 232453, 18691054, 1159457, 3433006, 487442, 24901977, 405878, 12742267, 20356569 (listed twice), 53536870, 7376440

latex: 1144073, 5094703, 2075553, 3231502

In which situation does this happen? Maybe when one section of a paper references another section of the same paper? I understand that bib_entries is for the papers cited in a paper's reference section, which is why this is not clear to me.

kyleclo commented 4 years ago

Hey @Mayar2009, thanks, I'm looking into it; this is most definitely a situation where we couldn't find a better link and forgot to enforce a hard constraint against self-citation. I'd say self-linked references should be null-valued.
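Until that is fixed upstream, a reader can apply the same rule locally. The following is a minimal sketch, not the maintainers' fix. It assumes bib_entries is a dict of entries keyed by reference ID, that each entry's links field holds a single paper ID or None, and that paper_id sits at the top level of the record; the function name null_self_links is hypothetical.

```python
def null_self_links(paper: dict, parse_key: str = "grobid_parse") -> int:
    """Set self-referencing bib entry links to None; return how many were cleared."""
    parse = paper.get(parse_key) or {}
    bib_entries = parse.get("bib_entries") or {}
    # Compare as strings in case one side is stored as an int.
    paper_id = str(paper.get("paper_id"))

    cleared = 0
    for entry in bib_entries.values():
        link = entry.get("links")
        if link is not None and str(link) == paper_id:
            entry["links"] = None
            cleared += 1
    return cleared
```

Call it once with parse_key="grobid_parse" and once with parse_key="latex_parse" for papers that carry both parses.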

kyleclo commented 4 years ago

Hey @Mayar2009, would it be alright if I closed this issue? It's a bit hard to follow since there are a lot of things being discussed in one thread. I believe that with the new release, version 20200705v1, a lot of the normalization issues and missing-field bugs have been resolved. The README also details the schema more clearly now.

allenai / s2orc

some questions about 0.jsonl for example #6

paper[0], number 104172: no abstract in metadata; paper[0]["grobid_parse"]["abstract"] is []

paper[1], number 1003291: no abstract in metadata; paper[1]["grobid_parse"]["abstract"] is None

paper[3], number 10006097: no abstract in metadata; paper[3]["grobid_parse"]["abstract"] is None

paper[4], number 10022478: no abstract in metadata; paper[4]["grobid_parse"]["abstract"] is None

paper[8], number 10103267: no abstract in metadata; paper[8]["grobid_parse"]["abstract"] is None