WormBase / ACKnowledge

Author Curation to Knowledgebases
MIT License

GROBID transition #296

Closed: draciti closed this issue 4 months ago

draciti commented 6 months ago

Analysis on additional papers:

Valerio: I debugged the text extraction and modified GROBID parameters to include section names (headers) as sentences.

I also checked all the other issues you reported. Most of them are GROBID extraction errors on the Cell Press STAR Protocols paper format, which also produced many errors with the old heuristics-based extraction method. See my replies below:
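For illustration only (this is not the actual ACKnowledge code): a minimal sketch of how section headers can be collected alongside sentences from GROBID's TEI XML output. It assumes sentence segmentation is enabled, so that sentences appear as `<s>` elements and section titles as `<head>` elements. The sample TEI snippet and the function name `sentences_with_headers` are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
HEAD_TAG = f"{{{TEI}}}head"
SENT_TAG = f"{{{TEI}}}s"

# Hypothetical, hand-written TEI fragment in the shape GROBID produces.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div>
      <head>Fluorescence imaging</head>
      <p><s>Cells were imaged live.</s> <s>Images were analysed in batch.</s></p>
    </div>
  </body></text>
</TEI>"""


def sentences_with_headers(tei_xml):
    """Return headers and sentences from the TEI body, in document order."""
    root = ET.fromstring(tei_xml)
    body = root.find(f".//{{{TEI}}}body")
    if body is None:
        return []
    items = []
    for element in body.iter():
        # Treat every <head> (section title) like a sentence of its own.
        if element.tag in (HEAD_TAG, SENT_TAG):
            # Join nested text nodes and collapse runs of whitespace.
            text = " ".join("".join(element.itertext()).split())
            if text:
                items.append(text)
    return items


if __name__ == "__main__":
    for item in sentences_with_headers(SAMPLE_TEI):
        print(item)
```

Note that this sketch also picks up `<head>` elements nested inside figures or tables in the body; whether those should count as sentences is a separate choice.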

Daniela: General comments: for all papers below, I spot-checked sentences from each section (e.g. abstract, introduction, results, figure legends, discussion).

00065855 Checked random sentences from each section, all good. The only sentences that were not extracted were section headers, e.g. 3.1. Method development for quantification of GSH-NEM and GSSG via LC-MS/MS

00065849 It worked great!

00065841 Sentences in SUMMARY not extracted

Section headers not extracted, e.g.: scRNA-seq of aging C. Elegans

Or extracted in tandem with another sentence: Cell-type-specific regulation and TF activity Differential gene expression across cell types is driven by factors that regulate mRNA production and stability.

Some sentences are not extracted: e.g. Gene expression drift is a common correlate of aging.

00065836 Some headers not extracted, e.g.: 'Plasmid and cell line generation'; 'Fluorescence imaging'

An example of a header that was extracted: Recovery of protein synthesis after DNA damage depends on transcription-couple

Lots of sentences on Page 10 (paragraph Prospects and limitations of the RPS assay) were not extracted.

00065832 It did not extract the section 'Before you begin' culturing worms, from 'Here, we present protocols to determine binding between..' to '..and FLAG-tagged nuclear hormone receptors.' Starting on the following page, in the same section, the extraction worked.

Other bits were not extracted, e.g.:

draciti commented 6 months ago

Daniela to test 5 additional papers, then move to production

draciti commented 6 months ago

Tested 5 more papers; results below. Let me know if we need more.

General question: When we click on ‘Sentence level classification’, does it also process the supplementary material, or only the main PDF? If it processes only the main PDF, we could add a separate button for processing the supplementals.

00065854 blob:https://literature.alliancegenome.org/90d4ee65-cfbf-4359-a7d8-b51d7a0faa1e

Abstract not extracted, but the display of the abstract is peculiar, almost as if it were in a separate box. Other than that, it worked great!

00065853 First sentence of the introduction not captured, probably because there is a character (T) that spans two rows: 'THE brain is one of the most studied and complex systems of the biological systems, because neurological disorders are closely related to changes in brain structure'. Other sentences in the paragraph look fine.

First sentence of figure caption #1 not captured, but it is not even possible to copy-paste it manually. Caption of Table III also problematic.

00065850 NOTE: It did not extract any sentence

00065847 It worked great

00065846 It worked great

valearna commented 4 months ago

Is this still ongoing, @draciti? Can we close this issue?

draciti commented 4 months ago

We can close this, @valearna. GROBID is working pretty well!