clowder-framework / extractors-s2orc-pdf2text

Extractor to convert pdf to text
Apache License 2.0

Remove reference numbers from text #18

Closed: minump closed this issue 8 months ago

minump commented 11 months ago

The reference numbers (citations) are still present in the .txt file, where they read as if they were part of the surrounding sentence. The corresponding JSON looks like this:

{
                "text": "Both groups had similar baseline characteristics (table 1 ).The median time from randomisation to start of the study infusion was similar in both groups (salbutamol 1\u20223 h, IQR 0\u20226-2\u20225; placebo 1\u20221 h, 0\u20226-2\u20222).Patients in the salbutamol group were more likely to have their infusion stopped early than were those in the placebo group, either because of death (14/161 vs eight of 163), or the development of signifi cant side-eff ects (47/161 vs 13/163).The duration of infusion was on average 24\u20225 h (95% CI 12\u20223-36\u20227) shorter in the salbutamol group than in the placebo group (mean 114\u20221 h [SD 62 \u20227 ] vs 138\u20226 h [47 \u20229] ; fi gure 2).The risks of patients developing a tachycardia, new arrhythmia, or lactic acidosis severe enough to warrant stopping of the study drug were substantially higher in the salbutamol group than in the placebo group (table 2 ).",
                "cite_spans": [
                    {
                        "start": 597,
                        "end": 599,
                        "text": "\u20227",
                        "ref_id": null
                    },
                    {
                        "start": 617,
                        "end": 620,
                        "text": "\u20229]",
                        "ref_id": null
                    }
                ],
                "ref_spans": [
                    {
                        "start": 56,
                        "end": 57,
                        "text": "1",
                        "ref_id": "TABREF1"
                    },
                    {
                        "start": 852,
                        "end": 853,
                        "text": "2",
                        "ref_id": "TABREF3"
                    }
                ],
                "eq_spans": [],
                "section": "Results",
                "sec_num": null
            },

Here the "text" field still contains the substrings identified by "cite_spans" and "ref_spans". These substrings need to be removed from the "text" field.
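As a rough illustration of the requested change, here is a minimal Python sketch that cuts the span offsets out of a paragraph's "text". It assumes "start" and "end" are 0-based character offsets with an exclusive end, which matches the example above (the "1" of "table 1" sits at offset 56). The function name `strip_spans` and the sample paragraph are hypothetical and are not taken from the extractor's code.

```python
def strip_spans(paragraph, span_keys=("cite_spans", "ref_spans")):
    """Return paragraph["text"] with the characters covered by the listed spans removed.

    Hypothetical helper: assumes 0-based offsets with an exclusive "end".
    """
    text = paragraph["text"]
    # Collect (start, end) pairs from every requested span list, in text order.
    spans = sorted(
        (span["start"], span["end"])
        for key in span_keys
        for span in paragraph.get(key, [])
    )
    pieces, cursor = [], 0
    for start, end in spans:
        if start > cursor:
            pieces.append(text[cursor:start])  # keep the text before this span
        cursor = max(cursor, end)              # skip the span; tolerate overlaps
    pieces.append(text[cursor:])               # keep whatever follows the last span
    return "".join(pieces)


# Made-up paragraph in the same shape as the JSON above.
example = {
    "text": "Results were similar (table 1 ) as reported [3].",
    "cite_spans": [{"start": 44, "end": 47, "text": "[3]", "ref_id": "BIBREF2"}],
    "ref_spans": [{"start": 28, "end": 29, "text": "1", "ref_id": "TABREF1"}],
}
cleaned = strip_spans(example)
```

Note that in the paragraph quoted above, both cite_spans have "ref_id": null and cover fragments of numbers ("\u20227" and "\u20229]") rather than real citations, so removing every span unconditionally would also drop those digits. Filtering on "ref_id" before removal is one possible safeguard.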

minump commented 9 months ago

Pushed code to remove ref_spans and cite_spans. The output text file (.txt) will no longer contain the ref_spans and cite_spans text, so its content will differ from the original PDF text. The model might perform better on the cleaned text. The tei.xml file will also need to be modified to remove the reference tags (ref). During matching, the "_predicted.csv" file will contain the cleaned text, which will then be matched against sentences from the cleaned tei.xml file.
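For the tei.xml side, a minimal sketch of dropping the ref tags with Python's standard `xml.etree.ElementTree` could look like the following. It assumes the file uses the standard TEI namespace (`http://www.tei-c.org/ns/1.0`) and that each ref element should be removed together with its own text while the text following it is kept. The function name `remove_ref_elements` and the sample XML are hypothetical and are not the code that was pushed.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # assumed: standard TEI namespace


def remove_ref_elements(root):
    """Remove every <ref> element in place, keeping the text that follows it."""
    ref_tag = f"{{{TEI_NS}}}ref"
    for parent in list(root.iter()):
        previous = None  # last sibling that was kept
        for child in list(parent):
            if child.tag == ref_tag:
                # ElementTree stores the text after an element in its .tail,
                # so move that text onto the previous sibling or the parent.
                tail = child.tail or ""
                if previous is None:
                    parent.text = (parent.text or "") + tail
                else:
                    previous.tail = (previous.tail or "") + tail
                parent.remove(child)
            else:
                previous = child
    return root


# Made-up TEI fragment for illustration.
sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><p>See table '
    '<ref type="table">1</ref> and <hi>bold</hi> text '
    '<ref type="bibr">[3]</ref>.</p></TEI>'
)
root = remove_ref_elements(ET.fromstring(sample))
paragraph_text = "".join(root.find(f"{{{TEI_NS}}}p").itertext())
```

Removing the same spans on both sides is what would keep the sentences in "_predicted.csv" comparable with those extracted from the cleaned tei.xml.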

minump commented 8 months ago

Closing as this is not needed now.