fsingletonthorn / EffectSizeScraping

MIT License

Article PDFs with keywords / etc. beside the abstract can lead to the abstract being placed within the introduction text #21

Open fsingletonthorn opened 5 years ago

fsingletonthorn commented 5 years ago

This happens because the text is scraped column by column and, when converted to a single column, is ordered from left to right. If the abstract section is read as having two columns, with keywords (or whatever) on the left, the abstract is read in as column two and gets sorted into the text below the introduction text. It is unclear how often this will occur, but it does seem to happen in the ScienceDirect-formatted PDFs. However, they have an XML text mining solution which might make things a bit easier to deal with.
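
A minimal sketch of the ordering problem, not the package's actual code. It assumes each text block comes with an x/y position, and the names `Block`, `column_order`, `banded_order`, `split_x` and `body_top` are all hypothetical. The first function reproduces the column-first sort described above; the second shows one possible workaround, which is to sort the header band (keywords plus abstract) separately from the body.

```python
from typing import NamedTuple


class Block(NamedTuple):
    x: float  # left edge of the text block
    y: float  # top edge, increasing down the page
    text: str


def column_order(blocks, split_x):
    """Read the left column top to bottom, then the right column."""
    ordered = sorted(blocks, key=lambda b: (b.x >= split_x, b.y))
    return [b.text for b in ordered]


def banded_order(blocks, split_x, body_top):
    """Read the header band first, then the body, column by column in each."""
    header = [b for b in blocks if b.y < body_top]
    body = [b for b in blocks if b.y >= body_top]
    return column_order(header, split_x) + column_order(body, split_x)


# Hypothetical first page: keywords beside the abstract, two-column body below.
page = [
    Block(50, 100, "keywords"),
    Block(250, 100, "abstract"),
    Block(50, 300, "introduction (left)"),
    Block(320, 300, "introduction (right)"),
]

print(column_order(page, split_x=200))
print(banded_order(page, split_x=200, body_top=250))
```

With the column-first sort the abstract lands between the two halves of the introduction, while the banded version keeps it ahead of the introduction. In practice `body_top` would have to be detected somehow (for example from a vertical gap or a horizontal rule), which this sketch does not attempt.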