Filimoa / open-parse

Improved file parsing for LLM’s
https://filimoa.github.io/open-parse/
MIT License
2.34k stars, 89 forks

No whitespace in text? #50

Closed. Filimoa closed this issue 2 months ago.

Filimoa commented 3 months ago

Discussed in https://github.com/Filimoa/open-parse/discussions/49

Originally posted by **JBGruber** June 7, 2024

I tried to parse a few complex PDFs, which worked really well. Now I put in a simpler one and was surprised to see that the result contains no whitespace. Not sure if I'm doing something wrong or if this might be a bug:

```python
import openparse
import urllib.request

urllib.request.urlretrieve("https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0207996&type=printable", "test.pdf")
basic_doc_path = "test.pdf"
parser = openparse.DocumentParser()
parsed_basic_doc = parser.parse(basic_doc_path)
print(parsed_basic_doc.nodes[2].text)
#> Abstract
#> a1111111111
#> a1111111111
#> a1111111111
#> a1111111111
#> a1111111111
#> **Introduction**
#> Exploitinginformationinhealth-relatedsocialmediaservicesisofgreatinterestforpatients,
#> researchersandmedicalcompanies.Thechallengeis,however,toprovideeasy,quickand
#> relevantaccesstothevastamountofinformationthatisavailable.Onesteptowardsfacili-
#> tatinginformationaccesstoonlinehealthdataisopinionmining.Eventhoughtheclassifica-
#> tionofpatientopinionsintopositiveandnegativehasbeenpreviouslytackled,mostworks
#> makeuseofmachinelearningmethodsandbagsofwords.Ourfirstcontributionisanexten-
#> siveevaluationofdifferentfeatures,includinglexical,syntactic,semantic,network-based,
#> sentiment-basedandwordembeddingsfeaturestorepresentpatient-authoredtextsfor
#> polarityclassification.Thesecondcontributionofthisworkisthestudyofpolarfacts(i.e.
#> objectiveinformationwithpolarconnotations).Traditionally,thepresenceofpolarfactshas
#> beenneglectedandresearchinpolarityclassificationhasbeenboundedtoopinionated
#> texts.Wedemonstratetheexistenceandimportanceofpolarfactsforthepolarityclassifica-
#> tionofhealthinformation.
#> **Received:**January30,2018
```

Using copy and paste in a PDF reader, it looks like this:

> Exploiting information in health-related social media services is of great interest for patients,
> researchers and medical companies. The challenge is, however, to provide easy, quick and
> relevant access to the vast amount of information that is available. One step towards facili-
> tating information access to online health data is opinion mining. Even though the classifica-
> tion of patient opinions into positive and negative has been previously tackled, most works
> make use of machine learning methods and bags of words. Our first contribution is an exten-
> sive evaluation of different features, including lexical, syntactic, semantic, network-based,
> sentiment-based and word embeddings features to represent patient-authored texts for
> polarity classification. The second contribution of this work is the study of polar facts (i.e.
> objective information with polar connotations). Traditionally, the presence of polar facts has
> been neglected and research in polarity classification has been bounded to opinionated
> texts. We demonstrate the existence and importance of polar facts for the polarity classifica-
> tion of health information.

waylonli commented 2 months ago

Same issue here. Have you solved it?

Filimoa commented 2 months ago

Fixed with v0.5.7
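
For anyone who wants to confirm the behavior on their own PDFs after upgrading, here is a minimal sketch of a whitespace check. It is not part of open-parse: the helper names `whitespace_ratio` and `looks_collapsed` and the 5% threshold are assumptions made up for illustration. The two sample strings are taken from the report above.

```python
def whitespace_ratio(text: str) -> float:
    """Return the fraction of characters in `text` that are whitespace."""
    if not text:
        return 0.0
    return sum(ch.isspace() for ch in text) / len(text)


def looks_collapsed(text: str, threshold: float = 0.05) -> bool:
    """Heuristic: flag non-empty text whose whitespace share is below `threshold`.

    The 5% default is an illustrative assumption, not a value from open-parse.
    """
    return bool(text) and whitespace_ratio(text) < threshold


# First line of the parsed output from the report (no spaces at all).
broken = "Exploitinginformationinhealth-relatedsocialmediaservicesisofgreatinterestforpatients,"
# The same line as copied from a PDF reader.
expected = "Exploiting information in health-related social media services is of great interest for patients,"

print(looks_collapsed(broken))    # True
print(looks_collapsed(expected))  # False
```

The same check could be run over the `text` of each entry in `parsed_basic_doc.nodes` to spot affected nodes. Short nodes such as single-word headings will also be flagged, so treat it as a rough filter.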