Schwittleymani / ECO

Electronic Chaos Oracle
https://schwittlick.net/eco
Apache License 2.0

Refine Text parser #166

Closed schwittlick closed 7 years ago

schwittlick commented 7 years ago

add another rule:

if any one of the characters of a sentence is not within the ascii range, drop the sentence

# char is a control character (< 32) or beyond the ascii range (> 127)
if ord(char) > 127 or ord(char) < 32:
    drop_sentence()

maybe make a whitelist of allowed characters based on this list: http://www.ascii-code.com/
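A minimal sketch of the whitelist idea, assuming the allowed set is ASCII letters, digits, punctuation and the space character. The names `ALLOWED_CHARS`, `is_clean` and `filter_sentences` are hypothetical and not part of the ECO parser; the actual whitelist would have to be picked from the linked table.

```python
import string

# Assumed whitelist: ASCII letters, digits, punctuation and the space character.
ALLOWED_CHARS = frozenset(
    string.ascii_letters + string.digits + string.punctuation + " "
)


def is_clean(sentence, allowed=ALLOWED_CHARS):
    """Return True if every character of the sentence is whitelisted."""
    return all(char in allowed for char in sentence)


def filter_sentences(sentences, allowed=ALLOWED_CHARS):
    """Keep only the sentences made up entirely of whitelisted characters."""
    return [s for s in sentences if is_clean(s, allowed)]
```

A whitelist is stricter than the `ord()` range check: it also rejects tabs and other control characters, and it can be narrowed further (for example by removing `~` or `_`) without touching the filter logic.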

COMMENT: maybe we can keep the words & sentences. convert them to ascii, replacing the utf-8 (or unicode, no idea!) characters. then the words are crippled, but surely they can be straightened out again? look for some magic ml power. charRNN, ...?
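One possible reading of the ASCII conversion idea in the comment above, sketched with the standard library only. The function name `to_ascii` is hypothetical. It covers the conversion step, not any ML-based repair.

```python
import unicodedata


def to_ascii(text):
    """Decompose accented characters and drop whatever is still non-ASCII."""
    # NFKD splits "ì" into "i" plus a combining accent and expands ligatures.
    decomposed = unicodedata.normalize("NFKD", text)
    # Combining accents and characters without an ASCII form are dropped here.
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

Accented letters and ligatures survive as their base letters, but characters without a decomposition (curly quotes, for instance) are simply removed, so some words would still come out damaged.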

schwittlick commented 7 years ago

and remove all digits
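A small sketch of the digit rule, assuming it means deleting the digit characters and keeping the rest of the sentence. `remove_digits` is a hypothetical helper name.

```python
import re

# [0-9] is used instead of \d because \d also matches non-ASCII digits in Python 3.
DIGITS = re.compile(r"[0-9]+")


def remove_digits(sentence):
    """Delete every run of ASCII digits from the sentence."""
    return DIGITS.sub("", sentence)
```

This leaves the surrounding punctuation and whitespace behind (`62.47` becomes `.`), so a later cleanup pass may still be needed.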

schwittlick commented 7 years ago

remove sentences that start with punctuation: ..,_
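A rough sketch of the leading-punctuation rule, assuming "punctuation" means the ASCII punctuation in `string.punctuation` and that leading spaces are ignored. The function names are hypothetical.

```python
import string

# Assumption: "punctuation" means the ASCII punctuation in string.punctuation.
PUNCTUATION = frozenset(string.punctuation)


def starts_with_punctuation(sentence):
    """Return True if the first non-space character is ASCII punctuation."""
    stripped = sentence.lstrip()
    return bool(stripped) and stripped[0] in PUNCTUATION


def drop_leading_punctuation(sentences):
    """Keep only the sentences that do not start with punctuation."""
    return [s for s in sentences if not starts_with_punctuation(s)]
```

Non-ASCII marks such as `·` or curly quotes are not in `string.punctuation`, so sentences starting with those would have to be caught by the ASCII rule instead.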

example sentences that passed the parser in library_and_archive_theory/adams,\ thomas-r-a-new-model-for-the-study-of-the-book_valid.txt:

..,_lcdge or inspiration that outlives the time in which they were fìrst conceived or written .
ils ~ ~ its outward or inward ~ cw small .
: plains their existence and ~ f this , which we consider
· own small patcb , without The most significant recent ; igure S .
liàil • ~ rstanding of 4 pmWi.Wng achieved ib p atively small aspe
Luc:ien Febvre and Henr :( oming ef the Booi : ; The lm ?
‘ ‘ tight ’ ’ internal structure whereas others allow member libraries to opt in or out of particular purchasing deals or other collaboration .

another issue is weird characters, like in digital_and_internet_theory/Kahin-ed-Brian-Advancing-Knowledge-and-Knowledge-Economy_valid.txt:

The value of knowledge can lie in its â<U+0080><U+0098>â<U+0080><U+0098>inďŹ<U+0081>nite expansibilityâ<U+0080><U+0099>â<U+0080><U+0099>â<U+0080><U+0094>or in its novelty and enforced scarcity .
Directorate for ScientiďŹ<U+0081>c Affairs , DAS/ PD/62.47 , Paris : OECD .
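The pattern in these two lines looks like UTF-8 text that was decoded with a single-byte codec. `ďŹ<U+0081>` matches the bytes of the "fi" ligature read as ISO-8859-2, but that diagnosis is a guess and not something confirmed in this thread. Under that assumption, and assuming the file holds the real control characters and not the literal `<U+0080>` escapes, a repair could be sketched like this (`repair_mojibake` is a hypothetical name):

```python
import unicodedata


def repair_mojibake(text, wrong_encoding="iso8859_2"):
    """Undo a UTF-8-read-as-wrong_encoding mix-up; return text unchanged on failure."""
    try:
        # Recover the original bytes, then decode them as the UTF-8 they were.
        repaired = text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text does not fit the assumed mix-up, so leave it alone.
        return text
    # NFKC turns compatibility characters such as the "fi" ligature into plain letters.
    return unicodedata.normalize("NFKC", repaired)
```

Sentences that do not fit the assumed mix-up come back unchanged, so this could run before the ASCII filter without dropping anything by itself.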

more weird special characters and unfinished sentences in arts_arthistory_aesthetics/Holmes-Hieroglyphs\ of\ the\ Future\ (2003)_valid.txt:

There is room , in the networked world , for �
; : ; through public spending ; but the durable factor prohibiting any
rpart to the NATO fantasy of war without casualties .
of a precise surgical operation sustained by the ideology ofgloballlictimization .
the NATO bombardment - the faked camil'alization of the war ...
this obscene camil'alization of social life is effectil'ely the other .

shitload of unfinished sentences in arts_arthistory_aesthetics/Jones-Caroline-Reconstituting-Systems-Art-Hans-Haacke-1967_valid.txt:

Haacke was not alone in this impulse , of course , as Robert
even more certain with a second version of this article pub -
ers organized to support the protest by kinetic artist Takis ,
Smithson took friends on outings to New Jersey , and perfor -
tice of Olafur Eliasson , who has studied Haacke ’ s systems
scholarship of Meg Rotzel at MIT for revealing this fascinat -
his Software exhibition at the Jewish Museum , New York ,
was rather incidental in a sequence of works of nearly sci -
and will be part of society at large , interacting with it .
in the singular plural of the spectator ’ s body : after all , both
probably became clear to me in the mid-sixties , but I had
Notably , the reviews of Haacke are on pages of the student
beauty and rainbow exist only in the eye of the beholder ,
of Ithaca Falls : Freezing and Melting on Rope , February
ring to the photographs he took of visitors , art handlers , and
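
The unfinished sentences listed above mostly end in a comma, a dangling hyphen or a bare word. A minimal sketch of a filter for that case, assuming a finished sentence ends with `.`, `!` or `?` (`SENTENCE_ENDINGS`, `is_finished` and `drop_unfinished` are hypothetical names):

```python
# Assumption: a complete sentence ends with one of these characters.
SENTENCE_ENDINGS = (".", "!", "?")


def is_finished(sentence):
    """Return True if the sentence ends with terminal punctuation."""
    return sentence.rstrip().endswith(SENTENCE_ENDINGS)


def drop_unfinished(sentences):
    """Keep only the sentences that end with terminal punctuation."""
    return [s for s in sentences if is_finished(s)]
```

This only looks at the end of a line. Sentences that are cut off at the start (like `rpart to the NATO fantasy of war without casualties .`) would still pass and need a separate check, for example on the first character being uppercase.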

schwittlick commented 7 years ago

parsed everything new and put it here:

/home/marcel/drive/data/eco/NAIL_DATAFIELD_txt/parsed_v3/parsed_v3_valid_combined.txt

includes our own pdf collection as well. statistics here: statistics_v3