Minor bugs in tokenisation / tagging #3

Closed stefano-bragaglia closed 9 years ago

stefano-bragaglia commented 9 years ago

Hi, I recently came across ClearNLP and decided to give it a try. I've used other NLP frameworks in the past and I am positively impressed by the overall quality of yours. I've found a few minor bugs, listed below, that I hope will help you make ClearNLP even better.

I used the dataset of sentences here and analysed them the way you suggest when showing how to use the API. I grouped the results by _POS tag_, counting the occurrences of each word (case-insensitively). These are the results I got:

POS-Tags Tokens
'' " (18)
, , (318), (5), / (3), worldwide. (1), (1)
-LRB- [ (62), ( (31)
-RRB- ] (62), ) (31)
. . (151), ... (1), ? (1)
: : (9), ; (3), (2)
ADD ɡreɪt (1)
CC and (166), or (47), but (7), either (5), Yet (1), both (1), ənd (1)
CD 1 (15), one (15), 3 (9), 2 (8), 4 (5), 5 (4), 6 (4), 8 (4), million (3), two (3), 10 (2), 11 (2), 12 (2), 14 (2), 1886 (2), 7 (2), 9 (2), four (2), three (2), 1/100 (1), 13 (1), 1397. (1), 1472. (1), 15 (1), 16 (1), 17 (1), 18 (1), 190 (1), 1901 (1), 1926 (1), 1952 (1), 1986. (1), 20 (1), 20-30 (1), 2000 (1), 2010 (1), 243,610 (1), 500 (1), 64.1 (1), 94,060 (1), billion (1), eight (1), fourteen (1)
DT the (269), A (122), an (17), this (16), These (10), another (7), some (6), Both (4), that (4), each (3), any (2), those (2), all (1), no (1)
EX there (2)
FW i.e. (2), commodity. (1), etc (1), societies. (1)
HYPH - (27), (2), (1)
IN of (189), in (116), to (52), as (35), for (34), by (31), with (22), on (20), from (17), between (12), at (10), while (8), during (5), that (5), Due (3), if (3), into (3), than (3), through (3), under (3), within (3), about (2), against (2), along (2), over (2), since (2), throughout (2), Although (1), Unlike (1), after (1), among (1), because (1), off (1), once (1), pairwise (1), so (1), towards (1), underlie (1), until (1), upon (1), whether (1), without (1), ˈbrɪtən (1)
JJ social (49), other (17), financial (15), such (13), many (10), public (8), first (6), German (5), important (5), ancient (4), complex (4), global (4), modern (4), same (4), specific (4), academic (3), amino (3), certain (3), computational (3), current (3), different (3), early (3), equal (3), genetic (3), governmental (3), individual (3), major (3), scientific (3), various (3), whole (3), 14th (2), 20th (2), Further (2), Human (2), New (2), accessible (2), behavioral (2), biological (2), constitutional (2), dependent (2), discrete (2), general (2), influential (2), international (2), judicial (2), large (2), later (2), latter (2), long (2), metabolic (2), national (2), nonsexual (2), northern (2), particular (2), practical (2), principal (2), real (2), regulatory (2), rich (2), several (2), sociological (2), sovereign (2), statistical (2), subject (2), theoretical (2), urban (2), 15th (1), 16th (1), 17th (1), 18th (1), 22nd-most (1), Abnormal (1), American (1), British (1), European (1), Italian (1), Judicious (1), Short (1), Strong (1), abstract (1), active (1), additional (1), adjacent (1), administrative (1), alternative (1), asymmetric (1), available (1), average (1), basic (1), callable (1), central (1), centralized (1), chemical (1), common (1), consistent (1), contemporary (1), critical (1), cultural (1), curious (1), detailed (1), devoid (1), dimensional (1), direct (1), dyadic (1), eastern (1), electrical (1), empirical (1), epistemological (1), executable (1), executive (1), famous (1), firm (1), flexible (1), formal (1), former (1), fractional (1), fundamental (1), generic (1), geographic (1), great (1), half (1), hermeneutic (1), institutional (1), interconnected (1), interdisciplinary (1), interested (1), internal (1), interpersonal (1), interpretative (1), intractable (1), key (1), legal (1), legislative (1), like (1), linear (1), linguistic (1), liquid (1), local (1), locomotive (1), logistical (1), macro (1), mammalian (1), mathematical (1), 
meaningless (1), medical (1), medieval (1), methodical (1), metropolitan (1), micro (1), mid-twentieth (1), military (1), minimum (1), monetary (1), much (1), multinational (1), multiple (1), nascent (1), natural (1), next (1), non-governmental (1), non-peptide (1), non-profit (1), notional (1), nucleotide (1), only (1), open (1), opposite (1), parliamentary (1), penal (1), peptide (1), pervasive (1), philosophic (1), physical (1), populous (1), posttranslational (1), powerful (1), prime (1), principled (1), professional (1), prosthetic (1), qualitative (1), quantitative (1), recent (1), regional (1), responsible (1), retail (1), rigorous (1), second (1), self-powered (1), small (1), socialist (1), societal (1), spatial (1), square (1), stable (1), standard (1), structural (1), structured (1), suburban (1), symmetric (1), systematic (1), threaded (1), traditional (1), true (1), typical (1), unneeded (1), unstable (1), usable (1), useful (1), vast (1), vehicular (1), visual (1), western (1), wheeled (1), wide (1)
JJR more (2), larger (1), less (1), smaller (1)
JJS largest (6), most (5), oldest (3), best (2), least (1)
MD can (16), may (13), should (2), might (1), will (1)
NFP (s) (1), :- (1), ;-) (1)
NN network (19), credit (16), theory (14), bank (13), Computer (11), graph (11), money (11), banking (9), car (9), loan (9), policy (9), programming (9), sociology (9), analysis (8), behavior (8), century (8), protein (8), subroutine (8), account (7), study (7), system (7), activity (6), amount (6), part (6), program (6), sequence (6), structure (6), use (6), world (6), acid (5), borrower (5), capital (5), lender (5), research (5), science (5), set (5), society (5), state (5), term (5), transportation (5), unit (5), amino (4), area (4), code (4), computation (4), cost (4), country (4), debt (4), default (4), entity (4), interest (4), land (4), number (4), party (4), place (4), time (4), vehicle (4), Internet (3), air (3), centrality (3), change (3), communication (3), concept (3), context (3), development (3), form (3), function (3), health (3), interaction (3), level (3), luxury (3), market (3), nature (3), north (3), object (3), period (3), protection (3), seller (3), subprogram (3), task (3), value (3), variety (3), year (3), Web (2), action (2), agency (2), animal (2), approach (2), automobile (2), balance (2), basis (2), border (2), city (2), class (2), customer (2), date (2), dei (2), deposit (2), discipline (2), driving (2), example (2), field (2), finance (2), folding (2), freedom (2), fuel (2), gasoline (2), gene (2), history (2), information (2), insurance (2), island (2), knowledge (2), language (2), law (2), lending (2), life (2), lifespan (2), maintenance (2), manufacturer (2), method (2), mobility (2), name (2), parking (2), physics (2), pollution (2), polypeptide (2), preference (2), range (2), rate (2), receivable (2), representation (2), response (2), return (2), risk (2), role (2), sense (2), spread (2), swap (2), trade (2), trading (2), transport (2), usage (2), way (2), 1–2 (1), 2007–200 (1), Archaeology (1), Bisociality (1), DNA (1), Government (1), List (1), Monosociality (1), P (1), Road (1), Subject (1), ability (1), access (1), acquisition 
(1), addition (1), agent (1), aggression (1), altruism (1), application (1), archaea (1), array (1), article (1), asset (1), association (1), auto (1), baggage (1), barter (1), behaviour (1), biology (1), birth (1), body (1), bond (1), branch (1), brand (1), business (1), buyer (1), call (1), care (1), case (1), cause (1), cell (1), cell. (1), centre (1), chain (1), coast (1), combustion (1), comfort (1), complexity (1), computing (1), conditioning (1), consumer (1), continuation (1), contract (1), contrast (1), convenience. (1), creation (1), creator (1), credere (1), creditor (1), creditworthiness (1), crisis (1), crossover (1), deal (1), debate (1), debtor (1), decision (1), defence (1), deflagration (1), demand (1), depreciation (1), description (1), design (1), destruction (1), deviance (1), diesel (1), disease (1), disorder (1), distinction (1), division (1), east (1), economy (1), edge (1), education (1), end (1), engine (1), engineering (1), entertainment (1), equity (1), ethanol (1), everyone (1), evidence (1), exchange (1), execution (1), existence (1), expression (1), feasibility (1), fee (1), flow (1), focus (1), foundation (1), funding (1), gas (1), generation (1), glossary (1), goal (1), governance. (1), grain (1), group (1), guide (1), headquarters (1), hierarchy (1), holder (1), importance (1), incentive (1), independence (1), individual (1), influence (1), infrastructure (1), injury (1), institution (1), instruction (1), intercity (1), intermediary (1), interplay (1), interurban (1), invention (1), inventor (1), investigation (1), job (1), justice (1), legislation (1), leisure (1), liability (1), liquidity (1), location (1), locomotive (1), machinery (1), macro. 
(1), mainland (1), making (1), manner (1), material (1), matter (1), meaning (1), mechanism (1), mechanization (1), memory (1), merchant (1), modelling (1), modification (1), monarch (1), monarchy (1), motor (1), motorcar (1), movement (1), navigation (1), nb (1), note (1), nothing (1), oder (1), order (1), organisation (1), organization (1), par (1), parenthesis (1), particle (1), passenger (1), payment (1), payment. (1), percent (1), person (1), perspective (1), petrol (1), physiology (1), point (1), popularity (1), population (1), portion (1), power (1), practice (1), predation (1), premium (1), price (1), principal (1), problem (1), procedure (1), process (1), processing (1), production (1), pronunciation (1), proof (1), prototype (1), provider (1), provision (1), psychology (1), purest (1), pyrrolysine (1), quality (1), rail (1), railroad (1), realisation (1), reallocation (1), receiver (1), regard (1), regulation (1), relation (1), reliability. (1), religion (1), repayment (1), representation. (1), reputation (1), reserve (1), responsibility (1), result (1), revenue (1), rise (1), routine (1), safety (1), scale (1), scapegoating (1), science. (1), scientist (1), sea (1), seating (1), secularisation (1), selling (1), sentence (1), service (1), sex (1), sexuality (1), sharing (1), sheet (1), size (1), slogan (1), software (1), solution (1), source (1), south (1), space (1), sq (1), stability (1), step (1), storage (1), stratification (1), structure. (1), support (1), syntax (1), synthesis (1), systems. (1), tax (1), today (1), tool (1), topic (1), traffic (1), transfer (1), translation (1), travel (1), trust (1), turn (1), turnover (1), type (1), umbrella (1), understanding (1), vertex (1), wealth (1), welfare (1), wellbeing (1), word (1), world. (1)
NNP Benz (9), UK (6), United (6), Europe (5), Mercedes (5), Britain (4), Ireland (4), Italy (4), Kingdom (4), Daimler (3), Great (3), Northern (3), Siena (3), China (2), Empire (2), English (2), Financial (2), Florence (2), India (2), Ireland. (2), Karl (2), London (2), Medici (2), Monte (2), Motorwagen (2), Ocean (2), Paschi (2), Patent (2), Renaissance (2), Republic (2), Roman (2), Sea (2), States (2), Union (2), di (2), (2), AG (1), Allen (1), America (1), Amsterdam (1), Analysis (1), Association (1), Assyria (1), Atlantic (1), Audi (1), BC (1), BMW (1), Babylonia (1), Baden (1), Bank (1), Bardi (1), Basel (1), Belfast (1), Berenberg (1), Berenbergs (1), Beste (1), Big (1), Cardiff (1), Channel (1), ClearNLP (1), Company (1), Comparing (1), Crown (1), Das (1), David (1), Douglas (1), Dutch (1), Edinburgh (1), Elizabeth (1), England (1), Euler (1), Europe. (1), European (1), Eurostat. (1), Falkland (1), February (1), Ford (1), Franklin (1), Gale (1), Genoa (1), Georg (1), Germany (1), Gesellschaft (1), Gibraltar (1), Gill (1), Giovanni (1), Greece (1), Guernsey (1), Gurusamy (1), Holy (1), II (1), Indian (1), Irish (1), Isle (1), Jacob (1), Jersey (1), Königsberg (1), Latin (1), Listeni (1), Man (1), Management (1), Maurice (1), Medicis (1), Model (1), Moreno (1), Motor (1), Motoren (1), North (1), Overseas (1), Peruzzi (1), Policy (1), Public (1), Queen (1), Scotland (1), Seven (1), Simmel (1), Soviet (1), Stanley (1), Stuttgart (1), T (1), Territory (1), U.S. (1), Venice (1), Wales (1), Western (1), Wheeler (1), Wide (1), Wilkes (1), World (1), Württemberg (1), mi (1), selenocysteine (1), ˈaɪərlənd (1), ˈnɔrðərn (1), ˈproʊti.ɨnz (1), ˈproʊˌtiːnz (1), (1)
NNPS Systems (2), Accords (1), Bridges (1), Fuggers (1), Islands (1), NICs. (1), Rothschilds (1), Services (1), Territories (1), Welsers (1)
NNS networks (10), Proteins (9), banks (8), cars (8), systems (8), relations (7), residues (7), benefits (5), countries (5), institutions (5), methods (5), objects (5), subroutines (5), vehicles (5), Applications (4), loans (4), people (4), sciences (4), Behaviors (3), Examples (3), approaches (3), cities (3), definitions (3), economies (3), fields (3), functions (3), goods (3), graphs (3), markets (3), mathematics (3), members (3), programs (3), regulations (3), resources (3), roads (3), species (3), structures (3), theories (3), vertices (3), acids (2), actors (2), automakers (2), bonds (2), branches (2), challenges (2), concepts (2), controls (2), costs (2), decisions (2), disciplines (2), dynamics (2), edges (2), entities (2), fuels (2), funds (2), genes (2), groups (2), humans (2), innovations (2), issues (2), languages (2), laws (2), libraries (2), lines (2), molecules (2), nodes (2), operations (2), opportunities (2), organisations (2), organisms (2), origins (2), others (2), parts (2), patterns (2), policies (2), problems (2), properties (2), researchers (2), restrictions (2), scholars (2), services (2), students (2), techniques (2), terms (2), things (2), transactions (2), 1930s (1), 1950s (1), 1980s. (1), Movements (1), Subprograms (1), accidents (1), actions (1), activities (1), acts (1), administrations (1), administrators (1), affiliations (1), algorithms (1), alternatives (1), analysis. (1), aqueducts (1), aspects (1), assets (1), automobiles (1), bits (1), books (1), borrowers. (1), buses (1), calls (1), capitals (1), carriages (1), carts (1), cells (1), centuries (1), chains (1), changes (1), citizenship. 
(1), classes (1), coaches (1), cofactors (1), complexes (1), computations (1), computers (1), constitutions (1), contracts (1), courses (1), covenants (1), customs (1), days (1), deaths (1), decades (1), deposits (1), developers (1), developments (1), dynasties (1), economics (1), educators (1), emoticons (1), failures (1), families (1), farmers (1), fees (1), fields. (1), focuses (1), goods. (1), graphics (1), ideas (1), implications (1), indicators (1), individuals (1), inhabitants. (1), institutions. (1), instructions (1), instruments (1), interpretations (1), investors (1), islands (1), kilometres (1), lawmakers (1), lenders (1), liabilities (1), macromolecules (1), magnates (1), makers (1), managers (1), masses (1), materials (1), measures (1), merchants (1), minutes (1), mɛʁˈt͡seːdəs (1), nations (1), networks. (1), nichts (1), numbers (1), obligations (1), organizations (1), origin. (1), paradigms (1), parties (1), passengers (1), peptides (1), pipes (1), places (1), planners (1), points (1), politicians (1), polypeptides (1), powers (1), practices (1), practitioners (1), priorities (1), procedures (1), processes (1), professors (1), railways (1), reactions (1), relationships (1), repairs (1), representatives (1), requirements (1), roots (1), routes (1), savers (1), savings (1), schools (1), services. (1), sexes (1), shyness. (1), smileys (1), societies (1), sociograms (1), sociologists (1), spaces (1), spheres (1), spreaders (1), standards (1), statistics (1), stimuli (1), streets (1), structures. (1), subfields (1), subjects (1), substrates (1), systems. (1), tabulations (1), tasks (1), taxes (1), telecommunications (1), temples (1), ties (1), times (1), topics (1), traders (1), transfers (1), triads (1), trucks (1), turns (1), types (1), universities (1), variations (1), warming. (1), ways (1), wheels (1), workers (1), years (1)
POS 's (10), ' (1), ˈbɛnt͡s (1)
PRP It (10), they (8), itself (4), them (4), I (3), One (1)
PRP$ its (15), their (6)
RB also (18), not (9), often (7), commonly (5), generally (4), directly (3), only (3), primarily (3), rapidly (3), sometimes (3), then (3), together (3), widely (3), However (2), Still (2), about (2), as (2), back (2), e.g. (2), first (2), highly (2), mathematically (2), much (2), putatively (2), rather (2), thereby (2), usually (2), well (2), 11th (1), 78th (1), Apart (1), Conversely (1), Further (1), Once (1), Perhaps (1), Shortly (1), after (1), analytically (1), around (1), basically (1), broadly (1), chemically (1), closely (1), computationally (1), continuously (1), correctly (1), dramatically (1), effectively (1), efficiently (1), especially (1), essentially (1), even (1), far (1), flexibly (1), formerly (1), gradually (1), immediately (1), increasingly (1), indirectly (1), inherently (1), initially (1), instead (1), just (1), loosely (1), necessarily (1), normally (1), notably (1), now (1), principally (1), rarely (1), respectively (1), second (1), separately (1), substantially (1), super (1), typically (1), ultimately (1), universally (1), up (1)
RBR more (5), Later (1), less (1), longer (1)
RBS most (3)
RP up (2)
SYM / (4)
TO to (32)
VB be (23), include (5), Refer (3), see (3), have (2), pay (2), repay (2), achieve (1), act (1), become (1), believe (1), cause (1), charge (1), climate (1), conduct (1), consist (1), denote (1), develop (1), distinguish (1), encourage (1), engage (1), ensure (1), examine (1), exist (1), form (1), identify (1), induce (1), let (1), locate (1), measure (1), model (1), move (1), operate (1), place (1), provide (1), reduce (1), reflect (1), reimburse (1), require (1), return (1), run (1), serve (1), solve (1), specify (1), study (1), support (1), work (1)
VBD was (6), were (4), had (3), called (2), caused (2), made (2), took (2), accepted (1), added (1), appeared (1), applied (1), authored (1), became (1), began (1), built (1), carried (1), changed (1), deposited (1), did (1), dominated (1), emerged (1), led (1), provoked (1), referred (1), replaced (1), termed (1)
VBG including (5), being (3), making (3), According (2), Acting (2), developing (2), existing (2), increasing (2), lending (2), maintaining (2), using (2), writing (2), Lying (1), acquiring (1), analyzing (1), ascending (1), catalyzing (1), compiling (1), comprising (1), concerning (1), consisting (1), containing (1), contributing (1), corresponding (1), creating (1), describing (1), disposing (1), emphasizing (1), encompassing (1), establishing (1), explaining (1), funding (1), gaining (1), generating (1), granting (1), identifying (1), implementing (1), issuing (1), living (1), loaning (1), meaning (1), operating (1), reaching (1), refining (1), replicating (1), resolving (1), responding (1), resulting (1), taking (1), transporting (1), underlying (1), varying (1)
VBN used (11), known (8), based (6), called (6), been (3), considered (3), developed (3), directed (3), made (3), provided (3), applied (2), attached (2), credited (2), defined (2), degraded (2), designed (2), encoded (2), estimated (2), expanded (2), followed (2), performed (2), recorded (2), regarded (2), traded (2), accepted (1), added (1), adopted (1), affected (1), associated (1), authorized (1), balanced (1), blamed (1), bonded (1), closed (1), coded (1), collected (1), composed (1), constructed (1), contrasted (1), deferred (1), denominated (1), denoted (1), deposited (1), derived (1), described (1), devolved (1), dictated (1), disputed (1), divided (1), done (1), drawn (1), electrified (1), embodied (1), employed (1), enforced (1), equipped (1), established (1), evidenced (1), evolved (1), extended (1), formalized (1), formed (1), fueled (1), funded (1), given (1), headquartered (1), institutionalised (1), intended (1), lent (1), manufactured (1), measured (1), misfolded (1), modified (1), obligated (1), observed (1), organized (1), oriented (1), owed (1), packaged (1), perceived (1), played (1), powered (1), promulgated (1), propelled (1), recycled (1), referenced (1), regulated (1), related (1), risen (1), seen (1), started (1), studied (1), surrounded (1), taken (1), targeted (1), traced (1), transcribed (1), undirected (1), used. (1), weighed (1), withdrawn (1)
VBP are (39), include (7), have (6), focus (2), perform (2), 'm (1), add (1), associate (1), connect (1), define (1), depend (1), differ (1), draw (1), emphasize (1), exchange (1), exist (1), hold (1), identify (1), increase (1), provoke (1), study (1), value (1), viewpoint (1)
VBZ is (80), has (10), allows (3), provides (3), considers (2), describes (2), does (2), includes (2), receives (2), refers (2), represents (2), specifies (2), takes (2), uses (2), alters (1), arranges (1), begins (1), behaves (1), borrows (1), concerns (1), consists (1), contains (1), covers (1), creates (1), delivers (1), dependencies (1), detects (1), determines (1), encompasses (1), entails (1), explores (1), focuses (1), forms (1), generates (1), happens (1), informs (1), investigates (1), involves (1), lies (1), means (1), oligopeptides (1), pays (1), permits (1), ranges (1), results (1), shares (1), shows (1), specializes (1), suggests (1), traces (1), varies (1)
WDT which (31), that (17)
WP What (2), who (1)
WRB where (2), wherever (1)
XX (s) (1)
`` " (17)
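
For anyone who wants to reproduce a tally like the table above, here is a minimal sketch. It assumes the tagger output has already been converted into plain `(word form, POS tag)` pairs; the names `tally_by_tag` and `format_row` are hypothetical helpers and not part of ClearNLP. Unlike the table above, which keeps one cased spelling per word, this sketch reports every form in lower case.

```python
from collections import Counter, defaultdict


def tally_by_tag(tagged_tokens):
    """Group (form, tag) pairs by tag and count forms case-insensitively."""
    counts = defaultdict(Counter)
    for form, tag in tagged_tokens:
        counts[tag][form.lower()] += 1
    return counts


def format_row(tag, counter):
    """Render one row: tag, then forms by descending count, ties alphabetical."""
    items = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    return tag + " " + ", ".join(f"{form} ({n})" for form, n in items)


# Hypothetical tagger output, used only to exercise the helpers.
sample = [
    ("The", "DT"), ("bank", "NN"), ("lends", "VBZ"),
    ("the", "DT"), ("money", "NN"), (".", "."),
]

table = tally_by_tag(sample)
for tag in sorted(table):
    print(format_row(tag, table[tag]))
```

Grouping by tag first makes suspicious entries easy to spot, such as a token with a trailing period appearing under a noun tag.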

And here are the bugs:

And the following are probably only due to the specific set of documents used for training (so not really bugs):

As a side note, when you can't train your own dictionary, I suggest always loading all the files for the default dictionary, even though it uses a lot of memory (12 GB on my machine) and takes quite some time, because the quality of the NLP output improves significantly.

I hope this helps. Thanks for maintaining this framework, and keep up the great work!

jdchoi77 commented 9 years ago


Please take a look at the output from the new version for the data you posted. Thanks for pointing these out. Please let me know if you find more bugs.



stefano-bragaglia commented 9 years ago

Thanks a lot: I just tried out version 3.1.2 with dictionary version 3.1 and got the results you pointed out: same precision, half the loading time, and half the heap usage (6 GB vs 12 GB for the general dictionary).

However, ( and [ are still both tagged -LRB- (and similarly ) and ] are both tagged -RRB-), whereas they should be recognised as -LRB- and -LSB- (-RRB- and -RSB-) respectively.

Just in case I didn't express myself clearly enough, consider the following:

TEXT POS_TAG Description
( -LRB- Left round bracket
[ -LSB- Left square bracket
{ -LCB- Left curly bracket
) -RRB- Right round bracket
] -RSB- Right square bracket
} -RCB- Right curly bracket

Sometimes it is important to distinguish between them, for instance when dealing with Wikipedia pages: round brackets are often used to clarify some concept, while square brackets are used to introduce references or footnotes. The following sentence is a good example of both:

Credit (from Latin credere translation. "to believe") encompasses any form of deferred payment.[1]

Congratulations again on the wonderful software!

jdchoi77 commented 9 years ago


The POS tags for left and right brackets of any kind are -LRB- and -RRB- under the Penn Treebank guidelines, which is why ClearNLP produces those tags. They can still be distinguished by their word forms, though. Please let me know if you have more suggestions on ClearNLP. Thanks!
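
Distinguishing brackets by word form can be done as a small post-processing step. The following is only a sketch: it assumes the tagger output is available as `(word form, POS tag)` pairs, and `refine_bracket_tags` is a hypothetical helper, not a ClearNLP function. The fine-grained tag names are the ones proposed in the table earlier in this thread.

```python
# Fine-grained bracket tags keyed by word form, as proposed earlier in the thread.
BRACKET_TAGS = {
    "(": "-LRB-", "[": "-LSB-", "{": "-LCB-",
    ")": "-RRB-", "]": "-RSB-", "}": "-RCB-",
}


def refine_bracket_tags(tagged_tokens):
    """Replace generic -LRB-/-RRB- tags with a tag derived from the word form."""
    refined = []
    for form, tag in tagged_tokens:
        if tag in ("-LRB-", "-RRB-"):
            # Unknown forms keep the tag the tagger assigned.
            tag = BRACKET_TAGS.get(form, tag)
        refined.append((form, tag))
    return refined


# Hypothetical, abridged tagger output for the example sentence in this thread.
tagged = [
    ("Credit", "NN"), ("(", "-LRB-"), ("from", "IN"), (")", "-RRB-"),
    ("payment", "NN"), (".", "."), ("[", "-LRB-"), ("1", "CD"), ("]", "-RRB-"),
]

for form, tag in refine_bracket_tags(tagged):
    print(form, tag)
```

This leaves the tagger's Penn Treebank output untouched and layers the finer distinction on top, so code that expects the standard tags can keep using the original pairs.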



