Closed stefano-bragaglia closed 9 years ago
Stefano,
Please take a look at the output from the new version for the data you posted. Thanks for pointing these out. Please let me know if you find more bugs.
best,
Jinho
Thanks a lot: I just tried out version 3.1.2 with dictionary version 3.1 and I got the results you pointed out: same precision, half the loading time, and half the heap usage (6 GB vs 12 GB for the general dictionary).
However, `[` is still tagged `-LRB-`, just like `(` (and similarly `]` is tagged `-RRB-`, just like `)`), while `(` and `[` should be tagged `-LRB-` and `-LSB-` (and `)` and `]` as `-RRB-` and `-RSB-`) respectively.
Just in case I couldn't express myself clearly enough, consider the following:

| TEXT | POS_TAG | Description |
|---|---|---|
| `(` | `-LRB-` | Left round bracket |
| `[` | `-LSB-` | Left square bracket |
| `{` | `-LCB-` | Left curly bracket |
| `)` | `-RRB-` | Right round bracket |
| `]` | `-RSB-` | Right square bracket |
| `}` | `-RCB-` | Right curly bracket |
Sometimes it is important to distinguish between them, for instance when dealing with Wikipedia pages: round brackets are often used to clarify a concept, while square brackets are used to introduce references or footnotes. The following sentence is a good example of both:
Credit (from Latin credere translation. "to believe") encompasses any form of deferred payment.[1]
Congratulations again for the wonderful software!
Stefano,
According to the Penn Treebank guidelines, the POS tags for all kinds of left and right brackets are -LRB- and -RRB-, which is why ClearNLP produces those tags. The bracket types can still be distinguished by their word forms, though. Please let me know if you have more suggestions on ClearNLP. Thanks!
best,
Jinho
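Distinguishing the bracket types by word form, as suggested above, can be sketched as a small post-processing step. This is a minimal sketch that does not use ClearNLP's API: `refine_bracket_tag` and `BRACKET_TAGS` are hypothetical names, and the function assumes the caller has already extracted each token's word form and POS tag as plain strings.

```python
# Hypothetical post-processing helper; not part of ClearNLP.
# Maps a bracket's word form to the bracket-specific tag from the table above.
BRACKET_TAGS = {
    "(": "-LRB-", ")": "-RRB-",
    "[": "-LSB-", "]": "-RSB-",
    "{": "-LCB-", "}": "-RCB-",
}


def refine_bracket_tag(word_form: str, pos_tag: str) -> str:
    """Return a bracket-specific tag when the tagger emitted -LRB- or -RRB-.

    Any other tag, or an unrecognised word form, is returned unchanged.
    """
    if pos_tag in ("-LRB-", "-RRB-"):
        return BRACKET_TAGS.get(word_form, pos_tag)
    return pos_tag
```

For example, a token with word form `[` and tag `-LRB-` would come back as `-LSB-`, while round brackets keep the Penn Treebank tags.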
Hi, I recently came across ClearNLP and decided to give it a try. I've used other NLP frameworks in the past and I am impressed by the overall quality of yours. I've found a few small bugs that I'm going to list below, hoping they will help you make ClearNLP even better.
I've used the dataset of sentences here and analysed them the way you suggest where you show how to use the API. I arranged the results by _POS Tag_, counting the occurrences of each word (case insensitive). These are the results that I got:
- `''`: `"` (18)
- `,`: `,` (318), `—` (5), `/` (3), `worldwide.` (1), `–` (1)
- `-LRB-`: `[` (62), `(` (31)
- `-RRB-`: `]` (62), `)` (31)
- `.`: `.` (151), `...` (1), `?` (1)
- `:`: `:` (9), `;` (3), `—` (2)
- `ADD`: ɡreɪt (1)
- `CC`: and (166), or (47), but (7), either (5), Yet (1), both (1), ənd (1)
- `CD`: 1 (15), one (15), 3 (9), 2 (8), 4 (5), 5 (4), 6 (4), 8 (4), million (3), two (3), 10 (2), 11 (2), 12 (2), 14 (2), 1886 (2), 7 (2), 9 (2), four (2), three (2), 1/100 (1), 13 (1), 1397. (1), 1472. (1), 15 (1), 16 (1), 17 (1), 18 (1), 190 (1), 1901 (1), 1926 (1), 1952 (1), 1986. (1), 20 (1), 20-30 (1), 2000 (1), 2010 (1), 243,610 (1), 500 (1), 64.1 (1), 94,060 (1), billion (1), eight (1), fourteen (1)
- `DT`: the (269), A (122), an (17), this (16), These (10), another (7), some (6), Both (4), that (4), each (3), any (2), those (2), all (1), no (1)
- `EX`: there (2)
- `FW`: i.e. (2), commodity. (1), etc (1), societies. (1)
- `HYPH`: `-` (27), `–` (2), `—` (1)
- `IN`: of (189), in (116), to (52), as (35), for (34), by (31), with (22), on (20), from (17), between (12), at (10), while (8), during (5), that (5), Due (3), if (3), into (3), than (3), through (3), under (3), within (3), about (2), against (2), along (2), over (2), since (2), throughout (2), Although (1), Unlike (1), after (1), among (1), because (1), off (1), once (1), pairwise (1), so (1), towards (1), underlie (1), until (1), upon (1), whether (1), without (1), ˈbrɪtən (1)
- `JJ`: social (49), other (17), financial (15), such (13), many (10), public (8), first (6), German (5), important (5), ancient (4), complex (4), global (4), modern (4), same (4), specific (4), academic (3), amino (3), certain (3), computational (3), current (3), different (3), early (3), equal (3), genetic (3), governmental (3), individual (3), major (3), scientific (3), various (3), whole (3), 14th (2), 20th (2), Further (2), Human (2), New (2), accessible (2), behavioral (2), biological (2), constitutional (2), dependent (2), discrete (2), general (2), influential (2), international (2), judicial (2), large (2), later (2), latter (2), long (2), metabolic (2), national (2), nonsexual (2), northern (2), particular (2), practical (2), principal (2), real (2), regulatory (2), rich (2), several (2), sociological (2), sovereign (2), statistical (2), subject (2), theoretical (2), urban (2), 15th (1), 16th (1), 17th (1), 18th (1), 22nd-most (1), Abnormal (1), American (1), British (1), European (1), Italian (1), Judicious (1), Short (1), Strong (1), abstract (1), active (1), additional (1), adjacent (1), administrative (1), alternative (1), asymmetric (1), available (1), average (1), basic (1), callable (1), central (1), centralized (1), chemical (1), common (1), consistent (1), contemporary (1), critical (1), cultural (1), curious (1), detailed (1), devoid (1), dimensional (1), direct (1), dyadic (1), eastern (1), electrical (1), empirical (1), epistemological (1), executable (1), executive (1), famous (1), firm (1), flexible (1), formal (1), former (1), fractional (1), fundamental (1), generic (1), geographic (1), great (1), half (1), hermeneutic (1), institutional (1), interconnected (1), interdisciplinary (1), interested (1), internal (1), interpersonal (1), interpretative (1), intractable (1), key (1), legal (1), legislative (1), like (1), linear (1), linguistic (1), liquid (1), local (1), locomotive (1), logistical (1), macro (1), mammalian (1), mathematical (1), meaningless (1), medical (1), medieval (1), methodical (1), metropolitan (1), micro (1), mid-twentieth (1), military (1), minimum (1), monetary (1), much (1), multinational (1), multiple (1), nascent (1), natural (1), next (1), non-governmental (1), non-peptide (1), non-profit (1), notional (1), nucleotide (1), only (1), open (1), opposite (1), parliamentary (1), penal (1), peptide (1), pervasive (1), philosophic (1), physical (1), populous (1), posttranslational (1), powerful (1), prime (1), principled (1), professional (1), prosthetic (1), qualitative (1), quantitative (1), recent (1), regional (1), responsible (1), retail (1), rigorous (1), second (1), self-powered (1), small (1), socialist (1), societal (1), spatial (1), square (1), stable (1), standard (1), structural (1), structured (1), suburban (1), symmetric (1), systematic (1), threaded (1), traditional (1), true (1), typical (1), unneeded (1), unstable (1), usable (1), useful (1), vast (1), vehicular (1), visual (1), western (1), wheeled (1), wide (1)
- `JJR`: more (2), larger (1), less (1), smaller (1)
- `JJS`: largest (6), most (5), oldest (3), best (2), least (1)
- `MD`: can (16), may (13), should (2), might (1), will (1)
- `NFP`: `(s)` (1), `:-` (1), `;-)` (1)
- `NN`: network (19), credit (16), theory (14), bank (13), Computer (11), graph (11), money (11), banking (9), car (9), loan (9), policy (9), programming (9), sociology (9), analysis (8), behavior (8), century (8), protein (8), subroutine (8), account (7), study (7), system (7), activity (6), amount (6), part (6), program (6), sequence (6), structure (6), use (6), world (6), acid (5), borrower (5), capital (5), lender (5), research (5), science (5), set (5), society (5), state (5), term (5), transportation (5), unit (5), amino (4), area (4), code (4), computation (4), cost (4), country (4), debt (4), default (4), entity (4), interest (4), land (4), number (4), party (4), place (4), time (4), vehicle (4), Internet (3), air (3), centrality (3), change (3), communication (3), concept (3), context (3), development (3), form (3), function (3), health (3), interaction (3), level (3), luxury (3), market (3), nature (3), north (3), object (3), period (3), protection (3), seller (3), subprogram (3), task (3), value (3), variety (3), year (3), Web (2), action (2), agency (2), animal (2), approach (2), automobile (2), balance (2), basis (2), border (2), city (2), class (2), customer (2), date (2), dei (2), deposit (2), discipline (2), driving (2), example (2), field (2), finance (2), folding (2), freedom (2), fuel (2), gasoline (2), gene (2), history (2), information (2), insurance (2), island (2), knowledge (2), language (2), law (2), lending (2), life (2), lifespan (2), maintenance (2), manufacturer (2), method (2), mobility (2), name (2), parking (2), physics (2), pollution (2), polypeptide (2), preference (2), range (2), rate (2), receivable (2), representation (2), response (2), return (2), risk (2), role (2), sense (2), spread (2), swap (2), trade (2), trading (2), transport (2), usage (2), way (2), 1–2 (1), 2007–200 (1), Archaeology (1), Bisociality (1), DNA (1), Government (1), List (1), Monosociality (1), P (1), Road (1), Subject (1), ability (1), access (1), acquisition (1), addition (1), agent (1), aggression (1), altruism (1), application (1), archaea (1), array (1), article (1), asset (1), association (1), auto (1), baggage (1), barter (1), behaviour (1), biology (1), birth (1), body (1), bond (1), branch (1), brand (1), business (1), buyer (1), call (1), care (1), case (1), cause (1), cell (1), cell. (1), centre (1), chain (1), coast (1), combustion (1), comfort (1), complexity (1), computing (1), conditioning (1), consumer (1), continuation (1), contract (1), contrast (1), convenience. (1), creation (1), creator (1), credere (1), creditor (1), creditworthiness (1), crisis (1), crossover (1), deal (1), debate (1), debtor (1), decision (1), defence (1), deflagration (1), demand (1), depreciation (1), description (1), design (1), destruction (1), deviance (1), diesel (1), disease (1), disorder (1), distinction (1), division (1), east (1), economy (1), edge (1), education (1), end (1), engine (1), engineering (1), entertainment (1), equity (1), ethanol (1), everyone (1), evidence (1), exchange (1), execution (1), existence (1), expression (1), feasibility (1), fee (1), flow (1), focus (1), foundation (1), funding (1), gas (1), generation (1), glossary (1), goal (1), governance. (1), grain (1), group (1), guide (1), headquarters (1), hierarchy (1), holder (1), importance (1), incentive (1), independence (1), individual (1), influence (1), infrastructure (1), injury (1), institution (1), instruction (1), intercity (1), intermediary (1), interplay (1), interurban (1), invention (1), inventor (1), investigation (1), job (1), justice (1), legislation (1), leisure (1), liability (1), liquidity (1), location (1), locomotive (1), machinery (1), macro. (1), mainland (1), making (1), manner (1), material (1), matter (1), meaning (1), mechanism (1), mechanization (1), memory (1), merchant (1), modelling (1), modification (1), monarch (1), monarchy (1), motor (1), motorcar (1), movement (1), navigation (1), nb (1), note (1), nothing (1), oder (1), order (1), organisation (1), organization (1), par (1), parenthesis (1), particle (1), passenger (1), payment (1), payment. (1), percent (1), person (1), perspective (1), petrol (1), physiology (1), point (1), popularity (1), population (1), portion (1), power (1), practice (1), predation (1), premium (1), price (1), principal (1), problem (1), procedure (1), process (1), processing (1), production (1), pronunciation (1), proof (1), prototype (1), provider (1), provision (1), psychology (1), purest (1), pyrrolysine (1), quality (1), rail (1), railroad (1), realisation (1), reallocation (1), receiver (1), regard (1), regulation (1), relation (1), reliability. (1), religion (1), repayment (1), representation. (1), reputation (1), reserve (1), responsibility (1), result (1), revenue (1), rise (1), routine (1), safety (1), scale (1), scapegoating (1), science. (1), scientist (1), sea (1), seating (1), secularisation (1), selling (1), sentence (1), service (1), sex (1), sexuality (1), sharing (1), sheet (1), size (1), slogan (1), software (1), solution (1), source (1), south (1), space (1), sq (1), stability (1), step (1), storage (1), stratification (1), structure. (1), support (1), syntax (1), synthesis (1), systems. (1), tax (1), today (1), tool (1), topic (1), traffic (1), transfer (1), translation (1), travel (1), trust (1), turn (1), turnover (1), type (1), umbrella (1), understanding (1), vertex (1), wealth (1), welfare (1), wellbeing (1), word (1), world. (1)
- `NNP`: Benz (9), UK (6), United (6), Europe (5), Mercedes (5), Britain (4), Ireland (4), Italy (4), Kingdom (4), Daimler (3), Great (3), Northern (3), Siena (3), China (2), Empire (2), English (2), Financial (2), Florence (2), India (2), Ireland. (2), Karl (2), London (2), Medici (2), Monte (2), Motorwagen (2), Ocean (2), Paschi (2), Patent (2), Renaissance (2), Republic (2), Roman (2), Sea (2), States (2), Union (2), di (2), `—` (2), AG (1), Allen (1), America (1), Amsterdam (1), Analysis (1), Association (1), Assyria (1), Atlantic (1), Audi (1), BC (1), BMW (1), Babylonia (1), Baden (1), Bank (1), Bardi (1), Basel (1), Belfast (1), Berenberg (1), Berenbergs (1), Beste (1), Big (1), Cardiff (1), Channel (1), ClearNLP (1), Company (1), Comparing (1), Crown (1), Das (1), David (1), Douglas (1), Dutch (1), Edinburgh (1), Elizabeth (1), England (1), Euler (1), Europe. (1), European (1), Eurostat. (1), Falkland (1), February (1), Ford (1), Franklin (1), Gale (1), Genoa (1), Georg (1), Germany (1), Gesellschaft (1), Gibraltar (1), Gill (1), Giovanni (1), Greece (1), Guernsey (1), Gurusamy (1), Holy (1), II (1), Indian (1), Irish (1), Isle (1), Jacob (1), Jersey (1), Königsberg (1), Latin (1), Listeni (1), Man (1), Management (1), Maurice (1), Medicis (1), Model (1), Moreno (1), Motor (1), Motoren (1), North (1), Overseas (1), Peruzzi (1), Policy (1), Public (1), Queen (1), Scotland (1), Seven (1), Simmel (1), Soviet (1), Stanley (1), Stuttgart (1), T (1), Territory (1), U.S. (1), Venice (1), Wales (1), Western (1), Wheeler (1), Wide (1), Wilkes (1), World (1), Württemberg (1), mi (1), selenocysteine (1), ˈaɪərlənd (1), ˈnɔrðərn (1), ˈproʊti.ɨnz (1), ˈproʊˌtiːnz (1), `–` (1)
- `NNPS`: Systems (2), Accords (1), Bridges (1), Fuggers (1), Islands (1), NICs. (1), Rothschilds (1), Services (1), Territories (1), Welsers (1)
- `NNS`: networks (10), Proteins (9), banks (8), cars (8), systems (8), relations (7), residues (7), benefits (5), countries (5), institutions (5), methods (5), objects (5), subroutines (5), vehicles (5), Applications (4), loans (4), people (4), sciences (4), Behaviors (3), Examples (3), approaches (3), cities (3), definitions (3), economies (3), fields (3), functions (3), goods (3), graphs (3), markets (3), mathematics (3), members (3), programs (3), regulations (3), resources (3), roads (3), species (3), structures (3), theories (3), vertices (3), acids (2), actors (2), automakers (2), bonds (2), branches (2), challenges (2), concepts (2), controls (2), costs (2), decisions (2), disciplines (2), dynamics (2), edges (2), entities (2), fuels (2), funds (2), genes (2), groups (2), humans (2), innovations (2), issues (2), languages (2), laws (2), libraries (2), lines (2), molecules (2), nodes (2), operations (2), opportunities (2), organisations (2), organisms (2), origins (2), others (2), parts (2), patterns (2), policies (2), problems (2), properties (2), researchers (2), restrictions (2), scholars (2), services (2), students (2), techniques (2), terms (2), things (2), transactions (2), 1930s (1), 1950s (1), 1980s. (1), Movements (1), Subprograms (1), accidents (1), actions (1), activities (1), acts (1), administrations (1), administrators (1), affiliations (1), algorithms (1), alternatives (1), analysis. (1), aqueducts (1), aspects (1), assets (1), automobiles (1), bits (1), books (1), borrowers. (1), buses (1), calls (1), capitals (1), carriages (1), carts (1), cells (1), centuries (1), chains (1), changes (1), citizenship. (1), classes (1), coaches (1), cofactors (1), complexes (1), computations (1), computers (1), constitutions (1), contracts (1), courses (1), covenants (1), customs (1), days (1), deaths (1), decades (1), deposits (1), developers (1), developments (1), dynasties (1), economics (1), educators (1), emoticons (1), failures (1), families (1), farmers (1), fees (1), fields. (1), focuses (1), goods. (1), graphics (1), ideas (1), implications (1), indicators (1), individuals (1), inhabitants. (1), institutions. (1), instructions (1), instruments (1), interpretations (1), investors (1), islands (1), kilometres (1), lawmakers (1), lenders (1), liabilities (1), macromolecules (1), magnates (1), makers (1), managers (1), masses (1), materials (1), measures (1), merchants (1), minutes (1), mɛʁˈt͡seːdəs (1), nations (1), networks. (1), nichts (1), numbers (1), obligations (1), organizations (1), origin. (1), paradigms (1), parties (1), passengers (1), peptides (1), pipes (1), places (1), planners (1), points (1), politicians (1), polypeptides (1), powers (1), practices (1), practitioners (1), priorities (1), procedures (1), processes (1), professors (1), railways (1), reactions (1), relationships (1), repairs (1), representatives (1), requirements (1), roots (1), routes (1), savers (1), savings (1), schools (1), services. (1), sexes (1), shyness. (1), smileys (1), societies (1), sociograms (1), sociologists (1), spaces (1), spheres (1), spreaders (1), standards (1), statistics (1), stimuli (1), streets (1), structures. (1), subfields (1), subjects (1), substrates (1), systems. (1), tabulations (1), tasks (1), taxes (1), telecommunications (1), temples (1), ties (1), times (1), topics (1), traders (1), transfers (1), triads (1), trucks (1), turns (1), types (1), universities (1), variations (1), warming. (1), ways (1), wheels (1), workers (1), years (1)
- `POS`: `'s` (10), `'` (1), ˈbɛnt͡s (1)
- `PRP`: It (10), they (8), itself (4), them (4), I (3), One (1)
- `PRP$`: its (15), their (6)
- `RB`: also (18), not (9), often (7), commonly (5), generally (4), directly (3), only (3), primarily (3), rapidly (3), sometimes (3), then (3), together (3), widely (3), However (2), Still (2), about (2), as (2), back (2), e.g. (2), first (2), highly (2), mathematically (2), much (2), putatively (2), rather (2), thereby (2), usually (2), well (2), 11th (1), 78th (1), Apart (1), Conversely (1), Further (1), Once (1), Perhaps (1), Shortly (1), after (1), analytically (1), around (1), basically (1), broadly (1), chemically (1), closely (1), computationally (1), continuously (1), correctly (1), dramatically (1), effectively (1), efficiently (1), especially (1), essentially (1), even (1), far (1), flexibly (1), formerly (1), gradually (1), immediately (1), increasingly (1), indirectly (1), inherently (1), initially (1), instead (1), just (1), loosely (1), necessarily (1), normally (1), notably (1), now (1), principally (1), rarely (1), respectively (1), second (1), separately (1), substantially (1), super (1), typically (1), ultimately (1), universally (1), up (1)
- `RBR`: more (5), Later (1), less (1), longer (1)
- `RBS`: most (3)
- `RP`: up (2)
- `SYM`: `/` (4)
- `TO`: to (32)
- `VB`: be (23), include (5), Refer (3), see (3), have (2), pay (2), repay (2), achieve (1), act (1), become (1), believe (1), cause (1), charge (1), climate (1), conduct (1), consist (1), denote (1), develop (1), distinguish (1), encourage (1), engage (1), ensure (1), examine (1), exist (1), form (1), identify (1), induce (1), let (1), locate (1), measure (1), model (1), move (1), operate (1), place (1), provide (1), reduce (1), reflect (1), reimburse (1), require (1), return (1), run (1), serve (1), solve (1), specify (1), study (1), support (1), work (1)
- `VBD`: was (6), were (4), had (3), called (2), caused (2), made (2), took (2), accepted (1), added (1), appeared (1), applied (1), authored (1), became (1), began (1), built (1), carried (1), changed (1), deposited (1), did (1), dominated (1), emerged (1), led (1), provoked (1), referred (1), replaced (1), termed (1)
- `VBG`: including (5), being (3), making (3), According (2), Acting (2), developing (2), existing (2), increasing (2), lending (2), maintaining (2), using (2), writing (2), Lying (1), acquiring (1), analyzing (1), ascending (1), catalyzing (1), compiling (1), comprising (1), concerning (1), consisting (1), containing (1), contributing (1), corresponding (1), creating (1), describing (1), disposing (1), emphasizing (1), encompassing (1), establishing (1), explaining (1), funding (1), gaining (1), generating (1), granting (1), identifying (1), implementing (1), issuing (1), living (1), loaning (1), meaning (1), operating (1), reaching (1), refining (1), replicating (1), resolving (1), responding (1), resulting (1), taking (1), transporting (1), underlying (1), varying (1)
- `VBN`: used (11), known (8), based (6), called (6), been (3), considered (3), developed (3), directed (3), made (3), provided (3), applied (2), attached (2), credited (2), defined (2), degraded (2), designed (2), encoded (2), estimated (2), expanded (2), followed (2), performed (2), recorded (2), regarded (2), traded (2), accepted (1), added (1), adopted (1), affected (1), associated (1), authorized (1), balanced (1), blamed (1), bonded (1), closed (1), coded (1), collected (1), composed (1), constructed (1), contrasted (1), deferred (1), denominated (1), denoted (1), deposited (1), derived (1), described (1), devolved (1), dictated (1), disputed (1), divided (1), done (1), drawn (1), electrified (1), embodied (1), employed (1), enforced (1), equipped (1), established (1), evidenced (1), evolved (1), extended (1), formalized (1), formed (1), fueled (1), funded (1), given (1), headquartered (1), institutionalised (1), intended (1), lent (1), manufactured (1), measured (1), misfolded (1), modified (1), obligated (1), observed (1), organized (1), oriented (1), owed (1), packaged (1), perceived (1), played (1), powered (1), promulgated (1), propelled (1), recycled (1), referenced (1), regulated (1), related (1), risen (1), seen (1), started (1), studied (1), surrounded (1), taken (1), targeted (1), traced (1), transcribed (1), undirected (1), used. (1), weighed (1), withdrawn (1)
- `VBP`: are (39), include (7), have (6), focus (2), perform (2), `'m` (1), add (1), associate (1), connect (1), define (1), depend (1), differ (1), draw (1), emphasize (1), exchange (1), exist (1), hold (1), identify (1), increase (1), provoke (1), study (1), value (1), viewpoint (1)
- `VBZ`: is (80), has (10), allows (3), provides (3), considers (2), describes (2), does (2), includes (2), receives (2), refers (2), represents (2), specifies (2), takes (2), uses (2), alters (1), arranges (1), begins (1), behaves (1), borrows (1), concerns (1), consists (1), contains (1), covers (1), creates (1), delivers (1), dependencies (1), detects (1), determines (1), encompasses (1), entails (1), explores (1), focuses (1), forms (1), generates (1), happens (1), informs (1), investigates (1), involves (1), lies (1), means (1), oligopeptides (1), pays (1), permits (1), ranges (1), results (1), shares (1), shows (1), specializes (1), suggests (1), traces (1), varies (1)
- `WDT`: which (31), that (17)
- `WP`: What (2), who (1)
- `WRB`: where (2), wherever (1)
- `XX`: `(s)` (1)
- `"` (17)

And here are the bugs:
- With `.getWordForm()`, `.getSimplifiedWordForm()` or `.getLowerSimplifiedWordForm()`, the tokens at the end of a sentence (that is, the tokens followed by a DOT `.` token) get the `.` appended. Examples: `1986.`, `worldwide.`, etc. Is this a bug or is it done on purpose?
- `[` and `(` become `-LRB-`, while `]` and `)` become `-RRB-`
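
Until the first point is settled, a caller could split the trailing period off a word form before counting. The sketch below is a heuristic workaround, not ClearNLP behaviour: `split_trailing_period` is a hypothetical helper that works on plain strings, and it deliberately leaves forms with an inner period (`i.e.`, `U.S.`, `...`) untouched. An abbreviation without an inner period, such as `etc.`, would still be split, so treat it as a starting point only.

```python
from typing import List


def split_trailing_period(word_form: str) -> List[str]:
    """Split 'worldwide.' into ['worldwide', '.'].

    Forms that are only punctuation ('.', '...') or that contain another
    period before the last one ('i.e.', 'U.S.') are returned unchanged.
    """
    stem = word_form[:-1]
    if word_form.endswith(".") and stem and "." not in stem:
        return [stem, "."]
    return [word_form]
```

With this helper, `1986.` and `worldwide.` are each separated into the word and a `.` token, while `i.e.` stays in one piece.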
And the following are probably only due to the specific set of documents used for training (so not really bugs):
- `i.e.`, `etc`, `governance`, `societies` are recognised as Foreign Words (`FW`)
- `worldwide` is recognised as COMMA (`,`)
- `:-P` is not recognised as a smiley

As a side note, when you can't train your own dictionary, I suggest always loading all the files for the default dictionary, even though it uses a lot of memory (12 GB on my machine) and takes quite some time, because the quality of the NLP output improves significantly.
I hope this helps. Thanks for maintaining this framework, and keep up the great work!
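
For anyone who wants to reproduce the tally in this post (tokens grouped by POS tag and counted case-insensitively), here is a minimal sketch. It assumes the word forms and tags have already been pulled out of the tagger as `(word, tag)` string pairs; `tally_by_pos` is a hypothetical helper and does not call ClearNLP. Each word is displayed in the casing of its first occurrence, which is an assumption about how the listing above was produced.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def tally_by_pos(tagged_tokens: Iterable[Tuple[str, str]]) -> Dict[str, List[Tuple[str, int]]]:
    """Group (word, tag) pairs by tag and count words case-insensitively.

    Returns, for each tag, a list of (word, count) pairs sorted by
    descending count; the word keeps the casing of its first occurrence.
    """
    counts = defaultdict(Counter)   # tag -> Counter of lower-cased words
    display = {}                    # (tag, lower-cased word) -> first-seen form
    for word, tag in tagged_tokens:
        key = word.lower()
        counts[tag][key] += 1
        display.setdefault((tag, key), word)
    return {
        tag: [(display[(tag, key)], n) for key, n in counter.most_common()]
        for tag, counter in counts.items()
    }


if __name__ == "__main__":
    sample = [("The", "DT"), ("bank", "NN"), ("the", "DT"), ("Bank", "NN"), ("a", "DT")]
    for tag, entries in tally_by_pos(sample).items():
        print(tag, ", ".join(f"{word} ({n})" for word, n in entries))
```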