flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

text classification label fine sanity #1312

Closed DecentMakeover closed 4 years ago

DecentMakeover commented 4 years ago

Hi, thanks for sharing your work. I am trying to run text classification on the 20 Newsgroups dataset, but the F-score does not go higher than 60. I just wanted to check whether I have formatted the labels correctly. Below I have posted the first few elements of the dataset in my CSV. Could anyone comment on this?

__label__rec.motorcycles    "From: michaelb@compnews.co.uk (Michael Burton)
Subject: Performance Bike Frenzy at Cadwell
Organization: Computer Newspaper Services, Howden, UK.
Lines: 7
NNTP-Posting-Host: cassia.compnews.co.uk
X-Newsreader: Tin 1.1 PL4

Is anyone going to the P.B frenzy at Cadwell park in May.
I am going, but only to watch.

--
    When asked what would I most want to try before doing it, 
                    I said Death. 
"
__label__sci.space  "From: shafer@rigel.dfrf.nasa.gov (Mary Shafer)
Subject: Re: Space Research Spin Off
In-Reply-To: prb@access.digex.com's message of 6 Apr 1993 14:06:57 -0400
Organization: NASA Dryden, Edwards, Cal.
    <pgf.734062799@srl03.cacs.usl.edu>
    <SHAFER.93Apr6094402@rigel.dfrf.nasa.gov> <1psgs1$so4@access.digex.net>
Lines: 38

On 6 Apr 1993 14:06:57 -0400, prb@access.digex.com (Pat) said:

Pat> In article <SHAFER.93Apr6094402@rigel.dfrf.nasa.gov>
Pat> shafer@rigel.dfrf.nasa.gov (Mary Shafer) writes:

>successful we were.  (Mind you, the Avro Arrow and the X-15 were both
>fly-by-wire aircraft much earlier, but analog.)
>

Pat> Gee, I thought the X-15 was Cable controlled.  Didn't one of them
Pat> have a total electrical failure in flight?  Was there machanical
Pat> backup systems?

All reaction-controlled aircraft are fly-by-wire, at least the RCS part
is.  On the X-15 the aerodynamic control surfaces (elevator, rudder, etc)
were conventionally controlled (pushrods and cables) but the RCS jets
were fly-by-wire.

|The NASA habit of acquiring second-hand military aircraft and using
|them for testbeds can make things kind of confusing.  On the other
|hand, all those second-hand Navy planes give our test pilots a chance
|to fold the wings--something most pilots at Edwards Air Force Base
|can't do.  

Pat> What do you mean?  Overstress the wings, and they fail at teh
Pat> joints?

Navy aircraft have folding or sweeping wings, in order to save space
on the hangar deck.  The F-14 wings sweep, all the rest fold the
wingtips up at a joint.

Air Force planes don't have folding wings, since the Air Force has
lots of room.

--
Mary Shafer  DoD #0362 KotFR NASA Dryden Flight Research Facility, Edwards, CA
shafer@rigel.dfrf.nasa.gov                    Of course I don't speak for NASA
 ""A MiG at your six is better than no MiG at all.""  Unknown US fighter pilot
"
severinsimmler commented 4 years ago

The fastText text classification format expects one instance per line, but your file still has line breaks inside each document. Your model sees only the text on the lines starting with __label__; everything else (i.e. most of the document) is ignored. Try replacing each \n within a document with a single space.
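For illustration, the newline replacement suggested above could look roughly like the following. This is only a sketch: the helper name `to_fasttext_line` and the shortened sample documents are made up for this example, and it collapses every run of whitespace (not just `\n`) into one space, which goes slightly beyond the suggestion.

```python
import re


def to_fasttext_line(label: str, text: str) -> str:
    """Build one fastText-style line: '__label__<label> <text on a single line>'.

    Hypothetical helper, not part of Flair or fastText.
    """
    # Collapse newlines, tabs and repeated spaces into a single space.
    flat = re.sub(r"\s+", " ", text).strip()
    return f"__label__{label} {flat}"


# Shortened stand-ins for the two posts shown in the question.
docs = [
    ("rec.motorcycles", "Subject: Performance Bike Frenzy at Cadwell\nLines: 7\n\nIs anyone going?"),
    ("sci.space", "Subject: Re: Space Research Spin Off\n\nAll reaction-controlled aircraft\nare fly-by-wire."),
]

lines = [to_fasttext_line(label, text) for label, text in docs]

# Every document now occupies exactly one line of the output.
output = "\n".join(lines)
print(output)
```

Writing `output` to the training file would then give one instance per line, with the label prefix at the start of each.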

DecentMakeover commented 4 years ago

@severinsimmler Okay, thanks, I'll check that.

But isn't it supposed to raise an error if the file does not have one instance per line?

severinsimmler commented 4 years ago

I think each line that does not start with __label__ is treated as not relevant, e.g. like a comment, so raising an error is probably not the expected behavior.
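Since no error is expected in this case, a quick check before training can reveal whether a file has this problem. The sketch below is a hypothetical helper (`find_unlabeled_lines` is not a Flair function) that reports the line numbers of non-empty lines lacking the `__label__` prefix. It makes no claim about how Flair itself handles such lines.

```python
def find_unlabeled_lines(lines, prefix="__label__"):
    """Return 1-based numbers of non-empty lines that do not start with the prefix."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        if line.strip() and not line.startswith(prefix)
    ]


# A small sample in the same shape as the file from the question.
sample = [
    '__label__rec.motorcycles    "From: michaelb@compnews.co.uk (Michael Burton)',
    "Subject: Performance Bike Frenzy at Cadwell",
    "",
    "__label__sci.space one document on one line",
]

bad = find_unlabeled_lines(sample)
print(f"{len(bad)} line(s) without a label prefix: {bad}")
```

For a real file, the same function can be called on `open(path, encoding="utf-8").read().splitlines()`; a non-empty result suggests that documents are still split across several lines.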

DecentMakeover commented 4 years ago

Okay, thanks for the help.