Tilakkumar / cleartk

Automatically exported from code.google.com/p/cleartk

PlainTextCollectionReader's progress report is incorrect #72

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
Collection readers report progress on the collection through the getProgress method. This is generally done by reporting the number of completed documents and the total number of documents. This information is, for example, shown in UIMA's CPE GUI in the lower left corner during processing.

PlainTextCollectionReader always reports the total number of documents as 1,000,000, even though the actual number may be much lower. This is, at the least, confusing for the user, and it may cause more severe problems if another component relies on the reported information.
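
To make the symptom concrete, here is a minimal plain-Java sketch. It is not the cleartk code and does not use the UIMA classes; `ProgressSketch` and `HardCodedTotalDemo` are hypothetical stand-ins for the (completed, total) pair a reader reports. It only illustrates how a fixed total of 1,000,000 distorts the fraction a progress display would show.

```java
// Hypothetical stand-in for the (completed, total) pair a reader reports.
final class ProgressSketch {
    final long completed;
    final long total;

    ProgressSketch(long completed, long total) {
        this.completed = completed;
        this.total = total;
    }

    // Fraction of the collection that a progress display would show as done.
    double fraction() {
        return total == 0 ? 1.0 : (double) completed / total;
    }
}

class HardCodedTotalDemo {
    // The fixed total described in this report.
    static final long PLACEHOLDER_TOTAL = 1_000_000L;

    // Progress as reported when the total is a fixed placeholder.
    static ProgressSketch withPlaceholder(long completed) {
        return new ProgressSketch(completed, PLACEHOLDER_TOTAL);
    }

    // Progress as reported when the real number of documents is known.
    static ProgressSketch withRealTotal(long completed, long realTotal) {
        return new ProgressSketch(completed, realTotal);
    }
}
```

With 50 of 100 documents done, `withRealTotal(50, 100).fraction()` gives one half, while `withPlaceholder(50).fraction()` stays close to zero for the whole run.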

Original issue reported on code.google.com by phwetz...@gmail.com on 16 Mar 2009 at 8:36

GoogleCodeExporter commented 8 years ago
Yeah, that is my doing. I agree that this is pretty lame and I welcome suggestions. The files iterator has no way of knowing how many files it is going to come across until it actually goes through them. When I wrote that, I had just been working on a project that read in files from the Enron corpus, which has ~500K messages in it. Pre-calculating how many files there were created an unacceptable startup cost.

I suppose we could introduce a configurable option that says whether to count the number of files. I'm not sure what the right behavior should be if this option is set to false.

Original comment by pvogren@gmail.com on 16 Mar 2009 at 9:22

GoogleCodeExporter commented 8 years ago
Even when processing 500k messages, isn't the time required for the actual processing still many times greater than the time needed to count them? If the overall process is going to take three hours, I don't see much of a problem with a startup cost of a few minutes, if that's what it takes to make the class work according to spec and give accurate progress information. Were the actual numbers much worse than that?

Especially if the iterator is optimized for just counting files without opening them, it really shouldn't take too long.
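
A counting pass of this kind might look like the sketch below. It assumes a directory tree on the local file system and uses only standard `java.nio.file` calls; `FileCountSketch` and `countRegularFiles` are illustrative names, not part of cleartk. The walk inspects directory entries and file attributes only, so no file content is read.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

class FileCountSketch {
    // Counts regular files under root (recursively) without opening any of them.
    static long countRegularFiles(Path root) throws IOException {
        // Files.walk returns a lazily populated stream that must be closed.
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile).count();
        }
    }
}
```

How long this takes on a corpus of ~500K files depends on the file system and caching, so it would need to be measured rather than assumed.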

Original comment by phwetz...@gmail.com on 16 Mar 2009 at 9:30

GoogleCodeExporter commented 8 years ago
Also, I personally don't like the idea of having configurable options like that. IMO the standard classes should have a well-defined and delimited purpose and follow the expectations of the framework (in this case UIMA). This makes them easier to understand and to use. Special cases that require custom behavior should be done by sub-classing.

Original comment by phwetz...@gmail.com on 16 Mar 2009 at 9:39

GoogleCodeExporter commented 8 years ago
You are correct that the time to read in 500,000 documents is much longer than the time to count them. However, counting them is not an insignificant cost, in my opinion, even if it is only a few minutes. There is the annoyance factor caused by waiting a few minutes for the collection reader to initialize, only to have an analysis engine in your pipeline throw an exception. Do you really want to wait that long just to find out that, e.g., your CPE descriptor is not configured correctly?

For the collection reader I wrote for the aforementioned project dealing with the Enron corpus, I started off by having it count the files just as you suggest. However, we were trying to account for every second of CPU time spent, and adding a few minutes to our process was simply unacceptable, so I had to rip out the file counting.

I do not think we should add a clear performance hit when it is not necessary.  

Original comment by pvogren@gmail.com on 16 Mar 2009 at 10:49

GoogleCodeExporter commented 8 years ago
I'm with 3P here. I've been annoyed by the 1,000,000 thing several times now. Why don't we just create a NoProgressPlainTextCollectionReader which doesn't count files, and make PlainTextCollectionReader a subclass of that? That way you can choose whichever one makes more sense for your data. (I personally expect that 90% of our users will not be processing 500K documents, and will prefer the accurate progress reporting.)

Original comment by steven.b...@gmail.com on 16 Mar 2009 at 10:58

GoogleCodeExporter commented 8 years ago
I have committed changes to PlainTextCollectionReader. There is now a protected countFiles method which is called by the initialize method and used to provide a correct progress.
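
The shape of that change can be sketched roughly as follows. This is not the committed code: the class names, the method signatures, and the `UNKNOWN_TOTAL` sentinel are assumptions made for illustration, and the UIMA base classes are left out so the example stands alone. The point is only that a protected counting hook called from initialize lets a subclass skip the count when startup time matters.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical reader that counts its files once, during initialization.
class CountingReaderSketch {
    private final Path root;
    private long totalFiles;
    private long completed;

    CountingReaderSketch(Path root) {
        this.root = root;
    }

    // Called once before reading; pays the counting cost up front.
    void initialize() throws IOException {
        totalFiles = countFiles(root);
    }

    // Protected hook: subclasses may override it to change or skip the count.
    protected long countFiles(Path dir) throws IOException {
        try (Stream<Path> paths = Files.walk(dir)) {
            return paths.filter(Files::isRegularFile).count();
        }
    }

    // Records that one more document has been handed to the pipeline.
    void documentRead() {
        completed++;
    }

    long getCompleted() {
        return completed;
    }

    long getTotal() {
        return totalFiles;
    }
}

// Hypothetical variant for very large corpora: no counting pass at startup.
class NoCountReaderSketch extends CountingReaderSketch {
    // Assumed convention for "total not known"; the thread does not settle this.
    static final long UNKNOWN_TOTAL = -1L;

    NoCountReaderSketch(Path root) {
        super(root);
    }

    @Override
    protected long countFiles(Path dir) {
        return UNKNOWN_TOTAL;
    }
}
```

Callers that display progress would have to treat the sentinel specially, which is the open question raised earlier in this thread about what to report when the files are not counted.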

Original comment by pvogren@gmail.com on 23 Mar 2009 at 8:29