Closed GoogleCodeExporter closed 8 years ago
Yeah, that is my doing. I agree that this is pretty lame and I welcome
suggestions.
The files iterator has no way of knowing how many files it is going to come
across until it actually goes through them. When I wrote that, I had just been
working on a project that read in files from the Enron corpus, which has ~500K
messages in it. Pre-calculating how many files there were created an
unacceptable startup cost.
I suppose we could introduce a configurable option that says whether to count
the number of files. I'm not sure what the right behavior should be if this
option is set to false.
Original comment by pvogren@gmail.com
on 16 Mar 2009 at 9:22
Even when processing 500k messages, isn't the time required for the actual
processing still many times longer than the time needed to count them? If the
overall process is going to take three hours, I don't see much of a problem
with a startup cost of a few minutes, if that's what it takes to make the class
work according to spec and give accurate progress information. Were the actual
numbers much worse than that?
Especially if the iterator is optimized for just counting files without opening
them, it really shouldn't take too long.
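As an illustrative sketch only (not code from the project), counting without
opening files could look roughly like this on a modern JDK. It assumes
java.nio.file is available, and FileCounter and countRegularFiles are
hypothetical names. The walk reads directory entries and file attributes; it
never opens file contents.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical helper for illustration; not part of the library discussed here.
final class FileCounter {
    private FileCounter() {}

    /** Counts regular files under root using only directory and attribute data. */
    static long countRegularFiles(Path root) throws IOException {
        final long[] count = {0};
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                // attrs is supplied by the walk itself, so the file is never opened.
                if (attrs.isRegularFile()) {
                    count[0]++;
                }
                return FileVisitResult.CONTINUE;
            }
        });
        return count[0];
    }
}
```

Whether this is fast enough for ~500K files depends on the file system and
on caching, so it would need to be measured rather than assumed.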
Original comment by phwetz...@gmail.com
on 16 Mar 2009 at 9:30
Also, I personally don't like the idea of having configurable options like
that. IMO the standard classes should have a well-defined and delimited purpose
and follow the expectations of the framework (in this case UIMA). This makes
them easier to understand and to use. Special cases that require custom
behavior should be done by sub-classing.
Original comment by phwetz...@gmail.com
on 16 Mar 2009 at 9:39
You are correct that the time to read in 500,000 documents is much longer than
the time to count them. However, counting them is not an insignificant cost, in
my opinion, even if it is only a few minutes. There is the annoyance factor of
waiting a few minutes for the collection reader to initialize, only to have an
analysis engine in your pipeline throw an exception. Do you really want to wait
that long just to find out that, e.g., your CPE descriptor is not configured
correctly?
For the collection reader I wrote for the aforementioned project dealing with
the Enron corpus, I started off by having it count the files just as you
suggest. However, we were trying to account for every second of CPU time spent,
and adding a few minutes to our process was simply unacceptable, so I had to
rip out the file counting.
I do not think we should add a clear performance hit when it is not necessary.
Original comment by pvogren@gmail.com
on 16 Mar 2009 at 10:49
I'm with 3P here. I've been annoyed by the 1,000,000 thing several times now.
Why don't we just create a NoProgressPlainTextCollectionReader which doesn't
count files, and make PlainTextCollectionReader a subclass of that? That way
you can choose whichever one makes more sense for your data. (I personally
expect that 90% of our users will not be processing 500K documents, and will
prefer the accurate progress reporting.)
Original comment by steven.b...@gmail.com
on 16 Mar 2009 at 10:58
I have committed changes to PlainTextCollectionReader. There is now a protected
countFiles method, which is called by the initialize method and whose result is
used to report correct progress.
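The shape of that change can be sketched roughly as follows. This is not the
committed code: SketchFileReader, NoCountFileReader, getTotalFiles and the
UNKNOWN marker are hypothetical names, the UIMA base classes are left out, and
only the idea of a protected countFiles hook called from initialize comes from
the description above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical, simplified reader; the real class extends UIMA types omitted here.
class SketchFileReader {
    protected final Path root;
    private long totalFiles;

    SketchFileReader(Path root) {
        this.root = root;
    }

    /** Called once before reading starts; caches the total used for progress. */
    public void initialize() {
        totalFiles = countFiles();
    }

    /** Overridable hook: by default, walks the tree and counts regular files. */
    protected long countFiles() {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile).count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public long getTotalFiles() {
        return totalFiles;
    }
}

// For very large corpora, a subclass can skip the up-front walk entirely.
class NoCountFileReader extends SketchFileReader {
    /** Marker for "total not known"; an assumption of this sketch. */
    static final long UNKNOWN = -1L;

    NoCountFileReader(Path root) {
        super(root);
    }

    @Override
    protected long countFiles() {
        return UNKNOWN;
    }
}
```

Because the hook is protected, anyone who cannot afford the startup cost can
override it in a subclass, while the default reader reports an accurate total.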
Original comment by pvogren@gmail.com
on 23 Mar 2009 at 8:29
Original issue reported on code.google.com by
phwetz...@gmail.com
on 16 Mar 2009 at 8:36