-
Since files of this type often lack a title, author, subject, etc., I'm wondering whether there is a way to capture the first 100 characters or so from the beginning of the docume…
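To make the request concrete, here is a minimal sketch of the fallback being described, written as plain Java with no collector or importer API involved. The class `TitleFallback`, the method `fallbackTitle`, and the `maxChars` parameter are hypothetical names chosen for illustration; how such a helper would be wired into the crawler is left open.

```java
public final class TitleFallback {

    private TitleFallback() {
    }

    /**
     * Hypothetical helper (not part of any collector API): returns the
     * existing title when there is one, otherwise the first maxChars
     * characters of the document text.
     */
    public static String fallbackTitle(String title, String content, int maxChars) {
        // Keep a real title whenever the document provides one.
        if (title != null && !title.trim().isEmpty()) {
            return title.trim();
        }
        if (content == null) {
            return "";
        }
        // Collapse whitespace runs so line breaks and indentation
        // do not use up the character budget.
        String normalized = content.trim().replaceAll("\\s+", " ");
        if (normalized.length() <= maxChars) {
            return normalized;
        }
        // Length is counted in UTF-16 chars, which is enough for a sketch.
        return normalized.substring(0, maxChars);
    }
}
```

Calling `TitleFallback.fallbackTitle(null, text, 100)` would give the roughly 100-character excerpt mentioned above whenever the title field is missing.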
-
Is it possible to only remove documents with 404 status code?
(and also log the broken link)
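The behaviour asked for here can be pictured as a small decision rule. The sketch below is plain Java and deliberately avoids any collector API; `BrokenLinkPolicy`, `shouldDelete`, and `brokenLinks` are hypothetical names, and treating every non-404 status as "keep" is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;

public final class BrokenLinkPolicy {

    // URLs that returned 404, kept so they can be reported later.
    private final List<String> brokenLinks = new ArrayList<>();

    /**
     * Hypothetical rule: delete a document only when its URL returned 404,
     * and record that URL as a broken link. Any other status code
     * (for example 500 or a timeout) leaves the document in place.
     */
    public boolean shouldDelete(String url, int statusCode) {
        if (statusCode == 404) {
            brokenLinks.add(url);
            return true;
        }
        return false;
    }

    /** Returns a copy of the broken links recorded so far. */
    public List<String> brokenLinks() {
        return new ArrayList<>(brokenLinks);
    }
}
```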
-
Hi, I'm trying to gather information about links: the text near the anchor.
I'm using:
norconex-collector-http-2.0.2.zip with openjdk-7
I have this definition:
```
text/htm…
-
I'm using TextBetweenTagger in order to acquire HTML code from crawled pages. The configuration looks like:
```
^.*
.*$
```
However, this has pu…
-
Hi,
I need to use collector-http to get data from several sites that match a regular expression and store it in a database via a Java application. Is this possible with collector-http, and how …
-
Lots of 404s? No idea what's going on but @mjgiarlo told me to create a ticket. Here is all the output:
[aheadley actual-sufia 13:53:21]$ bundle exec rake spec
Running RuboCop...
Inspecting 441 files…
-
In the committer's commitBatch() function, I am trying to get the page's content so I can write it to a database.
``` java
public class CustomCommitter extends AbstractMappedCommitter {
...
@Override
protected void comm…
-
Hello!
I am writing my own committer implementation to put collected pages into a MySQL database.
I took SolrCommitter as an example. Is that the right approach?
So I inherited from AbstractMappedCommitt…
-
After running continuously for quite some time on Windows, the committer will have created a lot of folders (more than 500 000 in my case).
This severely degrades performance on Windows.
Curren…
-
I would like to know whether collector-http fits the requirements of the following task (I appreciate your time, and I read the documentation first but could not find some of the details):
A set of hundreds web …