-
Hi,
Is there any way to crawl WordPress pages? I used the minimal XML config file without results.
Thanks
-
#### Given
A page linking to a [`tel:` URI](https://tools.ietf.org/html/rfc3966):
``` html
Norconex test
Phone Number
```
And the following config:
``` xml
…
```
-
I used the same [test page](https://herimedia.com/norconex-test.html) as in issue #202.
Config:
``` xml
Date,Content-Type
…
```
-
I build my collectors, crawlers, and committers programmatically, NOT with XML configuration files.
However, URLs that were already crawled are committed again when I run my collector a second time. Is there a flag or see…
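A common way to avoid re-committing unchanged documents is to compare a checksum of each document against the one recorded on the previous run. The sketch below is a generic illustration of that idea, not the collector's own API: `ChecksumStore` and `shouldCommit` are hypothetical names, and the in-memory map would have to be persisted between runs for the check to survive a restart.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: remembers the last checksum seen for each URL.
final class ChecksumStore {
    private final Map<String, String> lastChecksumByUrl = new HashMap<>();

    // Returns true when the URL is new or its content changed since it was last seen.
    boolean shouldCommit(String url, String content) {
        String current = checksum(content);
        String previous = lastChecksumByUrl.put(url, current);
        return !current.equals(previous);
    }

    // SHA-256 of the UTF-8 bytes, Base64-encoded so it is easy to store and compare.
    static String checksum(String content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(content.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(hash);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }
}
```

With this approach, a second run over the same pages would call `shouldCommit` with identical content and get `false`, so nothing is sent to the committer unless a page actually changed.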
-
I am using DOMTagger to extract some content, as shown below.
```
```
The result is a piece of HTML code. Can I use a tagger to remo…
-
When I fetch http://www.spprec.com/sczw/infodetail/?infoid=5f2c3843-86ce-4f22-a99d-c88e1c838aba&categoryNum=005002005, the returned title and content are garbled.
-
When I crawl a non-Unicode document (or more precisely: a document in a charset other than my platform default), the crawler correctly detects the document's encoding (by inspecting the "Content-Type"…
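To illustrate the kind of handling described here, the sketch below reads the `charset` parameter from a `Content-Type` header and decodes the body with it instead of the platform default. It is a minimal example under stated assumptions, not the crawler's implementation: `CharsetSketch` and its methods are hypothetical names, and UTF-8 is assumed as the fallback when no usable charset is declared.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper showing explicit charset handling.
final class CharsetSketch {

    // Extracts the charset parameter from a Content-Type value such as
    // "text/html; charset=ISO-8859-1"; returns the fallback if absent or unknown.
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    // Decodes the raw body with the declared charset, never the platform default.
    static String decode(byte[] body, String contentType) {
        return new String(body, charsetFromContentType(contentType, StandardCharsets.UTF_8));
    }
}
```

The point of the design is that the detected charset is passed explicitly at every byte-to-text conversion, since any step that falls back to the platform default can corrupt non-ASCII characters.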
-
I believe that I am seeing improper behavior of the robots.txt parser / filter.
#### Given
A robots.txt file that disallows access to some parent path but allows access to exceptions within that path…
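For reference, the Robots Exclusion Protocol (RFC 9309) resolves this kind of conflict by taking the most specific (longest) matching rule, with `Allow` winning a tie. The sketch below shows that rule for plain path prefixes only; it ignores wildcards and user-agent groups, and `RobotsRules` is a hypothetical class, not the crawler's parser.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified rule set for a single user-agent group.
final class RobotsRules {

    private static final class Rule {
        final String prefix;
        final boolean allow;

        Rule(String prefix, boolean allow) {
            this.prefix = prefix;
            this.allow = allow;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    RobotsRules allow(String prefix) {
        rules.add(new Rule(prefix, true));
        return this;
    }

    RobotsRules disallow(String prefix) {
        rules.add(new Rule(prefix, false));
        return this;
    }

    // Longest matching prefix wins; on equal length, Allow wins; no match means allowed.
    boolean isAllowed(String path) {
        Rule best = null;
        for (Rule r : rules) {
            if (r.prefix.isEmpty() || !path.startsWith(r.prefix)) {
                continue; // an empty rule value matches nothing
            }
            boolean longer = best == null || r.prefix.length() > best.prefix.length();
            boolean tieAllow = best != null
                    && r.prefix.length() == best.prefix.length() && r.allow;
            if (longer || tieAllow) {
                best = r;
            }
        }
        return best == null || best.allow;
    }
}
```

For example, with `new RobotsRules().disallow("/private/").allow("/private/public/")`, the path `/private/public/page.html` is allowed while `/private/secret.html` is not, regardless of the order in which the two rules were declared.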
-
Currently, DOMTagger treats all documents as UTF-8. It would be better if the user could specify the content encoding. Also, a flag controlling the removal of HTML would be useful.
-
Although the config reference suggests that a custom sitemapResolverFactory class can be specified, it looks like the class attribute is ignored and StandardSitemapResolverFactory is always used.