Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Internal sharepoint website is giving a 403 Forbidden #917

Closed: michaelt16 closed this issue 4 months ago

michaelt16 commented 7 months ago

Hi Pascal,

I have a question regarding crawling an internal SharePoint site. Every time the crawler follows the internal links I get a 403 Forbidden, even though I have set up the login authentication. Is there anything else I should consider when trying to solve this issue?

For context, I am testing with a depth of 1. Let's say the first page also requires a login (which works, and the crawler is able to crawl it), but when the crawler follows the sublinks (i.e., the SharePoint sites) it gets a 403 error, even though a single login is normally enough to access both.

What are some things I should look at when troubleshooting this? Let me know if configuration or more context is needed.

Thank you -Michael

ohtwadi commented 7 months ago

Hi Michael,

The crawler offers generic NTLM support thanks to the Apache HttpClient library. It supports a few different NTLM protocol versions but may not support the one you are using. Details on supported versions: https://hc.apache.org/httpcomponents-client-4.5.x/ntlm.html

You may also want to check with your system administrator to see if there are extra security layers or special configuration requirements you need to be aware of. Maybe you need to pass custom HTTP headers, or go through a proxy (look at <headers> and <proxySettings>).

Finally, if all else fails, you can try to find out whether they offer a way to access your site via other authentication methods, or can whitelist the crawler IP, or provide some other workaround. There might be other network conditions you are not meeting with NTLM alone.

If you get a specific error from the crawler that suggests a bug, feel free to share your config here and the exact error/logs so we can look for a fix.
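One way to gather that kind of detail before sharing logs is to probe a failing URL without credentials and see which authentication schemes the server advertises. The snippet below is only a minimal sketch using the Python standard library, not part of the crawler. The helper names `offered_schemes` and `probe` are made up for illustration, and it assumes the server sends one challenge per `WWW-Authenticate` header line. An unauthenticated request normally gets a 401 listing the schemes. If you get a 403 with no challenge at all, the problem may lie outside the authentication scheme itself (permissions, a proxy, or another security layer).

```python
import urllib.error
import urllib.request


def offered_schemes(header_values):
    """Return the scheme name from each WWW-Authenticate header value.

    Assumption: one challenge per header line; comma-joined
    challenges inside a single value are not split.
    """
    schemes = []
    for value in header_values:
        parts = value.strip().split(None, 1)
        if parts:  # skip empty or whitespace-only values
            schemes.append(parts[0])
    return schemes


def probe(url):
    """Request url without credentials; return (status, offered schemes)."""
    request = urllib.request.Request(url, method="GET")
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            values = response.headers.get_all("WWW-Authenticate") or []
            return response.status, offered_schemes(values)
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses land here; their headers are still readable.
        values = err.headers.get_all("WWW-Authenticate") or []
        return err.code, offered_schemes(values)
```

Calling `probe(...)` on one of the sublinks that fails and on the first page that works, then comparing the two results, may show whether the two are protected in the same way.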

michaelt16 commented 7 months ago

Hi,

Thank you for your response. I switched the login to NTLM and unfortunately it still gave me a 403 Forbidden error.

I have been thinking about a possible solution, though I am not sure how it would work: using some type of Java browse bot alongside Norconex, since I was able to use a browse bot to log in to the SharePoint sites and retrieve the HTML contents.
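To make the idea a bit more concrete, here is a rough sketch of one possible bridge between the two: take the cookies the bot holds after logging in and turn them into a single `Cookie` header value, which could then be tried as a custom HTTP header. This is untested against SharePoint and only an assumption about how the pieces might fit together. The function name `cookie_header`, the cookie names, and the host names are placeholders, and the cookie format (a list of dicts with `name`, `value`, and `domain` keys) is assumed to match what the bot exports.

```python
def cookie_header(cookies, host):
    """Build a Cookie header value from cookies that apply to host.

    cookies: list of dicts with "name", "value" and "domain" keys
    (assumed export format). Cookies without a domain are skipped.
    """
    pairs = []
    for cookie in cookies:
        domain = cookie.get("domain", "").lstrip(".")
        # Keep a cookie only if host equals its domain or is a subdomain.
        if domain and (host == domain or host.endswith("." + domain)):
            pairs.append(f"{cookie['name']}={cookie['value']}")
    return "; ".join(pairs)


# Placeholder data standing in for what the bot captured after login.
session_cookies = [
    {"name": "sessionA", "value": "abc", "domain": ".intranet.example"},
    {"name": "tracker", "value": "xyz", "domain": "other.example"},
]
header_value = cookie_header(session_cookies, "sp.intranet.example")
```

Session cookies usually expire, so the bot would probably need to log in again and refresh the header before each crawl.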

stale[bot] commented 5 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.