PDF-Crawling

DeuxHuitHuit / algolia-webcrawler

Simple node worker that crawls sitemaps in order to keep an algolia index up-to-date

Open kernpunkt-thermann opened 7 years ago

kernpunkt-thermann commented 7 years ago

Hi,

Any ideas or plans about crawling documents, especially PDFs?

Regards from Germany

nitriques commented 7 years ago

Hi!

That would require a lot of work, and it's not planned right now.

I would welcome any PR that tries to add this feature.

RayBB commented 6 years ago

If anyone is trying to do this, it looks like this tool might be helpful: https://www.npmjs.com/package/pdf-text-extract

nitriques commented 6 years ago

The problem with that approach is that we cannot use CSS selectors to find the content to index. But it is a start! Thanks for sharing.
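
Since extracted PDF text has no markup for CSS selectors to target, one possible workaround is to index the text page by page. Below is a rough sketch of that idea, not something the crawler does today. It assumes the extractor hands back an array of strings, one per page (which I believe is what pdf-text-extract passes to its callback). The helper name `pdfPagesToRecords` and the record fields other than Algolia's `objectID` are made up for illustration.

```javascript
// Hypothetical helper: turns per-page PDF text into index records.
// `pages` is assumed to be an array of strings, one per page.
function pdfPagesToRecords(url, pages) {
  return pages
    .map((text, i) => ({
      // objectID must be unique per record; the page anchor is an assumption.
      objectID: `${url}#page=${i + 1}`,
      url,
      page: i + 1,
      // Collapse the whitespace runs that text extraction tends to leave.
      text: text.replace(/\s+/g, ' ').trim(),
    }))
    // Skip pages with no extractable text (e.g. scanned images).
    .filter((record) => record.text.length > 0);
}

// Example with hand-written page text instead of a real PDF.
const records = pdfPagesToRecords('https://example.com/doc.pdf', [
  'First   page\ntext',
  '   ',
  'Third page',
]);
console.log(records);
```

Keeping one record per page keeps each record small and lets a search hit point at the page it came from, but it cannot pick out a specific region of the document the way a selector does on HTML.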