Open at15 opened 8 years ago
<dependency>
<groupId>org.icepdf</groupId>
<artifactId>icepdf-core</artifactId>
<version>5.0.7</version>
</dependency>
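It seems this can be used to view PDFs in a web page, but I don't know where it is actually used. It would be nice if it could extract images /w\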
<dependency>
<groupId>org.apache.pdfbox</groupId>
<artifactId>pdfbox-ant</artifactId>
<version>1.8.12</version>
</dependency>
It seems this can only extract text.
https://github.com/modesty/pdf2json does support parsing PDF files, but it does not support links.
We also came across Ruby .... hmm ....
Also a free parsing tool ....
btw: a DOI can be used to locate papers
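A rough sketch of how the DOI idea might work on text already extracted from a paper: scan for DOI-shaped strings and turn each into a https://doi.org/ resolver URL. The class and method names (`DoiFinder`, `findDois`, `resolverUrl`) are made up for illustration, and the regex is only a heuristic for the common modern DOI form, so it may miss or mangle unusual ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library mentioned in this thread.
class DoiFinder {
    // Heuristic: "10.", a 4-9 digit registrant prefix, "/", then a suffix.
    private static final Pattern DOI =
            Pattern.compile("10\\.\\d{4,9}/[-._;()/:A-Za-z0-9]+");

    // Returns each distinct DOI-like string in order of first appearance.
    static List<String> findDois(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = DOI.matcher(text);
        while (m.find()) {
            // Assumption: trailing punctuation belongs to the sentence, not the DOI.
            String doi = m.group().replaceAll("[.,;:)]+$", "");
            if (!found.contains(doi)) {
                found.add(doi);
            }
        }
        return found;
    }

    // Builds the resolver URL that redirects to the publisher's page.
    static String resolverUrl(String doi) {
        return "https://doi.org/" + doi;
    }

    public static void main(String[] args) {
        String sample = "See doi:10.1000/xyz123. Also https://doi.org/10.1234/abc.def-5, again.";
        for (String doi : findDois(sample)) {
            System.out.println(doi + " -> " + resolverUrl(doi));
        }
    }
}
```

Stripping a trailing `)` is a trade-off: it cleans up DOIs quoted in parentheses but would truncate one that really ends in `)`.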
Well, PHP also has a library ... https://github.com/smalot/pdfparser though only text is supported.
However, the most reliable option is still https://github.com/coolwanglu/pdf2htmlEX: convert the PDF to HTML first .... and then extract the information from that .....
Need to use a Docker mirror if I want to use this library ...
https://github.com/paquettg/php-html-parser PHP can also be used to parse the DOM .... hmm
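The convert-to-HTML-then-extract idea could look roughly like this for links, which is the part pdf2json was missing. This is a sketch in plain Java rather than PHP, and it assumes the converted HTML carries links as ordinary `<a href="...">` elements with double-quoted attributes; I have not checked that against real pdf2htmlEX output. `HtmlLinkExtractor` and `extractLinks` are hypothetical names, and a regex is a shortcut here, a real DOM parser would be more robust.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class HtmlLinkExtractor {
    // Matches an opening <a ...> tag and captures the double-quoted href value.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    // Returns href values in document order, duplicates included.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up fragment standing in for converted output.
        String html = "<div><a class=\"l\" href=\"https://example.org/ref1\">[1]</a>"
                + " text <A HREF=\"#page2\">p2</A></div>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

Links beginning with `#` would be internal jumps within the converted document, so they could be filtered out if only external references matter.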
There is an old guide at https://pdfclown.files.wordpress.com/2015/02/userguide.pdf
CLASSPATH=/home/at15/Downloads/PDFClown/java/pdfclown.lib/build/package
java -jar pdfclown-sample-cli.jar
The sample is up and running once the jar is on the classpath, but I don't know whether it will work well for parsing papers.
No PHP libraries ....
See https://github.com/jaeksoft/opensearchserver/blob/master/pom.xml to find more, since we need fine-grained control over PDF files.