-
Hi Guys,
Just wondering whether the text extraction order can be defined for a PDF file. As pointed out [here](https://pdfbox.apache.org/2.0/faq.html#textorder), is there a similar setting to adjust the …
-
ubuntu 18.04; java: openjdk version "1.8.0_222"; maven: 3.6.0
The source code is located at: https://github.com/apache/tika/archive/master.zip
mvn clean install stopped due to the following e…
-
### Problem Description
Currently the connector extracts the entire raw email, which in most cases is insufficient for correct usage, especially when adding ML pipelines:
![image](https://…
-
When indexing large documents you may hit limits not only on the indexing part, but also when doing searches.
Splitting documents into one entry per page helps slice up large documents into bite-s…
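
As a rough illustration of the one-entry-per-page idea, here is a minimal sketch. It assumes the extracted text marks page breaks with a form feed character (`\f`), which is common for text extractors but is not stated in the issue; `split_into_pages` and the entry fields are hypothetical names chosen for this example.

```python
def split_into_pages(text: str, page_break: str = "\f") -> list[dict]:
    """Return one entry per non-empty page, keeping the original page number.

    Assumption: pages in `text` are separated by `page_break` (form feed by default).
    """
    entries = []
    for number, page in enumerate(text.split(page_break), start=1):
        body = page.strip()
        if body:  # skip blank pages, but keep numbering aligned with the source
            entries.append({"page": number, "text": body})
    return entries


# Hypothetical sample: four pages, the third one blank.
sample = "First page text.\fSecond page text.\f\fFourth page text."
pages = split_into_pages(sample)
```

Each entry can then be indexed on its own, so a search hit points at a page rather than at the whole document.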
-
### Please ask your question
/Library/Python/3.9/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.15) or chardet (5.1.0)/charset_normalizer (3.1.0) doesn't match a supported version!
…
-
Hello Maria, thank you for this very impressive work. I tried to run it on my Mac and had a few installation steps to overcome, which I documented here:
I tried to submit a pull request but I got den…
-
```
Add a metadata crawler for multimedia files which gathers information about
files present in the db.
```
Original issue reported on code.google.com by `hbwint...@gmail.com` on 11 Nov 2011 at 6:2…
-
Hi,
I've had Tika working well for a while parsing PDFs, but realised recently that paragraphs longer than about 240 characters (including spaces) are getting cut off/truncated. Is there any way…
-
Hi,
nlm-ingestor seems promising, but I wasn't able to get past an installation issue.
I got the "**_ERROR: Failed to build installable wheels for some pyproject.toml based projects…
-
**Organizational Page**: [AutoMeta](https://github.com/NCEAS/open-science-codefest/wiki/AutoMeta)
**Category**: Coding
**Title**: Automatically extract metadata of R dataframes
**Proposed by**: Ted H…