dathere / datapusher-plus

A standalone web service that pushes data into the CKAN Datastore fast & reliably. It pushes real good!
GNU Affero General Public License v3.0

Add PDF to supported formats; summarize content and extract tags using LLM #90

Open jqnatividad opened 1 year ago

jqnatividad commented 1 year ago

The legacy Datapusher used to support PDFs, as messytables supported extracting tables from PDFs using pdftables.

That functionality has since been removed, along with Excel support.

We re-enabled Excel support in DP+ using qsv.

We should re-enable PDF support as well, not to extract tables for now (though there is tabula-rs), but to summarize the content for the Description field and suggest tags.
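
A rough sketch of the summarize-and-tag step, to make the idea concrete. This is not existing DP+ code: `suggest_metadata`, the prompt wording, and the `llm` callable are all hypothetical, and the PDF text is assumed to have been extracted already by some other tool. The only CKAN-specific assumption is that the dataset description lives in the `notes` field and that tags are a list of `{"name": ...}` dicts.

```python
import json
from typing import Callable

# Hypothetical prompt; the wording and the JSON reply shape are assumptions.
PROMPT_TEMPLATE = (
    "Summarize the following document in at most {max_words} words and "
    "suggest up to {max_tags} short tags. Reply with JSON only, in the form "
    '{{"description": "...", "tags": ["...", "..."]}}.\n\n{text}'
)


def suggest_metadata(
    pdf_text: str,
    llm: Callable[[str], str],
    max_words: int = 150,
    max_tags: int = 5,
    max_chars: int = 8000,
) -> dict:
    """Ask an LLM for a description and tags, shaped like a CKAN dataset patch.

    `llm` is any function that takes a prompt and returns the model's reply
    as a string (hypothetical; plug in whichever client is chosen).
    """
    # Truncate the text so the prompt stays within a bounded size.
    prompt = PROMPT_TEMPLATE.format(
        max_words=max_words, max_tags=max_tags, text=pdf_text[:max_chars]
    )
    reply = json.loads(llm(prompt))

    description = str(reply.get("description", "")).strip()

    # Normalize tags: trim, lowercase, drop blanks and duplicates.
    seen = set()
    tags = []
    for raw in reply.get("tags", []):
        name = str(raw).strip().lower()
        if name and name not in seen:
            seen.add(name)
            tags.append({"name": name})

    return {"notes": description, "tags": tags[:max_tags]}
```

The returned dict could be offered to the user as a suggestion rather than written to the dataset directly, since LLM output needs review before it ends up in published metadata.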

jqnatividad commented 1 year ago

Will be done when the qsv describegpt command is done. Though qsv is primarily focused on tabular data, describegpt will have a mode in a later version to summarize PDFs and get the description and tags for CKAN, which we can use in DP+.

https://github.com/jqnatividad/qsv/pull/1036

cc @rzmk @samibaig

jqnatividad commented 1 year ago

Thinking about it more, PDF summarization is outside the scope of qsv, so we should not add that functionality to qsv.

It is still in scope for DP+, though.