bazooka720 closed this issue 7 months ago
Hey @bazooka720! Thanks for submitting this request. Yes, there is. We're working on putting together a more visible roadmap w/ these changes, and just FYI, we'll be targeting Monday ~6AM package releases (starting tomorrow) going forward for a more predictable cadence.
All three that you mentioned were on our internal roadmap; I'll make separate issue tickets for 1) JSON, 2) PPTX and 3) PDF w/ images here to track publicly. We'll have the team do a little discovery and see if we can fit in these changes by tomorrow! Stay tuned.
@bazooka720 just created this ticket for PDF parsing: https://github.com/DataFog/datafog-python/issues/18 and please feel free to comment there too. Do you have any example PDF types (w/ images) that might be good to keep in mind? E.g. medical record images vs. textbooks.
Will do! Thanks
One thing for you to consider is using the unstructured.io library. Seems like a combination of its capabilities with presidio might make this pipeline more effective?
Thanks for the suggestion, let me give them another look. They made some major changes at the end of last year (moving to API, requiring keys), so I held off then, but I will add this as an investigation item.
Hi: Is there a plan to support other types, e.g. PDF (with images), JSON, PPTX, etc.? If we need to enable them, what's the best way?