DataFog / datafog-python

Open source PII detection and anonymization tool: easy-to-use, configurable, and extensible
https://www.datafog.ai
MIT License

Support for Other doc types #15

Closed: bazooka720 closed this issue 7 months ago

bazooka720 commented 7 months ago

Hi, is there a plan to support other document types, e.g. PDF (with images), JSON, or PPTX? If we need to enable these ourselves, what's the best way?

sidmohan0 commented 7 months ago

Hey @bazooka720! Thanks for submitting this request. Yes, there is. We're working on putting together a more visible roadmap w/ these changes, and just FYI, we'll be targeting Monday ~6 AM package releases (starting tomorrow) going forward for a more predictable cadence.

All three that you mentioned were on our internal roadmap; I'll make separate issue tickets here for 1) JSON, 2) PPTX and 3) PDF w/ images to track them publicly. We'll have the team do a little discovery and see if we can fit in these changes by tomorrow! Stay tuned.

sidmohan0 commented 7 months ago

@bazooka720 I just created this ticket for PDF parsing: https://github.com/DataFog/datafog-python/issues/18. Please feel free to comment there too. Do you have any example PDF types (w/ images) that might be good to keep in mind, e.g. medical record images vs. textbooks?

bazooka720 commented 7 months ago

Will do! Thanks

bazooka720 commented 7 months ago

One thing for you to consider is using the unstructured.io library. It seems like a combination of its capabilities with Presidio might make this pipeline more effective.
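
To make the idea concrete, here is a rough, stdlib-only sketch of the two-stage shape being suggested: an extraction stage that pulls text out of a document, followed by a detection stage that scans that text. It is not DataFog's, unstructured's, or Presidio's API. The names `extract_strings` and `find_emails` are hypothetical, JSON is used only because it needs no extra dependencies, and the email regex is a simplistic stand-in for a real PII detector.

```python
import json
import re
from typing import Iterator

# Hypothetical stand-in for a real detector such as Presidio:
# this pattern only matches simple email addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_strings(node: object) -> Iterator[str]:
    """Extraction stage: yield every string value in a parsed JSON document."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from extract_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from extract_strings(item)
    # Numbers, booleans, and null carry no free text, so they are skipped.


def find_emails(raw_json: str) -> list[str]:
    """Detection stage: run the detector over each extracted string."""
    found: list[str] = []
    for text in extract_strings(json.loads(raw_json)):
        found.extend(EMAIL_RE.findall(text))
    return found


sample = (
    '{"patient": {"name": "A. Example", '
    '"contact": ["a.example@example.com"]}, "notes": "none"}'
)
print(find_emails(sample))
```

Keeping extraction and detection as separate functions means either stage could be swapped for a library-backed one (a PDF or PPTX extractor, a fuller PII recognizer) without touching the other.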

sidmohan0 commented 7 months ago

Thanks for the suggestion. Let me give them another look. They made some major changes at the end of last year (moving to an API, requiring keys), so I held off then, but I will add this as an investigation item.