freeflowuniverse / crystallib

Apache License 2.0

pdf, docx, html to text fragments #293

Open despiegk opened 8 months ago

despiegk commented 8 months ago

What is the best way to convert pdf, docx, and html into a list of text fragments?

These text fragments can then be given to vilnus (see #292).
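
To make "list of text fragments" concrete, here is a minimal sketch for the html case only, using just the Python standard library. It is an illustration of the desired output shape, not the implementation of any tool mentioned in this thread. The names `FragmentParser` and `html_to_fragments`, and the choice of which tags end a fragment, are assumptions made for this example.

```python
from html.parser import HTMLParser

# Assumption: these tags mark fragment boundaries. Adjust to taste.
BLOCK_TAGS = {
    "p", "div", "li", "br", "tr", "pre", "blockquote", "section", "article",
    "h1", "h2", "h3", "h4", "h5", "h6",
}
# Content inside these tags is never treated as text.
SKIP_TAGS = {"script", "style"}


class FragmentParser(HTMLParser):
    """Collects visible text, cutting a new fragment at each block-level tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fragments = []
        self._buf = []
        self._skip_depth = 0

    def flush(self):
        # Collapse runs of whitespace and drop empty fragments.
        text = " ".join("".join(self._buf).split())
        if text:
            self.fragments.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buf.append(data)


def html_to_fragments(html):
    """Return the visible text of `html` as a list of fragments."""
    parser = FragmentParser()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing text after the last block tag
    return parser.fragments
```

For example, `html_to_fragments("<h1>Title</h1><p>Some <b>bold</b> text.</p>")` yields one fragment per block element. pdf and docx need format-specific parsers and are not covered by this sketch.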

requirements

todo

ashraffouda commented 7 months ago

Since we don't have tools in vlang to do this, and ready-made tools would not behave the same on all platforms (Windows, Linux, and macOS), I created this tool in Rust, which can be built and used as a binary: https://github.com/ashraffouda/extractor. The PR for crystallib is here: https://github.com/freeflowuniverse/crystallib/pull/311