deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License

--metadata flag? #47

Open deanmalmgren opened 10 years ago

deanmalmgren commented 10 years ago

@mubaldino mentioned this in #18, but I thought I'd open a separate issue to have a more focused conversation on this particular feature.

Other tools, such as Tika, also extract metadata that is embedded in the document. Is this something that we should also (optionally) extract with textract?

From the outset, the goal of this project has been to provide useful text extraction upstream of any subsequent natural language processing, analysis, and modeling. To the extent that metadata is also important for such applications (I've certainly used metadata in my projects before), I'm completely open to adding this functionality but I do have a strong opinion that parsers should not be required to extract metadata. The most important first step is to extract the text content; metadata can always be extracted later.

If we do end up switching to class-based parsers in #39, this would be relatively trivial to implement on a parser-by-parser basis by just adding a metadata method to the parser class.
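
To make the idea concrete, here is a rough sketch of what an optional metadata method might look like. The class-based design in #39 is not settled, so `BaseParser`, `TxtParser`, `TxtParserWithMetadata`, and the `metadata` signature are all hypothetical names for illustration, not textract's actual API.

```python
import os


class BaseParser:
    """Hypothetical base class; not textract's actual API."""

    def extract(self, filename):
        # Text extraction is the one thing every parser must implement.
        raise NotImplementedError

    def metadata(self, filename):
        # Optional: parsers that do not extract metadata inherit this default.
        return {}


class TxtParser(BaseParser):
    """Illustrative parser that only extracts text."""

    def extract(self, filename):
        with open(filename, encoding="utf-8") as stream:
            return stream.read()


class TxtParserWithMetadata(TxtParser):
    """Illustrative parser that opts in to metadata extraction."""

    def metadata(self, filename):
        # File size stands in for real embedded metadata (author, title, ...).
        return {"size_bytes": os.path.getsize(filename)}
```

With a default that returns an empty dictionary, no parser is required to support metadata, and callers can treat every parser the same way.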

What do others think about this?

Any thoughts on format (json vs xml vs csv)? My initial preference would be for dictionaries and json but could be convinced otherwise.
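
As a minimal sketch of the dictionary-plus-JSON option, assuming metadata comes back as a plain Python dictionary: the helper name `metadata_to_json` and the example fields are made up for illustration and are not part of textract.

```python
import json
from datetime import datetime


def metadata_to_json(metadata):
    """Serialize a metadata dictionary to a JSON string (hypothetical helper)."""

    def encode(value):
        # The json module cannot serialize datetimes; emit ISO 8601 strings.
        if isinstance(value, datetime):
            return value.isoformat()
        raise TypeError(f"cannot serialize {type(value).__name__}")

    # sort_keys keeps the output stable across runs, which helps with diffs.
    return json.dumps(metadata, default=encode, sort_keys=True)


# Made-up example values, standing in for whatever a parser would return.
example = {"author": "Jane Doe", "created": datetime(2014, 7, 1, 12, 0), "pages": 3}
print(metadata_to_json(example))
```

One practical point in favor of dictionaries: nested or missing fields are easy to represent, whereas csv would need a fixed set of columns agreed on across all parsers.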

bef55 commented 6 years ago

I would love a metadata parser. JSON is easiest. Is this in the works?

deanmalmgren commented 6 years ago

@bef55 not by me; contributions welcome!

bef55 commented 6 years ago

@deanmalmgren I would be glad to if I had the skills. Unfortunately I don't, which is how I landed here. Thanks all the same.

mohammedyunus009 commented 5 years ago

@deanmalmgren I would like to contribute to this issue. You can contact me at mohammedyunus009@gmail.com. It would be a pleasure to serve the community.