Open deanmalmgren opened 10 years ago
I would love a metadata parser. JSON is easiest. Is this in the works?
@bef55 not by me; contributions welcome!
@deanmalmgren I would be glad to if I had the skills. Unfortunately I don't, which is how I landed here. Thanks all the same.
@deanmalmgren I would like to contribute to this issue. You can contact me at mohammedyunus009@gmail.com. It would be a pleasure to serve the community.
@mubaldino mentioned this in #18 but I thought I'd open a separate issue to have a more focused conversation on this particular feature
Other tools, such as Tika, also extract metadata that is embedded in the document. Is this something that we should also (optionally) extract with textract?
From the outset, the goal of this project has been to provide useful text extraction upstream of any subsequent natural language processing, analysis, and modeling. To the extent that metadata is also important for such applications (I've certainly used metadata in my projects before), I'm completely open to adding this functionality but I do have a strong opinion that parsers should not be required to extract metadata. The most important first step is to extract the text content; metadata can always be extracted later.
If we do end up switching to class-based parsers in #39, this would be relatively trivial to implement on a parser-by-parser basis by just adding a `metadata` method to the parser class.
What do others think about this?
Any thoughts on format (JSON vs XML vs CSV)? My initial preference would be for dictionaries and JSON, but I could be convinced otherwise.
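To make the idea concrete, here is a minimal sketch, assuming class-based parsers along the lines discussed in #39. The names `BaseParser`, `PlainTextParser`, and `metadata_as_json` are hypothetical and are not part of textract's API. The sketch only illustrates an optional `metadata` method that returns a dictionary, which can then be serialized to JSON.

```python
import json
import os


class BaseParser:
    """Hypothetical base class for illustration; not textract's actual API."""

    def extract(self, filename):
        # Every parser must implement text extraction.
        raise NotImplementedError

    def metadata(self, filename):
        # Metadata is optional: parsers that do not override this
        # simply report no metadata.
        return {}


class PlainTextParser(BaseParser):
    """Hypothetical parser that reads UTF-8 text files."""

    def extract(self, filename):
        with open(filename, encoding="utf-8") as stream:
            return stream.read()

    def metadata(self, filename):
        # Assumed example fields; a real parser would report whatever
        # is embedded in its own document format.
        return {
            "filename": os.path.basename(filename),
            "size_bytes": os.stat(filename).st_size,
        }


def metadata_as_json(parser, filename):
    # Dictionaries internally, JSON as the serialized output format.
    return json.dumps(parser.metadata(filename), sort_keys=True)
```

With this shape, a parser that only implements `extract` still works, and callers that want metadata get an empty dictionary instead of an error.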