Jasmeet2011 opened this issue 4 years ago
So I have struggled quite a bit looking at the resources available online. It appears that Logstash does not have a parser for PDF files; the way forward seems to be a Ruby filter. Can anyone please point me in the right direction?
I tried further with some Ruby code:

```
filter {
  ruby {
    code => 'event.set("decoded", Base64.decode64(event.get("[resume]")))'
  }
}
```

and got this as output:
{ "dob" => 1979-12-31T18:30:00.000Z, "resume" => #<Sequel::SQL::Blob:0x7f6 bytes=53980 start="%PDF-1.5\r\n" end="800\r\n%%EOF">, "@version" => "1", "last_name" => "Schmitt", "@timestamp" => 2020-05-06T15:02:00.898Z, "decoded" => "<1u\xE7](n?\xD3\xCA\x97\xBF\t\xABZ\x96\x88?=\xA8\x1E\xB3m\x11\xFC\xB6\xA7\x81\xE9\xD4K\xF4\xAD\xAE\xE7-N\xB7\x9EF\x8A-\xDB\xAD\x11\xFC\xC6\xAB\x90\x89\xDF\xA3\xF3\x1A\xAEG\x9D\xB6\xBB\x9Ezwhn=\xB4\xA1\xB8\xFFO*^\xFC\xF6\xA0z\xCF\xC2\xA2\xE9\xED\xD7\xF2\xA2v\xCD\xF4E\xE9\xDD\xA1\xB8\xF7\xD2\x86\xE3\xFD<\xA9{\xF3\xDA\x81\xEF\xCFj\xB7\xA7\xB7m\x11\xFD\x17\xAC\xA2\xEA\xDCz\xCF\xC5\xA2{\x7F\x17^tG\xF1v\xEFD\x7F\x17\x7FtG\xF1x\xD7\x8D\x11\xFC^u\xEBD\x7F\x17\xAD|\xD1\x1F\xC5\xEFm\xF4G\xF3\xEB\xA1\xC4\x9E\xB7\xF3\xC3\x17\xF4\xDE\xC6\xDF\xC8\x99\xA8\x1E\a\xF2&j\a\x82\xFC\x89\x9A\x81\xE2?1\xE7bh\x1A1\xD3N\xB5\xDB\xBFv\xFC*'\xB5\xE9\xED\xB3\x8D\x11\xFCj\xE8\xBA\x9F\xD3\xCA\x97\xBF\x1A\xBA.\xA7\xF4\xBFN\xB6\xA7\xB2\x96\xABzw2\xFC$\xBF\r\xEB\xE2q\xE4F\a\xF4\xDAn\xCF\xD2\xFD+k\xB9\xCBOj\xB7\xA7\xB6\xCD\x1E\x9D\xDA\e\x8F\x8D(n?\xC5\x8A[^\xAF\xF1ej\xD7\x83y\xCA\x1D{\xF2\xDE\x9E\va\xE3\x9E=\xB2\xDA\xDEjlh\xEAtt\xF0\xD3@\xFC\xF4k\xCCJ\xE9\xFB\f\xFD\xCC\\g\xFC/\xC3\x0EtP\ef\xCE\xA8U\xF2]e\xF6\x13|i\xD2\xBD3A\xBD\xFA\x96\x93\x1C\xC8}<\xD9 E\xA2\e\xE0\xA5\xB7vwJ\xBETF\x9E\xDB\xDE\"b\x10\xAB5vt\xA3\x13SW\x8A\xB3q\xC6\xC7\xD0\xF3w\xFCg2f|\xD5\xEA,\xAC\xA2\x90\v\x9D=)\xE8\x10D^.%\xFE\x94\xDD\xDBp\xBC\x90\xD1\x95\x8C{", "phone" => "(403) 225-5556", "first_name" => "Carine ", "email" => "carine.s@gmail.com\r", "id" => 1 }
The PDF file that I am trying to read from the table still does not show up as readable text. Does this mean the file is not Base64 encoded?
Can the 'pdf-reader' gem be used in this case?
Where am I going wrong? Can anyone help, please?
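One sanity check I could run here (just a sketch; the `resume_encoding` field name is made up for illustration, and it assumes the column is the same `resume` blob shown above): look at the first bytes of the blob inside a ruby filter, since a PDF stored as raw bytes always starts with the `%PDF` magic marker, while a Base64-encoded one only shows that marker after decoding.

```
filter {
  ruby {
    init => "require 'base64'"
    code => '
      blob = event.get("resume").to_s
      # A PDF stored as raw bytes starts with the "%PDF" magic marker.
      # If the marker only appears after Base64-decoding, the column is Base64 encoded.
      if blob.start_with?("%PDF")
        event.set("resume_encoding", "raw pdf bytes")
      elsif Base64.decode64(blob).start_with?("%PDF")
        event.set("resume_encoding", "base64 encoded pdf")
      else
        event.set("resume_encoding", "unknown")
      end
    '
  }
}
```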
So I tried various options to index the PDF files stored in MySQL but could not succeed. The other option I tried was to write the data stream from the SQL table to a file (PDF or DOCX) using a Ruby filter and then parse that file into ES separately with FSCrawler.
I will post the code here in case it helps someone struggling with a similar problem. (Note: the columns_charset setting in the code below can be dropped, as it doesn't help.)
In the code below, I extract the BLOB data stored in the RESUME column and write the DOCX/PDF file to the filesystem.
```
input {
  jdbc {
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/mysqljdbc"
    jdbc_user => "root"
    jdbc_password => "XXXXXX"
    parameters => { "identity" => "4" }
    schedule => "*/3 * * * *"
    statement => "SELECT * from candidates where id = :identity"
    columns_charset => { "resume" => "ASCII-8BIT" }
  }
}
filter {
  ruby {
    init => "require 'base64'"
    code => 'File.binwrite("doc1.docx", event.get("[resume]"))'
  }
}
output {
  elasticsearch {
    hosts => "http://localhost:9200"
    index => "mysqldoc"
  }
  stdout { codec => rubydebug }
}
```
Something that still remains unclear to me: when I look at the indexed document in Kibana with ASCII-8BIT encoding, I can see the text content of the file along with some garbled headers and footers, while with UTF-8 encoding the output is the same as in the stdout console. I even tried opening the original document in a text editor to see if anything is readable, but no. All garbled!
I would be thankful if someone could explain this behaviour to me.
The stdout of the "resume" field in the console shows this:

```
"resume" => #
```
Can someone also indicate how to use a Ruby filter to convert the binary data of the DOC/PDF file coming from the JDBC connector into a JSON format that ES can ingest?
OK, I finally managed to write code that reads the PDF file with Ruby and writes its text to ES:

```
input {
  jdbc {
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/mysqljdbc"
    jdbc_user => "root"
    jdbc_password => "root"
    parameters => { "identity" => "1" }
    schedule => "*/1 * * * *"
    statement => "SELECT * from candidates where id = :identity"
  }
}
filter {
  ruby {
    code => "
      require 'pdf-reader'
      require 'json'
      # METHOD TO CLEAN THE STRING
      def convert(string2)
        string2.gsub(/\s+/, ' ').gsub('\n', '').gsub(/[[:^ascii:]]/, ' ')
      end
      # WRITE THE BLOB TO DISK SO pdf-reader CAN OPEN IT
      File.binwrite('doc1.pdf', event.get('resume'))
      puts event.get('resume')
      # READ THE PDF FILE AND CONVERT IT TO JSON USING THE FOLLOWING CODE
      file_to_be_converted = 'C:/Users/sun/Downloads/elk/logstash-6.8.0/bin/doc1.pdf'
      reader = PDF::Reader.new(file_to_be_converted)
      File.open('c:/Users/sun/Documents/pdftext.txt', 'w') do |text|
        reader.pages.each do |page|
          string2 = JSON.dump(page.text)
          text.write(convert(string2))
          File.open('c:/Users/sun/Documents/pdftext.json', 'w') { |f| f.write(convert(string2)) }
          # DELETE THE RESUME FIELD TO PREVENT THE FIELD CONVERSION ERROR BY ES
          event.remove('resume')
          # ADD A FIELD CONTAINING THE CLEANED PAGE TEXT
          event.set('json_content', convert(string2))
        end
      end
    "
  }
}
output {
  elasticsearch {
    hosts => "http://localhost:9200"
    index => "pdfjson"
  }
  stdout { codec => rubydebug }
}
```
Is there a way to bypass writing the file to disk and then reading it back? I don't know!
I'd be glad if someone could point me towards the correct way.
I modified the above code to read the binary string straight from the Event API by wrapping it in a StringIO object:

```ruby
require 'stringio'
# Convert the BINARY STRING into an in-memory IO object
file_to_read = StringIO.new(event.get('resume'))
if format == 'pdf'
  reader = PDF::Reader.new(file_to_read)
  reader.pages.each do |page|
    string2 = JSON.dump(page.text)
    event.set('json_content', convert(string2))
  end
end
```
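For completeness, here is a sketch of how that fragment might sit inside a full ruby filter. It is only an illustration under a few assumptions: the column is still called `resume`, the `convert` helper from the earlier config is reused, and `format` is simply hard-coded to 'pdf'. It also joins the text of all pages into one field instead of overwriting it page by page.

```
filter {
  ruby {
    init => "require 'pdf-reader'; require 'json'; require 'stringio'"
    code => "
      # Helper reused from the earlier config: collapse whitespace and drop non-ASCII
      def convert(string2)
        string2.gsub(/\s+/, ' ').gsub('\n', '').gsub(/[[:^ascii:]]/, ' ')
      end
      format = 'pdf'  # hard-coded here; in practice this could come from another column
      if format == 'pdf'
        # Wrap the BLOB bytes in an in-memory IO object so nothing touches the disk
        reader = PDF::Reader.new(StringIO.new(event.get('resume')))
        text = reader.pages.map { |page| convert(JSON.dump(page.text)) }.join(' ')
        event.set('json_content', text)
        event.remove('resume')
      end
    "
  }
}
```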
Now I am trying to send the contents of a DOCX stored in the blob field using the docx Ruby gem, but it is throwing an error.
I finally got a DOCX file read using the docx Ruby gem. The code is as follows:

```ruby
require 'docx'
# Write the BLOB to disk so the docx gem can open it
File.binwrite('new.docx', event.get('field'))
doc = Docx::Document.open('path/new.docx')
data = []
# Collect the text of every paragraph
doc.paragraphs.each do |p|
  data << p.text
end
# Collect the text of every table cell, if the document has tables
if doc.tables[0] != nil
  doc.tables.each do |table|
    table.rows.each do |row| # row-based iteration
      row.cells.each do |cell|
        data << cell.text
      end
    end
  end
end
event.set('doc_content', data.to_s)
```
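One small refinement worth considering (just a sketch; it keeps the `field` column name from the snippet above and handles paragraphs only for brevity): writing to a Tempfile instead of a fixed `new.docx` avoids filename collisions when several events are processed and cleans up after itself.

```ruby
require 'docx'
require 'tempfile'

data = []
# Tempfile.create hands us a uniquely named file and deletes it when the block exits
Tempfile.create(['resume', '.docx']) do |tmp|
  tmp.binmode
  tmp.write(event.get('field'))
  tmp.close
  doc = Docx::Document.open(tmp.path)
  doc.paragraphs.each { |p| data << p.text }
end
event.set('doc_content', data.to_s)
```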
Hello experts, I request some assistance with the issue described below. I am trying to use Logstash to read data from a MySQL database that has a blob field containing various file types. I want to read the blob data and send it to an Elasticsearch index. I am able to read the data from Word and text files; however, PDF files are ingested as unreadable data. I have tried reading them as ASCII, UTF-8, etc. Below is the config file:
```
jdbc {
}
```