elastic / logstash

Logstash - transport and process your logs, events, or other data
https://www.elastic.co/products/logstash

Reading BLOB data containing Pdf, Docx, from MySQL table #11861

Open Jasmeet2011 opened 4 years ago

Jasmeet2011 commented 4 years ago

Hello experts, I request some assistance with the issue described below. I am trying to use Logstash to read data from a MySQL database that has a BLOB column containing various file types, and to send the BLOB data to an Elasticsearch index. I am able to read the data from Word and text files; however, PDF files are ingested as unreadable data. I have tried reading the column as ASCII, UTF-8, etc. Below is the config file:
```
jdbc {
  jdbc_driver_class => "com.mysql.jdbc.Driver"
  jdbc_connection_string => "jdbc:mysql://localhost:3306/mysqljdbc"
  jdbc_user => "root"
  jdbc_password => "******"
  parameters => { "sl" => "6" }
  schedule => "*/1 * * * *"
  statement => "SELECT * from candidates where id = :sl"
  columns_charset => {
    "resume" => "ASCII-8BIT"
  }
}
```
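
To see exactly what the jdbc input hands to the pipeline, one thing I may try is a throwaway ruby filter that records the size, Ruby string encoding, and first bytes of the column (the `resume_*` fields below are only debug placeholders, not part of my real config):

```
ruby {
  code => '
    blob = event.get("resume")
    unless blob.nil?
      event.set("resume_bytes", blob.bytesize)          # size of the stored file
      event.set("resume_encoding", blob.encoding.name)  # e.g. "ASCII-8BIT" for raw binary
      event.set("resume_head", blob.byteslice(0, 8))    # magic bytes, e.g. "%PDF-1.5"
    end
  '
}
```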

Jasmeet2011 commented 4 years ago

So I have struggled quite a bit looking at the resources available online. It appears that Logstash does not have a parser for PDF files; the way forward seems to be Ruby. Can anyone please point me in the right direction?

Jasmeet2011 commented 4 years ago

Tried further with some Ruby code:

```
filter {
  ruby {
    code => 'event.set("decoded", Base64.decode64(event.get("[resume]")))'
  }
}
```

and got this as output:

```
{
           "dob" => 1979-12-31T18:30:00.000Z,
        "resume" => #<Sequel::SQL::Blob:0x7f6 bytes=53980 start="%PDF-1.5\r\n" end="800\r\n%%EOF">,
      "@version" => "1",
     "last_name" => "Schmitt",
    "@timestamp" => 2020-05-06T15:02:00.898Z,
       "decoded" => "<1u\xE7](n?\xD3\xCA\x97\xBF\t\xABZ\x96\x88?=\xA8\x1E\xB3m\x11\xFC\xB6\xA7\x81\xE9\xD4K\xF4\xAD\xAE\xE7-N\xB7\x9EF\x8A-\xDB\xAD\x11\xFC\xC6\xAB\x90\x89\xDF\xA3\xF3\x1A\xAEG\x9D\xB6\xBB\x9Ezwhn=\xB4\xA1\xB8\xFFO*^\xFC\xF6\xA0z\xCF\xC2\xA2\xE9\xED\xD7\xF2\xA2v\xCD\xF4E\xE9\xDD\xA1\xB8\xF7\xD2\x86\xE3\xFD<\xA9{\xF3\xDA\x81\xEF\xCFj\xB7\xA7\xB7m\x11\xFD\x17\xAC\xA2\xEA\xDCz\xCF\xC5\xA2{\x7F\x17^tG\xF1v\xEFD\x7F\x17\x7FtG\xF1x\xD7\x8D\x11\xFC^u\xEBD\x7F\x17\xAD|\xD1\x1F\xC5\xEFm\xF4G\xF3\xEB\xA1\xC4\x9E\xB7\xF3\xC3\x17\xF4\xDE\xC6\xDF\xC8\x99\xA8\x1E\a\xF2&j\a\x82\xFC\x89\x9A\x81\xE2?1\xE7bh\x1A1\xD3N\xB5\xDB\xBFv\xFC*'\xB5\xE9\xED\xB3\x8D\x11\xFCj\xE8\xBA\x9F\xD3\xCA\x97\xBF\x1A\xBA.\xA7\xF4\xBFN\xB6\xA7\xB2\x96\xABzw2\xFC$\xBF\r\xEB\xE2q\xE4F\a\xF4\xDAn\xCF\xD2\xFD+k\xB9\xCBOj\xB7\xA7\xB6\xCD\x1E\x9D\xDA\e\x8F\x8D(n?\xC5\x8A[^\xAF\xF1ej\xD7\x83y\xCA\x1D{\xF2\xDE\x9E\va\xE3\x9E=\xB2\xDA\xDEjlh\xEAtt\xF0\xD3@\xFC\xF4k\xCCJ\xE9\xFB\f\xFD\xCC\\g\xFC/\xC3\x0EtP\ef\xCE\xA8U\xF2]e\xF6\x13|i\xD2\xBD3A\xBD\xFA\x96\x93\x1C\xC8}<\xD9 E\xA2\e\xE0\xA5\xB7vwJ\xBETF\x9E\xDB\xDE\"b\x10\xAB5vt\xA3\x13SW\x8A\xB3q\xC6\xC7\xD0\xF3w\xFCg2f|\xD5\xEA,\xAC\xA2\x90\v\x9D=)\xE8\x10D^.%\xFE\x94\xDD\xDBp\xBC\x90\xD1\x95\x8C{",
         "phone" => "(403) 225-5556",
    "first_name" => "Carine ",
         "email" => "carine.s@gmail.com\r",
            "id" => 1
}
```

The PDF file that I am trying to read from the table still does not show up. Does it mean that the file is not Base64 encoded? Can the 'pdf-reader' gem be used in this case? Where am I going wrong? Can anyone help, please?
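
Looking at the dump again, `start="%PDF-1.5\r\n"` suggests the column already holds raw PDF bytes, so running `Base64.decode64` on them would only scramble them. A quick check along these lines (just a sketch; the `looks_like_raw_pdf` field is made up for debugging) could confirm that before decoding anything:

```
ruby {
  code => '
    blob = event.get("resume").to_s
    # A raw PDF begins with the "%PDF" magic bytes; if this is true there is
    # nothing Base64-encoded to decode in the first place.
    event.set("looks_like_raw_pdf", blob.start_with?("%PDF"))
  '
}
```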

Jasmeet2011 commented 4 years ago
  1. So I tried various options to index the PDF files stored in the MySQL table but could not succeed. The other option I tried was to write the data stream from the SQL table out as a file (PDF or DOCX) using a Ruby filter and then parse the file separately into ES using FSCrawler.

  2. I will write the code here in case it helps someone struggling with a similar problem. (Note: `columns_charset` in the code below may not be needed, as it doesn't help.) In the code below I extract the BLOB data stored in the RESUME column and write the DOCX/PDF file to the filesystem:

     ```
     input {
       jdbc {
         jdbc_driver_class => "com.mysql.jdbc.Driver"
         jdbc_connection_string => "jdbc:mysql://localhost:3306/mysqljdbc"
         jdbc_user => "root"
         jdbc_password => "XXXXXX"
         parameters => { "identity" => "4" }
         schedule => "*/3 * * * *"
         statement => "SELECT * from candidates where id = :identity"
         columns_charset => {
           "resume" => "ASCII-8BIT"
         }
       }
     }

     filter {
       ruby {
         init => "require 'base64'"
         code => '
           File.binwrite("doc1.docx", event.get("[resume]"))
         '
       }
     }

     output {
       elasticsearch {
         hosts => "http://localhost:9200"
         index => "mysqldoc"
       }
       stdout { codec => rubydebug }
     }
     ```

  3. Something that still remains unclear to me: when I look at the file indexed in Kibana with ASCII-8BIT encoding, I can see the text content of the file with some garbled headers and footers; with UTF-8 encoding, the output is the same as in the stdout console. I even tried opening the original document with a text editor to see if anything is readable, but no, it is all garbled. (One way to at least avoid indexing the raw bytes is sketched after this list.) [Kibana screenshot]

  4. I shall be thankful if someone is able to explain this behaviour to me. The stdout of the "resume" field in the console shows this: "resume" => #

  5. Can someone also indicate how to use a Ruby filter to convert the binary data of the DOC/PDF file coming from the JDBC connector into a JSON format that ES can ingest?
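
Regarding point 3, the garbling seems expected if the raw BLOB bytes themselves get indexed. Until the file is properly parsed, one stop-gap might be to drop the binary column or ship it Base64-encoded so ES at least receives valid text; a rough sketch (the `resume_b64` name is mine):

```
ruby {
  init => "require 'base64'"
  code => '
    blob = event.get("resume")
    unless blob.nil?
      # Keep a text-safe copy of the raw bytes and drop the binary original,
      # so the index no longer contains unreadable ASCII-8BIT data.
      event.set("resume_b64", Base64.strict_encode64(blob.to_s))
      event.remove("resume")
    end
  '
}
```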

Jasmeet2011 commented 4 years ago

OK, I finally managed to write code that reads the PDF file using Ruby and writes it to ES:

```
input {
  jdbc {
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/mysqljdbc"
    jdbc_user => "root"
    jdbc_password => "root"
    parameters => { "identity" => "1" }
    schedule => "*/1 * * * *"
    statement => "SELECT * from candidates where id = :identity"
  }
}

filter {
  ruby {
    code => "
      require 'pdf-reader'
      require 'json'

      # METHOD TO CLEAN THE STRING
      def convert(string2)
        string3 = string2.gsub(/\s+/, ' ').gsub('\n', '').gsub(/[[:^ascii:]]/, ' ')
      end

      File.binwrite('doc1.pdf', event.get('resume'))
      puts event.get('resume')

      # READ THE PDF FILE AND CONVERT TO JSON USING THE FOLLOWING CODE
      file_to_be_converted = 'C:/Users/sun/Downloads/elk/logstash-6.8.0/bin/doc1.pdf'
      reader = PDF::Reader.new(file_to_be_converted)

      text = File.open('c:/Users/sun/Documents/pdftext.txt', 'w')
      reader.pages.each do |page|
        string2 = JSON.dump(page)
        if convert(string2) == nil
          text.write(string2)
        else
          text.write(convert(string2))
        end
        File.open('c:/Users/sun/Documents/pdftext.json', 'w') { |f| f.write(convert(string2)) }

        # DELETE THE RESUME FIELD TO PREVENT THE FIELD CONVERSION ERROR BY ES
        event.remove('resume')
        # ADD A FIELD CONTAINING THE CONTENTS OF THE JSON OBJECT
        event.set('json_content', convert(string2))
      end
    "
  }
}

output {
  elasticsearch {
    hosts => "http://localhost:9200"
    index => "pdfjson"
  }
  stdout { codec => rubydebug }
}
```

Is there a way to bypass writing the file to disk and then reading it back? I don't know. I would be glad if someone can point towards the correct way.
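
One thing I noticed afterwards: because `event.set('json_content', ...)` runs inside the page loop, only the last page ends up in the field. A variant that collects every page first could look like this (only a sketch, reusing the same `convert` helper and file path):

```ruby
reader = PDF::Reader.new(file_to_be_converted)
# Clean each page and join everything, so the field holds the whole document
# rather than just the final page.
all_text = reader.pages.map { |page| convert(JSON.dump(page.text)) }.join(' ')
event.set('json_content', all_text)
event.remove('resume')
```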

Jasmeet2011 commented 4 years ago

Modified the above code to read the binary string straight from the Event API using `StringIO`, instead of writing a temporary file first:

```ruby
# Convert BINARY STRINGS to an IO object.
# StringIO allows strings to behave like IOs. This is useful when we want to pass
# strings into systems that consume streams. This is common in tests, where we might
# inject a StringIO instead of reading an actual file from disk.
file_to_read = StringIO.new(event.get('resume'))

if format == 'pdf'
  # The IO object is passed to the PDF reader instead of a path to a pdf file.
  reader = PDF::Reader.new(file_to_read)
  reader.pages.each do |page|
    string2 = JSON.dump(page.text)
    convert(string2)
    # ADD A FIELD CONTAINING THE CONTENTS OF THE JSON OBJECT
    event.set('json_content', convert(string2))
  end
end
```

Now I am trying to send the contents of a DOCX stored in the BLOB field using the docx Ruby gem, however it is throwing an error.
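
The `format` variable in the snippet above is not defined anywhere yet; one way I could derive it is from the magic bytes of the BLOB itself (a sketch, assuming the column only ever holds PDF and DOCX files):

```ruby
blob = event.get('resume').to_s
format =
  if blob.start_with?('%PDF')
    'pdf'
  elsif blob.start_with?("PK\x03\x04") # DOCX files are ZIP containers, which start with "PK"
    'docx'
  end
```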

Jasmeet2011 commented 4 years ago

Got the DOCX file read using the docx Ruby gem. The code is as follows:

```ruby
File.binwrite('new.docx', event.get('field'))
doc = Docx::Document.open('path/new.docx')
data = []

# Retrieve and display paragraphs
doc.paragraphs.each do |p|
  # SINCE THE PARAGRAPHS ARE IN AN ARRAY, THE OUTPUT IS SENT TO AN ARRAY
  data << p.text
  event.set('doc_content', data.to_s)
end

# ----------- FOR OPENING TABLE CONTENT
if doc.tables[0] != nil
  # Iterate through tables
  doc.tables.each do |table|
    table.rows.each do |row| # Row-based iteration
      row.cells.each do |cell|
        data << cell.text
      end
    end
  end
end
event.set('doc_content', data.to_s)
```
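
As a small follow-up, I may also clean up after the DOCX branch the same way as the PDF one; this is only a sketch (the column name 'field' and the temporary path are the placeholders from the snippet above):

```ruby
# Join the collected paragraphs/cells instead of Array#to_s, so Kibana shows plain text.
event.set('doc_content', data.join(' '))
# Drop the raw binary column to avoid the field conversion error in ES,
# and remove the temporary file once it has been parsed.
event.remove('field')
File.delete('path/new.docx') if File.exist?('path/new.docx')
```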