galterlibrary / digital-repository

DigitalHub - Institutional Repository for Galter Health Sciences
https://digitalhub.northwestern.edu/

Errors when creating files close to 4GB #449

Closed phebal closed 8 years ago

phebal commented 8 years ago

Rails log:

W, [2016-08-09T14:40:39.349714 #25932]  WARN -- : Sufia::GenericFile::Actor::save_and_record_committer Caught RSOLR error #<RSolr::Error::Http: RSolr::Error::Http - 400 Bad Request
Error: {'responseHeader'=>{'status'=>400,'QTime'=>203},'error'=>{'msg'=>'ERROR: [doc=cd4ec986-a619-4ab1-bf1f-1f018bebbce9] Error adding field \'file_size_is\'=\'4019191808\' msg=For input string: "4019191808"','code'=>400}}

URI: http://localhost:8983/solr/staging/update?wt=ruby&softCommit=true
Request Headers: {"Content-Type"=>"text/xml"}
Request Data: "<?xml version=\"1.0\" encoding=\"UTF-8\"?><add><doc><field name=\"system_create_dtsi\">2016-08-09T19:40:10Z</field><field name=\"system_modified_dtsi\">2016-08-09T19:40:38Z</field><field name=\"active_fedora_model_ssi\">GenericFile</field><field name=\"has_model_ssim\">GenericFile</field><field name=\"id\">cd4ec986-a619-4ab1-bf1f-1f018bebbce9</field><field name=\"object_profile_ssm\">{\"id\":\"cd4ec986-a619-4ab1-bf1f-1f018bebbce9\",\"mime_type\":null,\"format_label\":[],\"file_size\":[],\"last_modified\":[],\"filename\":[],\"original_checksum\":[],\"rights_basis\":[],\"copyright_basis\":[],\"copyright_note\":[],\"well_formed\":[],\"valid\":[],\"status_message\":[],\"file_title\":[],\"file_author\":[],\"page_count\":[],\"file_language\":[],\"word_count\":[],\"character_count\":[],\"paragraph_count\":[],\"line_count\":[],\"table_count\":[],\"graphics_count\":[],\"byte_order\":[],\"compression\":[],\"color_space\":[],\"profile_name\":[],\"profile_version\":[],\"orientation\":[],\"color_map\":[],\"image_producer\":[],\"capture_device\":[],\"scanning_software\":[],\"exif_version\":[],\"gps_timestamp\":[],\"latitude\":[],\"longitude\":[],\"character_set\":[],\"markup_basis\":[],\"markup_language\":[],\"bit_depth\":[],\"channels\":[],\"data_format\":[],\"offset\":[],\"frame_rate\":[],\"label\":\"2016-05-27-raspbian-jessie.img.zip\",\"depositor\":\"phb010\",\"arkivo_checksum\":null,\"relative_path\":\"\",\"import_url\":null,\"part_of\":[],\"resource_type\":[],\"title\":[\"2016-05-27-raspbian-jessie.img.zip\"],\"creator\":[\"Hebal, 
Piotr\"],\"contributor\":[],\"description\":[],\"tag\":[],\"rights\":[],\"publisher\":[],\"date_created\":[],\"date_uploaded\":\"2016-08-09T19:40:10.051+00:00\",\"date_modified\":\"2016-08-09T19:40:10.051+00:00\",\"subject\":[],\"language\":[],\"identifier\":[],\"based_near\":[],\"related_url\":[],\"bibliographic_citation\":[],\"source\":[],\"proxy_depositor\":null,\"on_behalf_of\":null,\"abstract\":[],\"acknowledgments\":[],\"grants_and_funding\":[],\"digital_origin\":[],\"mesh\":[],\"lcsh\":[],\"subject_geographic\":[],\"subject_name\":[],\"page_number\":null,\"page_number_actual\":null,\"doi\":[],\"ark\":[],\"original_publisher\":[],\"private_note\":[],\"batch_id\":\"cf85bbfa-8b62-46a6-8e29-e963e08f3ac7\",\"parent_id\":null,\"combined_file_id\":null}</field><field name=\"depositor_ssim\">phb010</field><field name=\"depositor_tesim\">phb010</field><field name=\"title_tesim\">2016-05-27-raspbian-jessie.img.zip</field><field name=\"title_sim\">2016-05-27-raspbian-jessie.img.zip</field><field name=\"creator_tesim\">Hebal, Piotr</field><field name=\"creator_sim\">Hebal, Piotr</field><field name=\"date_uploaded_dtsi\">2016-08-09T19:40:10Z</field><field name=\"date_modified_dtsi\">2016-08-09T19:40:10Z</field><field name=\"isPartOf_ssim\">cf85bbfa-8b62-46a6-8e29-e963e08f3ac7</field><field name=\"label_tesim\">2016-05-27-raspbian-jessie.img.zip</field><field name=\"file_size_is\">4019191808</field><field name=\"digest_ssim\">urn:sha1:51d5e457ead8278c2626f4a544b4d046846a08df</field><field name=\"content_tesim\">http://localhost:8983/fedora/rest/staging/cd/4e/c9/86/cd4ec986-a619-4ab1-bf1f-1f018bebbce9/content</field><field name=\"label_si\">2016-05-27-raspbian-jessie.img.zip</field><field name=\"edit_access_person_ssim\">phb010</field></doc></add>"

Backtrace: /var/www/apps/galter_digital_repo/shared/gems/ruby/2.2.0/gems/rsolr-1.0.13/lib/rsolr/client.rb:284:in `adapt_response'
/var/www/apps/galter_digital_repo/shared/gems/ruby/2.2.0/gems/rsolr-1.0.13/lib/rsolr/client.rb:190:in `execute'
/var/www/apps/galter_digital_repo/shared/gems/ruby/2.2.0/gems/rsolr-1.0.13/lib/rsolr/client.rb:176:in `send_and_receive'
(eval):2:in `post'
/var/www/apps/galter_digital_repo/shared/gems/ruby/2.2.0/gems/rsolr-1.0.13/lib/rsolr/client.rb:82:in `update'
/var/www/apps/galter_digital_repo/shared/gems/ruby/2.2.0/gems/rsolr-1.0.13/lib/rsolr/client.rb:102:in `add'
/var/www/apps/galter_digital_repo/shared/gems/ruby/2.2.0/gems/active-fedora-9.9.1/lib/active_fedora/solr_service.rb:134:in `add'
/var/www/apps/galter_digital_repo/shared/gems/ruby/2.2.0/gems/active-fedora-9.9.1/lib/active_fedora/indexing.rb:32:in `update_index'
/var/www/apps/galter_digital_repo/shared/gems/ruby/2.2.0/gems/active-fedora-9.9.1/lib/active_fedora/indexing.rb:54:in `create_record'
/var/www/apps/galter_digital_repo/shared/gems/ruby/2.2.0/gems/active-fedora-9.9.1/lib/active_fedora/callbacks.rb:237:in `block (2 levels) in create_record'
/var/www/apps/galter_digital_repo/shared/gems/ruby/2.2.0/gems/activesupport-4.2.4/lib/active_support/callbacks.rb:117:in `call'>

Solr log:

ERROR - 2016-08-09 14:40:39.809; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: ERROR: [doc=cd4ec986-a619-4ab1-bf1f-1f018bebbce9] Error adding field 'file_size_is'='4019191808' msg=For input string: "4019191808"
        at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:178)
        at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:78)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:238)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:164)
        at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
        at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:926)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1080)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:692)
        at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
        at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:247)
        at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
        at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:99)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1976)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1476)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:429)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
        at org.eclipse.jetty.server.Server.handle(Server.java:370)
        at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
        at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)
        at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)
        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
        at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
        at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
        at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
        at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NumberFormatException: For input string: "4019191808"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Integer.parseInt(Integer.java:583)
        at java.lang.Integer.parseInt(Integer.java:615)
        at org.apache.solr.schema.TrieField.createField(TrieField.java:600)
        at org.apache.solr.schema.TrieField.createFields(TrieField.java:663)
        at org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:50)
        at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:125)

phebal commented 8 years ago

This works for files just under 2GB (2147483648 bytes) but fails for anything larger, because Solr stores this field as a signed 32-bit integer, whose maximum value is 2147483647 (2^31 - 1). The field is defined in sufia-models-6.6.0/app/services/sufia/generic_file_indexing_service.rb as Solrizer.solr_name('file_size', STORED_INTEGER), which resolves to file_size_is. That name is in turn matched by this dynamic field in the Solr schema.xml: <dynamicField name="*_is" type="int" stored="true" indexed="false" multiValued="false"/>
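As a quick sanity check of the arithmetic, here is a minimal Ruby sketch that tests whether a byte count fits in the range Java's Integer.parseInt accepts (the call that raises in the Solr trace above). The constant and method names are made up for illustration; only the 4019191808 value comes from the log.

```ruby
# Range of a signed 32-bit integer, which is what Java's Integer.parseInt
# (seen in the Solr stack trace) will accept.
INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1 # 2147483647

# Hypothetical helper: true when the byte count could be stored in a
# Solr field of type "int".
def fits_in_solr_int?(bytes)
  bytes.between?(INT32_MIN, INT32_MAX)
end

failing_size = 4_019_191_808 # file_size_is value from the failing document

puts fits_in_solr_int?(INT32_MAX)     # true
puts fits_in_solr_int?(INT32_MAX + 1) # false (exactly 2 GiB)
puts fits_in_solr_int?(failing_size)  # false
```

Any size from 2147483648 bytes upward fails in the same way, so the problem is not specific to files near 4GB.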

The same field is also defined for Collection objects, so it needs to be fixed there as well: a collection can contain multiple datasets whose sizes add up to more than a signed 32-bit integer can hold.

To fix this we need to redefine the field as long (deprecated in later Solr) or trie long, with something like Solrizer.solr_name('file_size', Solrizer::Descriptor.new(:long, :stored)), which generates file_size_lts. We then have to re-index all GenericFile and Collection objects in our repository.
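To make the naming change concrete, here is a rough, self-contained Ruby sketch of how the dynamic field suffix appears to be assembled. It is not Solrizer's own code: the helper and its type table are hypothetical, with the codes inferred from field names visible in the log above (file_size_is, title_tesim, label_si, system_create_dtsi) and from the file_size_lts name mentioned here.

```ruby
# Hypothetical type codes, inferred from field names in this issue.
# This only mirrors the naming pattern; it does not call Solrizer.
TYPE_CODES = {
  integer: 'i',
  long:    'lt',
  string:  's',
  text_en: 'te',
  date:    'dt'
}.freeze

# Builds "<base>_<type code><s if stored><i if indexed><m if multivalued>".
def dynamic_field_name(base, type, stored: false, indexed: false, multivalued: false)
  suffix = TYPE_CODES.fetch(type).dup
  suffix << 's' if stored
  suffix << 'i' if indexed
  suffix << 'm' if multivalued
  "#{base}_#{suffix}"
end

puts dynamic_field_name('file_size', :integer, stored: true) # file_size_is
puts dynamic_field_name('file_size', :long, stored: true)    # file_size_lts
```

Because the Solr field name itself changes from file_size_is to file_size_lts, documents already in the index keep the old field until they are re-indexed, and the schema needs a dynamic field that matches the new suffix.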