Toolchain: CONTENTdm compound PDFs

xing93111 commented 6 years ago

On this page: https://github.com/MarcusBarnes/mik/wiki/Toolchain:-CONTENTdm-compound-PDFs I read that the CdmPhpDocuments class should be used for compound PDFs. However, when I run MIK, I get:

billg@lib10:/data/projects/arca$ ./mik/mik -c ./collections/AUebooks/config.ini
Commencing MIK.
PHP Fatal error:  Uncaught Error: Class 'mik\filegetters\CdmPhpDocuments' not found in /data/projects/arca/mik/mik:170
Stack trace:
#0 {main}
  thrown in /data/projects/arca/mik/mik on line 170

Then I looked in mik/src/filegetters and mik/src/writers and found a class named CdmPdfDocuments. I assumed the documentation had a typo and changed the class name to CdmPdfDocuments. However, it still does not work: the output is corrupted PDFs.

This is the collection: http://digicon.athabascau.ca/cdm/landingpage/collection/AUebooks

The following is my configuration ini file:

; Trying out the compound thing

[CONFIG]
config_id = AUebooks
last_updated_on = "2018-10-11"
last_update_by = "hx"

[FETCHER]
class = Cdm
; The alias of the CONTENTdm collection.
alias = AUebooks
ws_url = "http://deck.cs.athabascau.ca/dmwebservices/index.php?q="
; 'record_key' should always be 'pointer' for CONTENTdm fetchers.
record_key = pointer
temp_directory = "/data/projects/arca/tmp"

[METADATA_PARSER]
class = mods\CdmToMods
alias = AUebooks
ws_url = "http://deck.cs.athabascau.ca/dmwebservices/index.php?q="
; Path to the csv file that contains the CONTENTdm to MODS mappings.
mapping_csv_path = '/data/projects/arca/collections/AUebooks/mapping.csv'
; Include the migrated from uri into your generated metadata (e.g., MODS)
include_migrated_from_uri = "http://digicon.athabascau.ca/cdm/ref/collection/"
repeatable_wrapper_elements[] = extension
repeatable_wrapper_elements[] = name
repeatable_wrapper_elements[] = subject
repeatable_wrapper_elements[] = identifier
repeatable_wrapper_elements[] = titleInfo
repeatable_wrapper_elements[] = title
repeatable_wrapper_elements[] = relatedItem
use_nicknames = true

[FILE_GETTER]
class = CdmPdfDocuments
alias = AUebooks
input_directories[] =
ws_url = "http://deck.cs.athabascau.ca/dmwebservices/index.php?q="
utils_url = "http://deck.cs.athabascau.ca/utils/"
temp_directory = "/data/projects/arca/tmp"

[WRITER]
class = CdmPdfDocuments
alias = AUebooks
output_directory = "/data/projects/arca/collections/AUebooks/output"
metadata_filename =
postwritehooks[] = "php extras/scripts/postwritehooks/move_packages_by_extension.php"
postwritehooks[] = "php extras/scripts/postwritehooks/validate_mods.php"
postwritehooks[] = "php extras/scripts/postwritehooks/object_timer.php"
postwritehooks[] = "php extras/scripts/shutdownhooks/delete_temp_files.php"
; Note: During testing we only generate MODS datastreams. In production, comment this line out.
; datastreams[] = MODS

[MANIPULATORS]
; filegettermanipulators[] = "CdmSingleFile|pdf"
; filegettermanipulators[] = "CdmCompound|Document-PDF"
fetchermanipulators[] = "CdmCompound|Document-PDF"
;metadatamanipulators[] = "FilterModsTopic|subject"
;metadatamanipulators[] = "AddContentdmData"
;metadatamanipulators[] = "AddUuidToMods"
;metadatamanipulators[] = "InsertXmlFromTemplate|null0|/Users/brandon/sfuvault/mik/manipulations/athabasca_manipulations/origininfo.xml"
;metadatamanipulators[] = "InsertXmlFromTemplate|null1|/Users/brandon/sfuvault/mik/manipulations/athabasca_manipulations/physicalDescription.xml"

[LOGGING]
path_to_log = "/data/projects/arca/tmp/mik.log"
path_to_manipulator_log = "/data/projects/arca/tmp/manipulator.log"

bondjimbond commented 6 years ago

Thanks for submitting the issue, @xing93111.

Further detail: If MIK is run instead with the class CdmCompound, compound objects are generated with the directory structure of a Book, except each page is a PDF (instead of a TIFF). These PDFs are OK (not corrupt).

As far as we understand, the CdmPdfDocuments class is supposed to merge these page-level PDFs into a single aggregated PDF. The result is a corrupted PDF.

Is there anything wrong with the configuration? Or is there a flaw in the toolchain?

mjordan commented 6 years ago

I can't see anything wrong with the configuration. This particular toolchain relies on CONTENTdm's internal functionality to merge the PDF pages into a single document. It used to work fine. For example, the PDFs in https://ecuad.arcabc.ca/islandora/object/ecuad%3Acals were generated using it, with this .ini file: https://github.com/MarcusBarnes/mik/blob/master/extras/samples/calendars_config.ini That said, the filegetter has probably not been tested since the major code cleanup that happened after SFU used the toolchain.

The code that fetches the assembled PDF content is here. I suggest dumping the value of the URL generated here and then running it using curl to see whether the PDF it produces is corrupted.
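One quick way to tell a real PDF from some other response is to look at the first bytes of the downloaded file, since PDF files begin with `%PDF-`. The following is a minimal sketch; `is_pdf` is a hypothetical helper name and is not part of MIK.

```shell
# Return success if the file starts with the PDF magic bytes "%PDF-".
# is_pdf is a hypothetical helper name, not part of MIK.
is_pdf() {
    [ "$(head -c 5 "$1")" = "%PDF-" ]
}

# Example usage, where $url is whatever URL the filegetter generated:
# curl -s -o /tmp/test.pdf "$url" && is_pdf /tmp/test.pdf && echo "looks like a PDF"
```

A file that fails this check is better described as "not a PDF" than as a corrupted PDF, which helps narrow down where the problem lies.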

xing93111 commented 6 years ago

The configuration file here uses CdmPhpDocuments, but I don't see such a class in the MIK source code. Where can I find the file?

mjordan commented 6 years ago

@xing93111, sorry, that config file was an early one and predates #223. The configuration should use CdmPdfDocuments in lines 22 and 29.

mjordan commented 6 years ago

... and I've just updated https://github.com/MarcusBarnes/mik/wiki/Toolchain:-CONTENTdm-compound-PDFs. Very sorry about that.

xing93111 commented 6 years ago

I used a text editor to open the generated PDF file and found it is not a PDF at all but an XML file. For example, the following is the content of the generated PDF file related to this object: http://digicon.athabascau.ca/cdm/ref/collection/auarchives/id/499

<?xml version="1.0"?>
<cpd>
    <type>Document</type>
  <page>
    <pagetitle>Page 1</pagetitle>
    <pagefile>485.pdf</pagefile>
    <pageptr>484</pageptr>
  </page>
  <page>
    <pagetitle>Page 2</pagetitle>
    <pagefile>486.pdf</pagefile>
    <pageptr>485</pageptr>
  </page>
  <page>
    <pagetitle>Page 3</pagetitle>
    <pagefile>487.pdf</pagefile>
    <pageptr>486</pageptr>
  </page>
  <page>
    <pagetitle>Page 4</pagetitle>
    <pagefile>488.pdf</pagefile>
    <pageptr>487</pageptr>
  </page>
  <page>
    <pagetitle>Page 5</pagetitle>
    <pagefile>489.pdf</pagefile>
    <pageptr>488</pageptr>
  </page>
  <page>
    <pagetitle>Page 6</pagetitle>
    <pagefile>490.pdf</pagefile>
    <pageptr>489</pageptr>
  </page>
  <page>
    <pagetitle>Page 7</pagetitle>
    <pagefile>491.pdf</pagefile>
    <pageptr>490</pageptr>
  </page>
  <page>
    <pagetitle>Page 8</pagetitle>
    <pagefile>492.pdf</pagefile>
    <pageptr>491</pageptr>
  </page>
  <page>
    <pagetitle>Page 9</pagetitle>
    <pagefile>493.pdf</pagefile>
    <pageptr>492</pageptr>
  </page>
  <page>
    <pagetitle>Page 10</pagetitle>
    <pagefile>494.pdf</pagefile>
    <pageptr>493</pageptr>
  </page>
  <page>
    <pagetitle>Page 11</pagetitle>
    <pagefile>495.pdf</pagefile>
    <pageptr>494</pageptr>
  </page>
  <page>
    <pagetitle>Page 12</pagetitle>
    <pagefile>496.pdf</pagefile>
    <pageptr>495</pageptr>
  </page>
  <page>
    <pagetitle>Page 13</pagetitle>
    <pagefile>497.pdf</pagefile>
    <pageptr>496</pageptr>
  </page>
  <page>
    <pagetitle>Page 14</pagetitle>
    <pagefile>498.pdf</pagefile>
    <pageptr>497</pageptr>
  </page>
  <page>
    <pagetitle>Page 15</pagetitle>
    <pagefile>499.pdf</pagefile>
    <pageptr>498</pageptr>
  </page>
</cpd>

mjordan commented 6 years ago

We need to establish that CONTENTdm still supports the ability to join PDF pages into a single multipage PDF file (it may have changed since this code was written). To do that we need to create a request URL using the code below (from here):

            $get_file_url = $this->utilsUrl .'getdownloaditem/collection/'
                . $this->alias . '/id/' . $pointer . '/type/compoundobject/show/1/cpdtype/document-pdf/filename/'
                . $document_structure['page'][0]['pagefile'] . '/width/0/height/0/mapsto/pdf/filesize/0/title/'
                . urlencode($document_structure['page'][0]['pagetitle']);

and see if we get a PDF from the server. So that would look like:

http://yourcdmutilsurl/getdownloaditem/collection/auarchives/id/499/type/compoundobject/show/1/cpdtype/document-pdf/filename/485.pdf/width/0/height/0/mapsto/pdf/filesize/0/title/Page%201

If you use curl to get that URL, what does the resulting file look like?
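For testing outside MIK, the same URL can be assembled in the shell. This is only a sketch that mirrors the path segments in the PHP snippet above; `build_download_url` is a hypothetical name, and it assumes that spaces are the only characters in the page title that need escaping. It encodes them as %20, as in the example URL, whereas PHP's urlencode() would emit + instead.

```shell
# Build the getdownloaditem URL for the first page of a compound object.
# build_download_url is a hypothetical name; the path segments are copied
# from the PHP snippet quoted in this thread.
# Arguments: utils_url (with trailing slash), alias, pointer,
#            first page's pagefile, first page's pagetitle.
build_download_url() {
    # Assumption: only spaces need escaping in the page title.
    encoded_title=$(printf '%s' "$5" | sed 's/ /%20/g')
    printf '%s\n' "${1}getdownloaditem/collection/${2}/id/${3}/type/compoundobject/show/1/cpdtype/document-pdf/filename/${4}/width/0/height/0/mapsto/pdf/filesize/0/title/${encoded_title}"
}

# Example usage:
# curl -s -o /tmp/499.pdf "$(build_download_url "http://yourcdmutilsurl/" auarchives 499 485.pdf "Page 1")"
```

The first argument corresponds to the utils_url setting in the [FILE_GETTER] section, so it should end with a slash.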

mjordan commented 6 years ago

If you don't mind sharing your CONTENTdm API URL with me I can take a look.

xing93111 commented 6 years ago

@bondjimbond has the URL but it requires a VPN connection. URL: http://deck.cs.athabascau.ca/dmwebservices/index.php?q=

xing93111 commented 6 years ago

@mjordan Here is the output:

billg@lib10:~$ curl http://digicon.athabascau.ca/getdownloaditem/collection/auarchives/id/499/type/compoundobject/show/1/cpdtype/document-pdf/filename/485.pdf/width/0/height/0/mapsto/pdf/filesize/0/title/Page%201
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" class="no-js">
<!-- CONTENTdm Version 6.8.0.412s/6.8.0.761w (c) OCLC 2011-2018. All Rights Reserved. //-->
<head>
  <meta name="robots" content="noindex,nofollow,noarchive" />
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

  <link rel="shortcut icon" type="image/x-icon" href="/ui/custom/default/collection/default/images/favicon.ico?version=1404943627" />

    <title>CONTENTdm Title</title>

  <script type="text/javascript">
    var cdmHttps = 'off';
    var cdmInsecureWebsitePort = '';
    var cdmSecureWebsitePort = '';
  </script>

  <link rel="stylesheet" type="text/css" href="/ui/custom/default/collection/default/css/main.css?version=1529334550" />
  <link type="text/css" href="/utils/getstaticcontent/file/js~bt~jquery.bt.css/type/stylesheet" rel="stylesheet" />
  <link type="text/css" href="/utils/getstaticcontent/file/js~skins~tango~skin.css/type/stylesheet" rel="stylesheet" />
  <link type="text/css" href="/utils/getstaticcontent/file/js~skins~cdm~skin.css/version/1401946701/type/stylesheet" rel="stylesheet" />

  <style>
    .line_breaker, pre {
        white-space: pre;
        white-space: pre-wrap;
        white-space: pre-line;
        white-space: -pre-wrap;
        white-space: -o-pre-wrap;
        white-space: -moz-pre-wrap;
        white-space: -hp-pre-wrap;
        word-wrap: break-word;
    } 
  </style>    

  <!-- NEW JQUERY and UI -->
  <script type="text/javascript" src="/utils/getstaticcontent/file/js~jquery_1.7.2~jquery-1.7.2.js/type/javascript"></script>
  <script type="text/javascript" src="/utils/getstaticcontent/file/js~jquery_1.7.2~jquery-ui-1.8.20.js/type/javascript"></script>
  <script type="text/javascript" src="/utils/getstaticcontent/file/js~jquery-ui-togglebox.js/type/javascript"></script>
  <script type="text/javascript" src="/utils/getstaticcontent/file/js~jquery.hoverIntent.minified.js/type/javascript"></script>
  <script type="text/javascript" src="/utils/getstaticcontent/file/js~jquery.scrollTo-min.js/type/javascript"></script>
  <script type="text/javascript" src="/utils/getstaticcontent/file/js~default.js/version/1401946702/type/javascript"></script>
  <script type="text/javascript" src="/utils/getstaticcontent/file/js~modernizr-latest.js/type/javascript"></script>
  <!--[if lt IE 10]>
        <script type="text/javascript" src="/utils/getstaticcontent/file/js~cdmOldInternetExplorerChecker.js/type/javascript"></script>
    <![endif]-->

  <script type="text/javascript" src="/utils/getstaticcontent/file/js~bt~jquery.bt.min.js/type/javascript"></script>
  <script type="text/javascript" src="/utils/getstaticcontent/file/js~quickview.js/type/javascript"></script>

    <!--[if IE]>
        <script type="text/javascript" src="/ui/cdm/default/collection/default/js/excanvas.compiled.js"></script>
    <![endif]-->
    <!--[if IE 7]>
        <link href="/ui/cdm/default/collection/default/css/ie7.css" type="text/css" rel="stylesheet" />
    <![endif]-->

  <script type="text/javascript">
    (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
    (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
    m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
    })(window,document,'script','//www.google-analytics.com/analytics.js','ga');

    ga('create', 'UA-6471153-5');
    ga('send', 'pageview');
      </script>
  <script type="text/javascript" src="/ui/cdm/default/collection/default/js/cdm_ga.js"></script>

</head>

<body>

  <a name="top"></a>

<!-- HEADER --> 
<div id="headerWrapper" tabindex="1000">
    <p><img src="/ui/custom/default/collection/default/images/digiport_banner6.jpg" alt="" /></p>
    <span class="clear"></span>
    </div>

<!--  NAV_TOP -->
    <div id="nav_top">
        <div id="nav_top_left">
            <ul class="nav">
                  <li class="nav_li">
            <a tabindex="1001" id="nav_top_left_first_link" href="http://digiport.athabascau.ca"  >
              <div class="nav_top_left_text_container">Home</div>
            </a>
          </li>
                    <li class="nav_li">
            <a tabindex="1002"  href="/cdm/"  >
              <div class="nav_top_left_text_container">Browse</div>
            </a>
          </li>
                    <li class="nav_li">
            <a tabindex="1003"  href="http://digicon.athabascau.ca/cdm4/help.php"  >
              <div class="nav_top_left_text_container">Help</div>
            </a>
          </li>
                    <li class="nav_li">
            <a tabindex="1004"  href="http://digiport.athabascau.ca/copyright.html"  >
              <div class="nav_top_left_text_container">Copyright</div>
            </a>
          </li>
                    <li class="nav_li">
            <a tabindex="1005"  href="http://library.athabascau.ca"  >
              <div class="nav_top_left_text_container">Athabasca University Library</div>
            </a>
          </li>
                    <li class="nav_li">
            <a tabindex="1006"  href="http://digiport.athabascau.ca/"  >
              <div class="nav_top_left_text_container">Digitization Portal</div>
            </a>
          </li>

            </ul>
        </div>

        <div id="nav_top_right">
            <ul class="nav">
                            <li class="nav_li_right_1">
                    <span class=""><!--<a href="javascript:session_check(fx);" id="debug_session_check">Session Check</a>&nbsp;-&nbsp;<a href="javascript:session_auth();" id="debug_session_auth">Session Auth</a>&nbsp;-&nbsp;<a href="javascript:session_deauth();" id="debug_session_de-auth">Session De-Auth</a>&nbsp;-&nbsp;-->

                          <span class="currentUser" id="currentUser"></span><a tabindex="1007" id="login_link" href="http://digicon.athabascau.ca/login/" data-analytics='{"category":"navigation","action":"click","label":"Log in link"}'>Log in</a>
                                                                                    </span>
                </li>

                <li class="nav_li_right_1 nav_top_right_divider">|</li>

                <li class="nav_li_right_1">
                    <span class="icon_10 icon_nav_top_right ui-icon-help cdmHelpLink"></span><a tabindex="1008" class="cdmHelpLink" href="javascript:;" data-analytics='{"category":"navigation","action":"click","label":"Help link"}'><b>Help</b></a>
                </li>
                  <li class="nav_li_right_1 nav_top_right_divider">|</li>

          <li class="nav_li_right_1">
            <div id="nav_top_right_language_dd_link">
              <a tabindex="1009" href="javascript:;" id="nav_top_right_language_dd_link_text" data-analytics='{"category":"navigation","action":"open","label":"language selection menu"}'>
              English              </a><span class="icon_10 icon_nav_top_right ui-icon-triangle-1-s"></span>
            </div>
            <br />
            <div id="nav_top_right_language_dd_container">
              <div id="nav_top_right_language_dd_content">
                                  <div tabindex="1010" class="language_option cdm_selected_language" lang="en_US" data-analytics='{"category":"navigation","action":"click","label":"language: English"}'>English</div>
                                    <div tabindex="1011" class="language_option " lang="de" data-analytics='{"category":"navigation","action":"click","label":"language: Deutsch"}'>Deutsch</div>
                                    <div tabindex="1012" class="language_option " lang="es" data-analytics='{"category":"navigation","action":"click","label":"language: Español"}'>Español</div>
                                    <div tabindex="1013" class="language_option " lang="en_PIRATE" data-analytics='{"category":"navigation","action":"click","label":"language: Pirate English"}'>Pirate English</div>
                                    <div tabindex="1014" class="language_option " lang="ko" data-analytics='{"category":"navigation","action":"click","label":"language: 한국어 Korean"}'>한국어 Korean</div>
                                    <div tabindex="1015" class="language_option " lang="fr" data-analytics='{"category":"navigation","action":"click","label":"language: Français"}'>Français</div>
                                </div>
              <span class="clear"></span>
            </div>
                            </li>
            </ul>
        </div>
    </div>

<!-- BEGIN TOP CONTENT -->
    <div id="top_content">
        <div style="height:400px;width:500px;margin:0 auto;" valign="top">
  <div id="cdm_error" style="height:24px;width:500px;" class="float_left spacePad10 spaceMar30T ui-state-error ui-corner-all">
    <span class="icon_10 ui-icon-alert ui-icon-alert-cdmerror"></span>
    404: Page not found  </div>
</div>  </div>
<!-- END TOP CONTENT -->

<!-- FOOTER -->
  <span class="clear"></span>
  <div id="cdmFooterWrapper" class="spaceMar20T">
    <div id="nav_footer">
      <div id="nav_footer_left">
        <ul class="nav">
                      <li class="nav_footer_li"><a href="/cdm/">Home</a></li>
                              <li class="nav_footer_left_divider">|</li>
                              <li class="nav_footer_li"><a href="/cdm/about">About</a></li>
                              <li class="nav_footer_left_divider">|</li>
                              <li class="nav_footer_li"><a href="mailto:digi@athabascau.ca">Contact us</a></li>
                      </ul>
      </div>
      <div id="nav_footer_right"><ul class="nav">
        <li class="nav_footer_li"><a href="http://www.contentdm.org/" data-analytics='{"category":"navigation","action":"click","label":"Powered by CONTENTdm&reg; link"}'>Powered by CONTENTdm&reg;</a></li></ul>
      </div>
      <br /><br />
    </div>
    <span class="clear"></span>
  </div>

    <div id="login_dialog" title="Login" dialog_name="login_dialog"></div>

  <span class="clear"></span>
    <div id="content_footer"></div>

  <!-- language fields -->
  <input type="hidden" id="cdm_language_and" value="and" />
  <input type="hidden" id="cdm_language_or" value="or" />
  <input type="hidden" id="cdm_language_in" value="in" />
  <input type="hidden" id="cdm_language_advancedsearch" value="Advanced Search" />
  <input type="hidden" id="cdm_language_closeadvancedsearch" value="Close Advanced Search" />
  <input type="hidden" id="cdm_language_allofthewords" value="All of the words" />
  <input type="hidden" id="cdm_language_anyofthewords" value="Any of the words" />
  <input type="hidden" id="cdm_language_noneofthewords" value="None of the words" />
  <input type="hidden" id="cdm_language_theexactphrase" value="The exact phrase" />
  <input type="hidden" id="cdm_language_allfields" value="All fields" />
  <input type="hidden" id="cdm_language_error_enterAWordOrPhrase" value="Enter a word or phrase" />
  <input type="hidden" id="cdm_language_addorremovecollections" value="Add or remove collections" />
  <input type="hidden" id="cdm_language_limitsearchtospecificcollections" value="Limit search to specific collections" />
  <input type="hidden" id="cdm_language_failedtoretrieveitem" value="Failed to retrieve the item." />
  <input type="hidden" id="cdm_language_therewasaproblemrefreshingtheimage" value="therewasaproblemrefreshingtheimage" />
  <input type="hidden" id="cdm_language_close" value="Close" />
  <input type="hidden" id="cdm_language_login" value="Log in" />
  <input type="hidden" id="cdm_language_logout" value="Log out" />
  <input type="hidden" id="cdm_language_username" value="User Name" />
  <input type="hidden" id="cdm_language_password" value="Password" />
  <input type="hidden" id="cdm_language_cancel" value="Cancel" />
  <input type="hidden" id="cdm_language_ok" value="OK" />
  <input type="hidden" id="cdm_language_authenticating" value="Authenticating" />
  <input type="hidden" id="cdm_language_loading" value="loading..." />
  <input type="hidden" id="cdm_language_allCollections" value="All Collections" />
  <input type="hidden" id="cdm_language_remove" value="remove" />
  <input type="hidden" id="cdm_language_plus" value="Plus" />
  <input type="hidden" id="cdm_language_more" value="more" />
  <input type="hidden" id="cdm_language_foundindocument" value="found in document" />
  <input type="hidden" id="cdm_language_for" value="for" />

  <input type="hidden" id="cdm_language_error_nousernameentered" value="Please enter a user name." />
  <input type="hidden" id="cdm_language_error_nopasswordentered" value="Please enter a password" />
  <input type="hidden" id="cdm_language_error_authenticationfailed" value="Authentication Failed\nThe user name and/or password is not recognized.\nPlease check the spelling and try again." />
  <!-- end language fields -->

  </body>
</html>

mjordan commented 6 years ago

You need the 'utils' subdirectory. Try:

curl http://digicon.athabascau.ca/utils/getdownloaditem/collection/auarchives/id/499/type/compoundobject/show/1/cpdtype/document-pdf/filename/485.pdf/width/0/height/0/mapsto/pdf/filesize/0/title/Page%201

xing93111 commented 6 years ago

This is the response:

billg@lib10:~$ curl http://digicon.athabascau.ca/utils/getdownloaditem/collection/auarchives/id/499/type/compoundobject/show/1/cpdtype/document-pdf/filename/485.pdf/width/0/height/0/mapsto/pdf/filesize/0/title/Page%201
<?xml version="1.0"?>
<cpd>
    <type>Document</type>
  <page>
    <pagetitle>Page 1</pagetitle>
    <pagefile>485.pdf</pagefile>
    <pageptr>484</pageptr>
  </page>
  <page>
    <pagetitle>Page 2</pagetitle>
    <pagefile>486.pdf</pagefile>
    <pageptr>485</pageptr>
  </page>
  <page>
    <pagetitle>Page 3</pagetitle>
    <pagefile>487.pdf</pagefile>
    <pageptr>486</pageptr>
  </page>
  <page>
    <pagetitle>Page 4</pagetitle>
    <pagefile>488.pdf</pagefile>
    <pageptr>487</pageptr>
  </page>
  <page>
    <pagetitle>Page 5</pagetitle>
    <pagefile>489.pdf</pagefile>
    <pageptr>488</pageptr>
  </page>
  <page>
    <pagetitle>Page 6</pagetitle>
    <pagefile>490.pdf</pagefile>
    <pageptr>489</pageptr>
  </page>
  <page>
    <pagetitle>Page 7</pagetitle>
    <pagefile>491.pdf</pagefile>
    <pageptr>490</pageptr>
  </page>
  <page>
    <pagetitle>Page 8</pagetitle>
    <pagefile>492.pdf</pagefile>
    <pageptr>491</pageptr>
  </page>
  <page>
    <pagetitle>Page 9</pagetitle>
    <pagefile>493.pdf</pagefile>
    <pageptr>492</pageptr>
  </page>
  <page>
    <pagetitle>Page 10</pagetitle>
    <pagefile>494.pdf</pagefile>
    <pageptr>493</pageptr>
  </page>
  <page>
    <pagetitle>Page 11</pagetitle>
    <pagefile>495.pdf</pagefile>
    <pageptr>494</pageptr>
  </page>
  <page>
    <pagetitle>Page 12</pagetitle>
    <pagefile>496.pdf</pagefile>
    <pageptr>495</pageptr>
  </page>
  <page>
    <pagetitle>Page 13</pagetitle>
    <pagefile>497.pdf</pagefile>
    <pageptr>496</pageptr>
  </page>
  <page>
    <pagetitle>Page 14</pagetitle>
    <pagefile>498.pdf</pagefile>
    <pageptr>497</pageptr>
  </page>
  <page>
    <pagetitle>Page 15</pagetitle>
    <pagefile>499.pdf</pagefile>
    <pageptr>498</pageptr>
  </page>
</cpd>

mjordan commented 6 years ago

At http://digicon.athabascau.ca/cdm/ref/collection/auarchives/id/499, if I wanted to download the entire document as a single PDF, how would I do that? I don't see a link that will allow me to do that. Is there an admin option that turns off that feature, and if so, do you have it turned off?

xing93111 commented 6 years ago

I don't see a button for downloading the entire compound object as a single PDF file, and I can't find a backend option to turn it on or off. However, this object has a download link: http://digicon.athabascau.ca/cdm/ref/collection/auriver/id/454. But I think it is a single object rather than a compound one.

mjordan commented 6 years ago

Correct, that is a single-file object, not a compound.

xing93111 commented 6 years ago

I think the manipulator has some problems. If I configure it like:

fetchermanipulators[] = "CdmCompound|Document-PDF"

It does not work because the output of the MIK is:

Commencing MIK.
Filtering 2 records through the CdmCompound fetcher manipulator.
==========================================================================================> 100%
Creating 0 Islandora ingest packages. Please be patient.

It just filtered out the two records in the collection. Then, I changed the manipulator like this:

fetchermanipulators[] = "CdmCompound|Document"

because I found the object types are

65,compound,Document
586,compound,Document

It does work but again I get corrupted PDF files because they are indeed XML files.

So I think the manipulators section on this page: https://github.com/MarcusBarnes/mik/wiki/Toolchain:-CONTENTdm-compound-PDFs should not be restricted to

fetchermanipulators[] = "CdmCompound|Document-PDF"
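
The behaviour above is consistent with a filter that keeps only records whose type matches the configured value exactly. That is an assumption based on the output, not something confirmed from the MIK source. A rough sketch of such a filter over the pointer,compound,type listing shown above, with `filter_by_type` as a hypothetical helper:

```shell
# Print the records whose third field exactly matches the requested type.
# Assumed input format: pointer,compound,type (as in the listing above).
# filter_by_type is a hypothetical helper, not MIK's CdmCompound manipulator.
filter_by_type() {
    awk -F, -v want="$1" '$3 == want'
}

# Example usage:
# printf '65,compound,Document\n586,compound,Document\n' | filter_by_type Document-PDF | wc -l
# printf '65,compound,Document\n586,compound,Document\n' | filter_by_type Document | wc -l
```

With the two records above, an exact match on Document-PDF keeps nothing, while Document keeps both.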

mjordan commented 6 years ago

@xing93111 can you test compound PDF documents with MIK as it stood prior to #223 and the work that brought MIK in line with coding standards? Try commit 9c6b8c537f477fd82f20f3c6ba2563fcd30bd7f5. The compound PDF toolchain code at that commit is essentially how it stood when SFU migrated its compound PDFs (as far as the compound PDF document code anyway). You will need to adjust your .ini file to use CdmPhpDocuments and not CdmPdfDocuments (which is what #223 fixed).

If this works for you, then there is a problem with the current MIK code that we need to fix; if it doesn't, then we need to confirm that your CONTENTdm can produce a single multipage PDF from single-page PDFs (which we have not done yet) and go from there.

@MarcusBarnes does this seem like a reasonable way of narrowing down the problem?

Does anyone know of another CONTENTdm instance that we can test against?

xing93111 commented 6 years ago

@mjordan I don't see a class named CdmPhpDocuments on this page: https://github.com/MarcusBarnes/mik/tree/9c6b8c537f477fd82f20f3c6ba2563fcd30bd7f5/src/filegetters. I assume this is the commit you would like me to check out. Without that class, the command will definitely fail.

bondjimbond commented 6 years ago

@xing93111 You're looking at the current code rather than the code from the earlier commit. In your MIK directory:

git checkout -b CdmPhpDocuments

Then you'll need to git reset --hard 9c6b8c5

This will take you to the earlier commit... Look in src/filegetters to see what the filename is.
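
As an alternative to switching the working tree, git can list the files that exist at a given commit directly. A small sketch, run from the root of the MIK checkout; `list_filegetters_at` is a hypothetical helper name:

```shell
# List the files under src/filegetters/ as they exist at a given commit,
# without changing the working tree. Run from the repository root.
# list_filegetters_at is a hypothetical helper name.
list_filegetters_at() {
    git ls-tree --name-only "$1" src/filegetters/
}

# Example usage:
# list_filegetters_at 9c6b8c5
```

This makes it easy to compare several candidate commits before resetting a branch to any of them.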

xing93111 commented 6 years ago

I still don't see the class. Here are my commands:

billg@lib10:/data4/test$ git clone https://github.com/MarcusBarnes/mik.git
Cloning into 'mik'...
remote: Enumerating objects: 18, done.
remote: Counting objects: 100% (18/18), done.
remote: Compressing objects: 100% (14/14), done.
remote: Total 5254 (delta 6), reused 10 (delta 4), pack-reused 5236
Receiving objects: 100% (5254/5254), 1.47 MiB | 0 bytes/s, done.
Resolving deltas: 100% (3468/3468), done.
Checking connectivity... done.
billg@lib10:/data4/test$ ls
mik
billg@lib10:/data4/test$ cd mik
billg@lib10:/data4/test/mik$ ls
composer.json  CONTRIBUTING.md  LICENSE  phpunit.xml.dist  src
composer.lock  extras           mik      README.md         tests
billg@lib10:/data4/test/mik$ git checkout -b CdmPhpDocuments
Switched to a new branch 'CdmPhpDocuments'
billg@lib10:/data4/test/mik$ 
billg@lib10:/data4/test/mik$ git reset --hard 9c6b8c5
HEAD is now at 9c6b8c5 Work on #397.
billg@lib10:/data4/test/mik$ ls
composer.json  composer.lock  CONTRIBUTING.md  extras  LICENSE  mik  README.md  src  tests
billg@lib10:/data4/test/mik$ cd src
billg@lib10:/data4/test/mik/src$ ls
config               fetchers                filemanipulators      metadataparsers
exceptions           filegettermanipulators  inputvalidators       utilities
fetchermanipulators  filegetters             metadatamanipulators  writers
billg@lib10:/data4/test/mik/src$ cd filegetters
billg@lib10:/data4/test/mik/src/filegetters$ ls
CdmBooks.php       CdmPdfDocuments.php  CsvCompound.php    FileGetter.php          OaipmhXpath.php
CdmCompound.php    CdmSingleFile.php    CsvNewspapers.php  OaipmhIslandoraObj.php
CdmNewspapers.php  CsvBooks.php         CsvSingleFile.php  OaipmhOjsPdf.php
billg@lib10:/data4/test/mik/src/filegetters$

Anything wrong?

mjordan commented 6 years ago

I gave you the wrong commit hash. Try b6b8f0a280509cdae4ff11324c99ef14ffad8781, that puts the old filegetter back.

xing93111 commented 6 years ago

@mjordan It seems the vendor folder is missing in this version of the code. Here is the output:

billg@lib10:/data4/projects/arca$ ./mik/mik -c ./collections/AUebooks/config.ini
PHP Warning:  require(vendor/autoload.php): failed to open stream: No such file or directory in /data4/projects/arca/mik/mik on line 10
PHP Fatal error:  require(): Failed opening required 'vendor/autoload.php' (include_path='.:/usr/share/php') in /data4/projects/arca/mik/mik on line 10
billg@lib10:/data4/projects/arca$ cd mik
billg@lib10:/data4/projects/arca/mik$ ls
composer.json  CONTRIBUTING.md  LICENSE  README_DEV.md  src
composer.lock  extras           mik      README.md      tests

mjordan commented 6 years ago

When I check that commit out, vendor is still there. Did you try running composer update after you checked out b6b8f0a280509cdae4ff11324c99ef14ffad8781?

MarcusBarnes commented 6 years ago

It is also good to run composer dump-autoload so that any new classes are available via autoloading (after having run composer update to generate the vendor folder with any dependencies, etc.).

xing93111 commented 6 years ago

@MarcusBarnes got the vendor folder, but now:

billg@lib10:/data4/projects/arca$ ./mik/mik -c ./collections/AUebooks/config.ini
PHP Fatal error:  Uncaught Error: Class 'Commando\Command' not found in /data4/projects/arca/mik/mik:20
Stack trace:
#0 {main}
  thrown in /data4/projects/arca/mik/mik on line 20

mjordan commented 6 years ago

Do you still get that error after running composer dump-autoload?

xing93111 commented 6 years ago

After running composer dump-autoload, I have the vendor folder, but I still get the above error: Commando\Command not found.

mjordan commented 6 years ago

What do you see if you run ls vendor/nategood/commando/src/Commando/ from within the mik directory?

MarcusBarnes commented 6 years ago

@xing93111 Following up on @mjordan's comment, double check whether it's in your composer.json file (it might have been added after the commit that we're working from). If it's not there, overwrite your existing composer.json file with a copy of the latest composer.json file and then run composer install, then composer dump-autoload.
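
A quick way to do that check from the mik directory is a plain text search of composer.json for the package name seen in the vendor path above. This is a sketch only; `has_composer_package` is a hypothetical helper and it does not parse the JSON.

```shell
# Return success if the given composer.json mentions the given package name.
# has_composer_package is a hypothetical helper; it does a fixed-string
# text match on the quoted name rather than parsing the JSON.
has_composer_package() {
    grep -qF "\"$2\"" "$1"
}

# Example usage:
# has_composer_package composer.json nategood/commando || echo "commando is not listed"
```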

xing93111 commented 6 years ago

The command line works now @MarcusBarnes. However, it still outputs corrupted PDFs as I mentioned above: https://github.com/MarcusBarnes/mik/issues/492#issuecomment-431150182

mjordan commented 6 years ago

@xing93111 we could go back further in time until it works, but I am not convinced that your CONTENTdm is the same as SFU's was. Is there any way we can confirm that it can in fact allow a user to download a single multipage PDF from a compound PDF object?

bondjimbond commented 6 years ago

@xing93111 More specifically... are there any objects where this is the case? And/or, can you try creating a compound PDF object in your CDM with the option to download a full one?

If this is just a problem with the Athabasca CDM instance, it may be more productive to close the issue and just use the Automator scripts we discussed to convert the PDF pages to TIFFs.

mjordan commented 6 years ago

"And/or, can you try creating a compound PDF object in your CDM with the option to download a full one?"

Sorry, that is exactly what I think we need to confirm before looking closer at the MIK code. If we can confirm that the Athabasca CDM instance can produce multipage PDFs from compound single-page PDF documents, we will have narrowed the issue down to the MIK code, which we can then fix.
