
CSV Import+ (plugin for Omeka)

CSV Import+ is a plugin for Omeka that allows users to import or update items from a simple CSV (comma separated values) file, and then map the CSV column data to multiple elements, files, and/or tags. Each row in the file represents metadata for a single item. This plugin is useful for exporting data from one database and importing that data into an Omeka site.

This fork adds some improvements:

• choice of separators (column, element, tag, and file delimiters);
• import of metadata of files and collections;
• update of any record;
• import of files one by one, to avoid overloading the server.

It can be installed simultaneously with the upstream CSV Import (http://omeka.org/codex/Plugins/CsvImport).

The similar tool XML Import can be useful too, depending on your types of data.

Installation

Uncompress the files and rename the plugin folder "CsvImportPlus".

Then install it like any other Omeka plugin and follow the config instructions.

If you want to import local files from the file system, the allowed base path (or one of its parents) must be defined beforehand in the plugin's "security.ini" file (see the note about local paths below).

Set the proper settings in config.ini like so:

plugins.CsvImportPlus.columnDelimiter = ","
plugins.CsvImportPlus.enclosure = '"'
plugins.CsvImportPlus.memoryLimit = "128M"
plugins.CsvImportPlus.requiredExtension = "txt"
plugins.CsvImportPlus.requiredMimeType = "text/csv"
plugins.CsvImportPlus.maxFileSize = "10M"
plugins.CsvImportPlus.fileDestination = "/tmp"
plugins.CsvImportPlus.batchSize = "1000"

All of the above settings are optional. If not given, CSV Import+ uses the following default values:

memoryLimit = current script limit
requiredExtension = "txt" or "csv"
requiredMimeType = "text/csv"
maxFileSize = current system upload limit
fileDestination = current system temporary dir (via sys_get_temp_dir())
batchSize = 0 (no batching)

Set a high memory limit to avoid memory allocation issues during imports. Examples include 128M, 1G, and -1 (unlimited). This sets PHP's memory_limit setting directly; see PHP's documentation for more information on formatting this value. Be advised that many web hosts set a maximum memory limit, so this setting may be ignored if it exceeds the maximum allowable limit. Check with your web host for more information.

Note that 'maxFileSize' does not affect the 'post_max_size' or 'upload_max_filesize' directives set in 'php.ini'. A maxFileSize that exceeds either of them will still result in errors that prevent the file upload.
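
For example, with the sample maxFileSize of "10M" above, php.ini must allow uploads at least that large; the values below are only illustrative:

post_max_size = 16M
upload_max_filesize = 16M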

'batchSize': a setting for advanced users. If your long-running imports use too much memory or otherwise hog system resources, set this value to split the import into multiple jobs, based on the number of CSV rows to process per job.

For example, if you have a CSV with 150000 rows, setting a batchSize of 5000 would cause the import to be split up over 30 separate jobs. Note that these jobs run sequentially based on the results of prior jobs, meaning that the import cannot be parallelized. The first job will import 5000 rows and then spawn the next job, and so on until the import is completed.
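
For this example, the corresponding line in config.ini would be:

plugins.CsvImportPlus.batchSize = "5000"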

Important

On some servers, in particular shared hosts, an option should be changed in the application/config/config.ini file from:

jobs.dispatcher.longRunning = "Omeka_Job_Dispatcher_Adapter_BackgroundProcess"

to:

jobs.dispatcher.longRunning = "Omeka_Job_Dispatcher_Adapter_Synchronous"

Note that this change may limit the number of rows imported per job. If so, you can increase the time limit for processes in the server or PHP configuration.
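
For example, the time limit can be raised in php.ini (the value is only illustrative):

max_execution_time = 300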

Note about local paths

For security reasons, importing files from the local file system is forbidden by default. Nevertheless, it can be allowed for a specific path: this allowed base path (or one of its parents) must be defined in the plugin's "security.ini" file.
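
As a sketch only, the entries look like the following; the key names here are assumptions, so check the "security.ini" file shipped with the plugin for the real ones:

; Allow imports from the local file system (illustrative keys).
local_folders.allow = "1"
; Base path under which local files may be read.
local_folders.base_path = "/home/user/csv_files"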

Examples

Since release 2.2-full, only the "Manage" format is available. Some tests are incompatible with it, so change their headers to process them. Generally, setting "Dublin Core : Title" or "Dublin Core : Identifier" as the required identifier is enough to run a test.

Fifteen example csv files are available in the csv_files folder. There are so many because a new one was built for each new feature; the last ones use all of the features.

Some files may be updated with a second file to get full data; this is simply to provide more examples.

They use free images from Wikipedia, so import speed depends on the connection.

The first three tests use the same items from Wikipedia, so remove them between tests.
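
As a purely illustrative sketch (not one of the bundled files), a basic item csv like the first test below might look like this, with "," as the column delimiter and the quotation mark as the enclosure; all values are hypothetical:

Title,Creator,Tags
"Don Quixote","Miguel de Cervantes","novel"
"Les Misérables","Victor Hugo","novel"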

  1. test.csv

    A basic list of three books with images from Wikipedia, with non-Dublin Core tags. To try it, you just need to check Item metadata and to use the default delimiters: comma "," as the column delimiter and quotation mark " as the enclosure. The identifier field is "Dublin Core : Title" and extra data are set to "Perhaps", so a manual mapping will be done, where the special value "Identifier" is set to the title.

  2. test_automap.csv

    The same list with some Dublin Core attributes in order to automap the columns with the Omeka fields. To try it, use the same parameters as for the previous file. The plugin will automatically map columns whose names match the field names in the drop-down list.

  3. test_special_delimiters.csv

    A file to try any delimiters. Special delimiters of this file are:

    • Column delimiter: tabulation
    • Enclosure: quotation mark "
    • Element delimiter: custom ^^ (used by Csv Report)
    • Tag delimiter: double space
    • File delimiter: semi-colon

    Extra data can be set to "Perhaps". If set to "No", then the second step will be skipped.

  4. test_files_metadata_full.csv and test_files_metadata_update.csv

    A file used to import metadata of files. The first is self-contained, so the previous files don't need to be imported. To try the second, you should first import items with any of the previous csv files. Then select tabulation as the column delimiter, no enclosure, "|" as the element, file, and tag delimiters, and Dublin Core:Identifier as the default identifier. You can then import it manually or automatically. If manually, set "Contains extra data" to "Perhaps", then map the special value "Identifier field" to the identifier field column and "Identifier" to the filename column. The file contains no actual extra data.

  5. test_mixed_records.csv

    A file used to show how to import metadata of items and files simultaneously, and how to import files one by one to avoid server overload or timeouts. To try it, check Mixed records in the form and choose tabulation as the column delimiter, no enclosure, and "|" as the element, file, and tag delimiters.

    Note: in the csv file, the file rows must always come after the item to which they are attached, otherwise they are skipped.

    This file is not compatible with release 2.2 for an automatic import.

  6. test_mixed_records_update.csv

    A file used to show how to update metadata of items and files. To try it, first import test_mixed_records.csv above, then choose this file and check Update records in the form.

    This file is not compatible with release 2.2 for an automatic import.

  7. test_collection.csv

    Add two items to a new collection. A created collection is not removed if an error occurs during the import. Parameters are tabulation as the column delimiter, no enclosure, and "|" as the element, file, and tag delimiters. The identifier is "Dublin Core : Identifier".

  8. test_collection_update.csv

    Update metadata of a collection.

    Parameters are the same as in the previous file.

  9. test_collection_update_bis.csv

    Insert a new item in a collection selected in the form.

    Parameters are the same as in the previous file, but set a default collection.

  10. test_extra_data.csv

    Shows the import of extra data that are not managed as elements, but as data in a specific table. The mechanism processes the data as a post, so it can use the default hooks, in particular after_save_item.

    To try this test file, install Geolocation first. Set tabulation as the column delimiter, no enclosure, and "|" as the element, file, and tag delimiters. Set the required identifier to "Dublin Core : Identifier" and the option "Contains extra data" to "Yes" (or "Perhaps" to check the mapping manually). Use the update file below to get full data for all items.

    The last row of this file shows an example of importing one item with its attached files on a single row (unused columns, in particular Identifier and Record Type, can be removed). This simpler format can be used if you don't need file metadata or if you don't have many files attached to each item.

  11. test_extra_data_manual.csv

    This file has the same content as the previous one, but the headers are not set, so you should set "Contains extra data" to "Perhaps" in order to map the columns to the Omeka metadata. Note that extra data must keep their original headers.

  12. test_extra_data_update.csv

    Shows the update of extra data. To test it, import one of the two previous files first, then this one, with the same parameters.

  13. test_manage_one.csv

  14. test_manage_two.csv

  15. test_manage_script.csv

    These files show how to use the "Manage" process. They don't use a specific column, but any field, so each row is independent of the others. The first imports some data; the second, similar one contains new and updated content, because there are errors in the first. The third is like a script, where rows are processed one by one, with a different action for each row.

    To try them, you may install Geolocation, and use tabulation as the column delimiter, no enclosure, "|" as the element, file, and tag delimiters, and Dublin Core:Identifier as the identifier field. If you import them manually, the special value "Identifier" should also be set to Dublin Core:Identifier, so this column will be used both as the identifier and as metadata. The third file should be imported after the first and the second in order to see the changes.

CSV Format

Since version 2.2-full, only one format is available. Use the upstream release for the other formats, or the release tagged "2.1.5-full", which is the last one with all formats (but bug fixes are not backported to it).

This format, previously named Manage records, allows the creation, update, and deletion of all records with the same file, or with different files if you prefer. See below for the possible actions.

Be warned that if you always reuse the same csv file and also update records from the Omeka admin board, the two can become desynchronized and records may be overwritten.

Each row is independent of the others, so a file can be imported before its item, and an item can be imported into a collection that doesn't exist yet.

Three columns may be used to identify records between the csv file and Omeka: "Identifier", "Identifier Field", and "Record Type". If they are not present, the default values set in the form will be used.

The column "Item" is required to identify the item to which the file is attached. It contains the same identifier as above.

To import metadata of files alone, the columns "Identifier Field" and "File" are required.
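
As a purely illustrative sketch (all names and values are hypothetical, and the exact column set depends on your data), a csv importing metadata for files attached to existing items might look like this, with "," as the column delimiter:

Identifier Field,Item,File,Dublin Core : Title
Dublin Core : Identifier,item_1,https://example.org/files/photo_1.jpg,Title of the first file
Dublin Core : Identifier,item_1,https://example.org/files/photo_2.jpg,Title of the second file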

Notes

Depending on your environment and database, if you import items with encoded urls, the urls may need to be decoded when you import the files' metadata. For example, you can import an item with the file Edmond_Dant%C3%A8s.jpg, but you may have to import your file metadata with the filename Edmond_Dantès.jpg. Furthermore, filenames may or may not be case sensitive.

Files that are attached to an item can be fully updated. If a url is not the same as an existing one, the file will be added. If it is the same, the file will not be reimported: to reimport a file with the same url, you should remove it first. This process avoids many careless errors. To update only the metadata of a file, the column for the url ("File") should be removed. Files are ordered according to the list of files. Note: this process works only when original filenames are unique, so the simplest approach is to set a unique identifier for files too.

The status page indicates the state of previous, queued, and current imports. You can perform an action on any import process.

Note that you can't undo an update, because the previous metadata are overwritten.

The column "Skipped rows" means that some imported lines were non complete or with too many columns, so you need to check your import file.

The column "Skipped records" means that an item or a file can't be created, usually because of a bad url or a bad formatted row. You can check error.log for information.

The count of imported records can differ from the number of rows, because some rows may be updates. Furthermore, multiple records can be created from one row. Files attached directly to items are not counted.

The column Action sets the action to perform for the current row. This parameter is optional; a default can be set in the first step of the import.

The possible actions are (not case sensitive):

Important

This mode doesn't apply to extra data, because the way plugins manage updates of their data varies. So existing data may need to be repeated in the update file in order not to be overwritten (this is the case for the Geolocation plugin).

Extra data are managed by plugins, so some differences should be noted.

In some cases, in particular when the item is saved in another process while the import job is still working in the background, the order of files can be broken. In that case, simply reorder them. The batch edit form can do it automatically (select items in items/browse and click the main "Edit" button, then check the box for CSV Import+ / Order files by filename).

Warning

Use it at your own risk.

It's always recommended to back up your files and your databases, and to check your archives regularly, so you can roll back if needed.

Troubleshooting

See online CSV Import issues and CSV Import+ issues.

License

This plugin is published under GNU/GPL.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

Contact

Current maintainers:

• Daniel Berthereau (see Daniel-KM on GitHub)

This plugin was built by the Center for History & New Media. Release 1.3.4 was then forked for the University of Iowa Libraries and upgraded for École des Ponts ParisTech and Pop Up Archive. This fork was upgraded for Omeka 2.0 for Mines ParisTech.

Copyright