gbif / ipt

GBIF Integrated Publishing Toolkit (IPT)
https://www.gbif.org/ipt
Apache License 2.0

Uploading big data files (8GB) #1300

Closed: fpadilla closed this issue 7 years ago

fpadilla commented 7 years ago

Hello everyone

I am setting up an IPT node for the Cornell Lab of Ornithology. We have an observation data file that is 8 GB compressed, with around 270 million records. There is an upload limit of 100 MB. How can I upload this file to our node? Can I change this limit?

Thanks

Francisco.

kbraak commented 7 years ago

Thanks for your question Francisco.

For publishing an extremely large data set such as this, I would actually recommend working with a small subset of records first. This will allow you to iteratively refine it much more easily, since each publishing round won't take as long.

Assuming you have access to the IPT Data Directory, here are the steps to follow:

  1. Create a smaller file containing the first few lines from the larger original file.
  2. Upload this smaller file as a new data source for the existing IPT resource.
  3. Perfect your mappings and metadata and ensure it publishes successfully.
  4. Manually replace the smaller file with the larger one on the file system. The file is located inside $IPT_DATA_DIR/resources/[shortname]/sources. You will need to preserve the file's name and column structure for the swap to work.
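
Steps 1 and 4 can be sketched in Python. This is only a minimal illustration, not part of the IPT itself. It assumes the source is a delimited text file with one header line and one record per line, either plain or gzip-compressed. The function names `open_text`, `write_head`, and `same_header` are hypothetical helpers invented for this example.

```python
import gzip
import itertools
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed text file for reading.

    Assumption: compressed files end in ".gz" and the text is UTF-8.
    """
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt", encoding="utf-8", newline="")
    return open(path, "r", encoding="utf-8", newline="")


def write_head(source, target, n_records=1000):
    """Copy the header line plus the first n_records lines to target.

    Returns the number of lines written. Reads lazily, so the large
    file is never loaded into memory.
    """
    count = 0
    with open_text(source) as src, \
            open(target, "w", encoding="utf-8", newline="") as dst:
        for line in itertools.islice(src, n_records + 1):
            dst.write(line)
            count += 1
    return count


def same_header(small, large):
    """Check that both files start with the same header line.

    A cheap sanity check on the column structure before swapping files.
    """
    with open_text(small) as a, open_text(large) as b:
        return a.readline().rstrip("\r\n") == b.readline().rstrip("\r\n")
```

For example, `write_head("observations.txt.gz", "observations.txt", 1000)` would produce the small file to upload in step 2 (both file names are placeholders). Before the swap in step 4, `same_header(...)` gives a quick check that the column structure of the full file still matches the subset the mappings were built on.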

Hope this approach works for you and good luck publishing.

fpadilla commented 7 years ago

Thank you @kbraak!