EOL / tramea

A lightweight server for denormalized EOL data
Other
2 stars 1 forks source link

XLS to DwC-A converter #223

Open KatjaSchulz opened 8 years ago

KatjaSchulz commented 8 years ago

Our old XLS to DwC-A converter isn't working anymore, and uploading XLS spreadsheets directly to the EOL content partner registry doesn't work anymore. Eventually, we would like to bring back this functionality, but in the meantime, it would be great to simply have a stand-alone XLS to DwC converter for TraitBank spreadsheets. We could then simply run our spreadsheets through this converter and point the content partner registry to the DwC-Archives created by the converter.

eliagbayani commented 8 years ago

Hi Katja, @jhammock I think I may have an idea of what the problem is. For the meantime, can you send me a link to all of your spreadsheets that needs to be converted to DwC-A. I will convert it locally first on my machine then submit it to you the DwC-A. If everything looks good then I will proceed fixing up our tool. Thanks.

jhammock commented 8 years ago

Thank you, Eli! I have a zipfile for you of resources I need to reharvest- they'll be updates of existing resources, but I don't imagine that will matter: https://www.dropbox.com/s/y78ikoddp6e37sn/to%20harvest.zip?dl=0

@KatjaSchulz , @millerse, if you have spreadsheets ready to go, either on your desktop or in some funny thwarted status in the resource upload interface, please pass them to Eli.

If you can fix the original tool, Eli, that's great, but if it's easier to start from scratch, we're open to all plausible ideas. Let us know how you'd like to proceed. :)

eliagbayani commented 8 years ago

Hi Jen, Good news, I don’t think there is a need to start from scratch here.

All of your spreadsheets were processed. Since #1 to #4 are small, they were processed by our tool (http://services.eol.org/xls2dwca/; but locally from my machine at the moment). While #5 (mineralogy 2015.xlsx) the big spreadsheet was processed by a connector (799.php).

So theoretically, I can also update this connector to be generic and process any template spreadsheet and generate the DWC-A file.

-1. Barton Finkel 2013.xlsx (VALID resource) => https://dl.dropboxusercontent.com/u/7597512/spreadsheets/resources/dwca_20411.tar.gz

http://rs.tdwg.org/dwc/terms/taxon:113 http://eol.org/schema/reference/reference:90 http://rs.tdwg.org/dwc/terms/measurementorfact:559

-2. Barton Pershing 2013.xlsx (VALID resource) => https://dl.dropboxusercontent.com/u/7597512/spreadsheets/resources/dwca_55910.tar.gz

http://rs.tdwg.org/dwc/terms/taxon:29 http://eol.org/schema/reference/reference:4 http://rs.tdwg.org/dwc/terms/measurementorfact:181

-3. Olenina 2006.xlsx (VALID resource) => https://dl.dropboxusercontent.com/u/7597512/spreadsheets/resources/dwca_56340.tar.gz http://rs.tdwg.org/dwc/terms/taxon:900 http://rs.tdwg.org/dwc/terms/measurementorfact:26053

-4. bioluminescent.xlsx (PARTIALLY VALID resource, spreadsheet needs to be fixed) => https://dl.dropboxusercontent.com/u/7597512/spreadsheets/resources/dwca_67664.tar.gz

http://rs.tdwg.org/dwc/terms/taxon:782 http://rs.tdwg.org/dwc/terms/measurementorfact:730

Errors File: taxa.txt Line: 751, 766, 767, 771 Message: Duplicate identifiers

-5. mineralogy 2015.xlsx (VALID resource) => https://dl.dropboxusercontent.com/u/7597512/spreadsheets/resources/799.tar.gz

http://rs.tdwg.org/dwc/terms/taxon:293066 http://rs.tdwg.org/dwc/terms/measurementorfact:986311 http://eol.org/schema/reference/reference:63 This is quite a big spreadsheet. I did process this with an existing connector (799.php).

eliagbayani commented 8 years ago

For developers:

Using our tool (http://services.eol.org/xls2dwca/) locally from my machine. I’ve listed below what worked for me when it comes to processing a rather big spreadsheet. The problem maybe different in our server though. These updates are basically to accommodate a big size spreadsheet, so I’m not sure if this is feasible for our server since we don’t like big uploads to the server.

On my local machine, I did the following: -1. In PHP.ini (PHP configuration file), I set a bigger size (e.g. 50M) for these two variables:

upload_max_filesize = 50M post_max_size = 50M

-2. In our PHP codebase, I just added a timeout of one hour for the actual upload request. Again may not be feasible for our server. /eol_php_code/applications/xls2dwca/form_result.php : if($temp_dir = ContentManager::download_temp_file_and_assign_extension($url, $suffix, array('suffix' => $suffix, 'timeout' => 3600))) //1 hour timeout (3600 seconds)

millerse commented 8 years ago

@eliagbayani Hi Eli,

I have a number of sheets for you: https://www.dropbox.com/s/bnrbmrttgithwa1/Avian%20body%20sizes%20in%20relation%20to%20fecundity%2C%20mating%20system%2C%20display%20behavior%2C%20and%20resource%20sharing%20Export.xls?dl=0 https://www.dropbox.com/s/9b4ie0g024wszy6/carnivore%20dinosaurs.xlsx?dl=0 https://www.dropbox.com/s/llska2bdzl0elo1/climates.xls?dl=0 https://www.dropbox.com/s/faxizw1w9lo9j1e/dana%20dinosaur%20transfer.xlsx?dl=0 https://www.dropbox.com/s/wyrr5o56feycbaj/DC%20Flower%20Export.xls?dl=0 https://www.dropbox.com/s/ximiz5oiedvgdee/eastern%20export.xlsx?dl=0 https://www.dropbox.com/s/yiywpg5pyc3w102/EGG%20CHARACTERISTICS%20AND%20BREEDING%20SEASON%20FOR%20WOODS%20HOLE%20SPECIES.xls?dl=0 https://www.dropbox.com/s/sb6snwmg3f3mu9j/Eggs%20copy.xlsx?dl=0 https://www.dropbox.com/s/vaysqf4w2aok34j/Eggs.xls?dl=0 https://www.dropbox.com/s/o97ue4gkya9s6br/Life%20history%20characteristics%20of%20placental%20non-volant%20mammals.xls?dl=0 https://www.dropbox.com/s/tkd9xijxrkp509a/Macroecological%20mammalian%20body%20mass%20copy.xlsx?dl=0 https://www.dropbox.com/s/om3fhew94hx2kva/Male%20tenure%20length%20and%20variance%20in%20lifetime%20reproductive%20success%20recorded%20for%20mammals%20Transfer.xls?dl=0 https://www.dropbox.com/s/qx4ol1sugo6lqdw/Milk%20Traitbank%20Import.xlsx?dl=0 https://www.dropbox.com/s/nm50iktes7cg6kw/Reptile%20Export.xls?dl=0 https://www.dropbox.com/s/u17km6pnylf6cx3/WWF%202.xlsx?dl=0 https://www.dropbox.com/s/dlybjsx410h90rh/WWF%20Habitats.xlsx?dl=0 https://www.dropbox.com/s/k49tww9xgb2xd8k/WWF.xlsx?dl=0 https://www.dropbox.com/s/fh0jz5n4p0b4g39/Dragonfly%20Other%20Mes.xlsx?dl=0 https://www.dropbox.com/s/2j3zxjmhurckd6x/Avian%20Mass%20Export.xlsx?dl=0 https://www.dropbox.com/s/y77wsxfdo22c6t0/Coral%20Skeletons.xlsx?dl=0 https://www.dropbox.com/s/tpg23r34brmrmmf/Life%20history%20data%20of%20lizards%20of%20the%20world%20export.xlsx?dl=0 https://www.dropbox.com/s/li8l8r18bjpzbi9/Mikesell%20phenological%20data.xlsx?dl=0 https://www.dropbox.com/s/kawt45g3auvrvfl/Parrot%20Fish.xlsx?dl=0 https://www.dropbox.com/s/vtehcwnkcbfhvby/PterosaurData%20Transfer.xlsx?dl=0 https://www.dropbox.com/s/zjxx83uj4ogo0fn/Social%20systems%20of%20mammalian%20species%20Transfer.xlsx?dl=0 https://www.dropbox.com/s/0rfmqrj97v6e37y/Dragonflies%20Measurements%202.xlsx?dl=0

eliagbayani commented 8 years ago

Hi Jen, @millerse, These spreadsheets from @millerse have unknown URIs:

https://www.dropbox.com/s/9b4ie0g024wszy6/carnivore%20dinosaurs.xlsx?dl=0

https://www.dropbox.com/s/yiywpg5pyc3w102/EGG%20CHARACTERISTICS%20AND%20BREEDING%20SEASON%20FOR%20WOODS%20HOLE%20SPECIES.xls?dl=0

https://www.dropbox.com/s/qx4ol1sugo6lqdw/Milk%20Traitbank%20Import.xlsx?dl=0

https://www.dropbox.com/s/fh0jz5n4p0b4g39/Dragonfly%20Other%20Mes.xlsx?dl=1 Unknown URIs:

https://www.dropbox.com/s/tpg23r34brmrmmf/Life%20history%20data%20of%20lizards%20of%20the%20world%20export.xlsx?dl=0

https://www.dropbox.com/s/zjxx83uj4ogo0fn/Social%20systems%20of%20mammalian%20species%20Transfer.xlsx?dl=0

https://www.dropbox.com/s/0rfmqrj97v6e37y/Dragonflies%20Measurements%202.xlsx?dl=0

millerse commented 8 years ago

@eliagbayani Thank you. I will take care of those.

jhammock commented 8 years ago

Thanks! @millerse if you run into any that I should have provided you and are missing, let me know!

millerse commented 8 years ago

@jhammock Will do.

eliagbayani commented 8 years ago

Thanks everybody. Also just FYI. Spreadsheets in .XLSX version is MUCH MORE memory intensive (harder) to process compared to .XLS version. If you have the option to choose, please use/save your spreadsheets as .XLS version. Thanks.

millerse commented 8 years ago

@eliagbayani Do I need to do anything else to help you Eli?

eliagbayani commented 8 years ago

@millerse thanks. Just please inform me once all those unknown URIs I listed above have already been entered in our list, then I should be able to generate all the DWC-A files for those spreadsheets (~20+ total). While the big spreadsheets (around 4 of them) will be available maybe Thursday since I'm going to handle them differently in our connector. Thanks.

millerse commented 8 years ago

@eliagbayani I added those URI's to the site. They should be ready to go. Thanks for the reply.

eliagbayani commented 8 years ago

Hi @millerse I moved to a separate ticket the findings on your spreadsheets since issue-223 is for the converter tool. Thanks.

jhammock commented 8 years ago

Thanks, Eli! I think you've given us enough resources to harvest until we're ready to tackle the large DwC-A again. What are our prospects for standing up the converter on EOL-Archive? If that's likely to be a lot of work, there are only a handful of us who would need to be able to use it. If it's more practical for you to help us install it locally on our machines, that's an option. Katja has PHP and Python on her Macbook already. Let us know.

We presume that larger resources, which don't come along quite so often, we should hand to you to put through the connector. Does that sound right? If so, please let us know the cutoff size we should observe between the two methods. Thanks again

eliagbayani commented 8 years ago

Hi @jhammock ,

Thanks.

jhammock commented 8 years ago

Sounds good! When we have new resources to convert, @millerse , @KatjaSchulz and I will post them in https://github.com/EOL/tramea/issues/235, and I'll ask for https://github.com/EOL/tramea/issues/213 next iteration so you can start work on Archive.

jhammock commented 8 years ago

@eliagbayani may I have another spreadsheet conversion, please? :)

https://www.dropbox.com/s/bksnchuqlzeu0bj/EdwardsEtAl2015.xlsx?dl=0

oops, make that two:

https://www.dropbox.com/s/e1uzf1hcrdi4nbr/animal%20seed%20size.xlsx?dl=0

last one today, I promise:

https://www.dropbox.com/s/ioevc9xye354us9/Benedetti_2015.xlsx?dl=0

eliagbayani commented 8 years ago

Hi @jhammock , here it is: EdwardsEtAl2015.xlsx is VALID - DWC-A

Benedetti_2015.xlsx is VALID - DWC-A Some undefined URIs from (Benedetti_2015.xlsx):

animal seed size.xlsx is VALID - DWC-A

Hi Jen, please take note of the erroneous URIs for the following columns in (animal seed size.xlsx):