EOL / deprecated_eol_php_code

Encyclopedia of Life
http://eol.org/
Other
5 stars 7 forks source link

Store crop location in DB and use hard links if possible #111

Closed hyanwong closed 9 years ago

hyanwong commented 9 years ago

As discussed with Jeremy Rice, this code implements several improvements to ContentManager.php, in particular in the grab_file() function:

1) When grab_file is called with a local file rather than a url, the code now tries to create a hard link to that file, instead of downloading a new copy (if it can't it falls back to downloading as before). This means when we do a custom crop of an image, we can keep the original download (which may be in a png, tiff, etc form, and so potentially a large file) without increasing disk usage. Previously, cropping an image resulted in losing the original downloaded file and retaining only an EOL-created jpg version.

2) When grab_file() is called on an image with a data_object_id given in the $options parameter, we store the height and width of the image in the image_sizes table in the EOL database (done by functions called within create_local_files_and_thumbnails()). I have checked that all calls to grab_file(xxx, "image") in the EOL codebase have been changed to pass in the data_object_id, apart from cases where we are grabbing thumbnails, or where we are testing the server (e.g. in test_create_load.php). Previously we did not store this image data.

3) When a custom crop is called (via crop_image()), the values are converted to percentages, then passed in to grab_file, which stores them in the image_sizes table. When grab_file is called on an image that already has a data_object_id, we take any previous crop values from the database and re-apply them, so that we get the same crops even when we reharvest. Previously, reharvesting caused all user-initiated crops to be lost.

4) Unit tests have been implemented for all these scenarios, in test_content_manager.php.

All this has involved a substantial rewrite of the code in ContentManager.php. I have tried to keep the logic exactly the same, and I have conducted limited testing (including running the unit tests). However, I do not have a full EOL setup on which to test this new code, so further testing should probably be done on the EoL development system. One particular case is where resources or datasets are in a tar/zip/gzip files and are unpacked by the grab_file routine. There is no unit test for this and I have commented that one should be added (there are also no unit tests of grab_file() for types "partner" and "dataset"). It might be a useful job for the new developers to check on the logic flow of my new code and add test cases.

There are a few places where I have kept the existing code but I think it should be changed or is redundant. I have marked these with a comment which includes 4 asterisks ****. The most major of these is a suggestion that translating the crop coordinates to percentages should be done by the Ruby front-end, since these calculations are dependent on features of the web page (in particular, the restriction of 540 pixel width for displaying the croppable image). It would both be easier and more logical to do web-page specific calculations in the RoR code than to pollute the php code with web-specific parameters as happens at the moment. As a stop-gap measure I have inserted a function crop_image_pct() which can be called by if the Ruby code is altered to produce x, y, and width parameters as percentages from the 580_360.jpg file.

There are also cases where I have implemented a "hack" to work around the fact that EoL does not store on a database the downloaded file extension or (in the case of images) details of any rotations that have been done to the original. There is a good argument that these should be kept (extension in the data_objects table, rotation details in the image_sizes table.

JRice commented 9 years ago

There are a couple of problems with this, but I suspect they shall be simple enough to solve.

First, I was getting a lot of errors from expressions like stat($new_local_file)['nlink'], and I ended up having to break them up thusly:

$stat_new_local = stat($new_local_file);
$this->assertTrue($stat_new_local['nlink'] == 2, 'Should have two hard links for the new file');

...I suspect you may be using a somewhat different version of PHP than we are (5.3), otherwise I'm not sure why that's a syntax error for us.

Secondly (and prohibitively--I was happy enough to make those changes above, though ideally it should be extracted to a method... but this next problem wasn't one I could easily fix), I get this:

Fatal error: Call to undefined method php_active_record\ContentManager::cache_path() in /home/vagrant/git/eol_php_code/tests/unit/test_content_manager.php on line 106

...And, indeed, I'm not finding that method anymore, and I wasn't sure (without digging) whether your intent was to replace it with some permutation of cache_path2num().

Sorry!

hyanwong commented 9 years ago

On 2 Dec 2014, at 17:57, Jeremy Rice notifications@github.com wrote:

$stat_new_local = stat($new_local_file); $this->assertTrue($stat_new_local['nlink'] == 2, 'Should have two hard links for the new file'); ...I suspect you may be using a somewhat different version of PHP than we are (5.3), otherwise I'm not sure why that's a syntax error for us.

Yep, this is simply a php thing:

http://stackoverflow.com/questions/1459377/access-array-returned-by-a-function-in-php

Yuck - another reason to hate php.

Fatal error: Call to undefined method php_active_record\ContentManager::cache_path() in /home/vagrant/git/eol_php_code/tests/unit/test_content_manager.php on line 106

...And, indeed, I'm not finding that method anymore, and I wasn't sure (without digging) whether your intent was to replace it with some permutation of cache_path2num().

I’ve replaced the function cache_path() with cache_path2num(). Seee e.g. https://github.com/hyanwong/eol_php_code/commit/3b59c62afec2d2b8e4da3ccc0636c9eb6f9c1ed9

You should be able to do a simple like-for-like replacement. I think the problem is simply that Eli made a commit that used the old name. As well as more informative, the change in name is needed because we now have a cache_path2num() and a cache_num2path().

What do you think about my suggestion for putting the CSS-dependent stuff in the Ruby code?=

JRice commented 9 years ago

...Mind fixing these things (by rebasing with master, in the case of Eli's conflict) and submitting another pull request? Sorry for the hassle, but I appreciate the help.

I think having the Ruby code calculate the percentages is fine. ...If the PHP can support both pct and non-pct parameters, then the timing of the change on the Ruby end won't be important. (Apologies if this is what you've already done, I've not checked.)

hyanwong commented 9 years ago

On 2 Dec 2014, at 20:34, Jeremy Rice notifications@github.com wrote:

...Mind fixing these things (by rebasing with master, in the case of Eli's conflict) and submitting another pull request? Sorry for the hassle, but I appreciate the help.

Just done. For reasons I don’t understand I couldn’t rebase, so I just merged with master instead. Also added a function for link number, as you asked.

I think having the Ruby code calculate the percentages is fine. ...If the PHP can support both _pct and non-_pct parameters, then the timing of the change on the Ruby end won't be important. (Apologies if this is what you've already done, I've not checked.)

Yep, there are 2 functions now, one using percentages. Once this is in, I can change the ruby code and it should all carry on working.

Do make sure you try my code on a test system first: I’m a little wary of making these big changes to a core bit of functionality. In particular, I’m lacking a test for zipped resource files - and grab_file does fancy stuff with those (unzips them, replaces previously existing stuff etc). I see there is a unit test for non-zipped resources, using the file tests/fixtures/files/test_resource.xml. I’d add a test for (g)zipped resources, but I don’t know where to get a test file from.

I also don’t really understand what grab_file is supposed to do for the “dataset” type: it seems to call delete_old_datasets() for any datasets >14 days old, but that’s all patrick’s code. I hope I’ve left it unchanged.

Yan=