Projections and read velocity

gasparesganga / php-shapefile

PHP library to read and write ESRI Shapefiles, compatible with WKT and GeoJSON

https://gasparesganga.com/labs/php-shapefile/

MIT License

140 stars 50 forks source link

Projections and read velocity #40

Open pablodegrande opened 3 years ago

pablodegrande commented 3 years ago

Hi, it's me again. I have two pieces of code to share with you... it's not a merge request, as I prefer to discuss first if you are interested:

first one is about performance. we have exchanged messages about write performance a while ago. I am now implementing uploading shapefiles in my open source geo app (https://github.com/poblaciones/poblaciones). I've added 'virtualization' to freads by implementing a class that exposes fread, fseek and fopen but internally reads data by blocks of 40kb. It's only for reading.
projections: it takes little code to re-project the entities using proj4php after reading a shapefile file, and projections can be "the problem" of importing shapefiles. I know it's not in the basics of your library, but as a code example, or as a method (geopandas has a method ->tocrs() for it's shape collections) i think it would be nice to have it crafted in your code. I have implemented a way to do it after reading the shapefile... i could send it to you too.
Regards, Pablo.

gasparesganga commented 3 years ago

Hi Pablo! In the past your feedback, ideas and requests proved to be probably the best I had ever got, helping me to improve this library a lot. I'll happily answer to both points:

Reading buffer The first point sounds feasible and my gut feeling is that it should work out of the box, because if you have a class that exposes some fstat, ftell, fseek, and fread methods (regardless of their internal implementation within your class), my library should be able to treat it like a stream resource . The only tweak required on my side is not to rely on PHP is_resource() function. I think we can find a workaround for that. Now my 2 cents about the whole thing: when you provided some great benchmarks that helped me implementing the writing buffer feature in ShapefileWriter class, obviously I looked into the topic, because sure enough I wanted to boost performance using some kind of memory buffer when reading shapefiles too. Long story short, this is something very tied to PHP internal mechanisms and the operating system itself, meaning a binary fread call is (almost) never reading just the amount of bytes required, but some buffers are actually implemented under the hoods, both on PHP side (see this) and OS side. In this latter case, it might be extremenly tricky if not impossible to prevent the OS from actually buffering file reads. Then another consideration: a buffer works great in an ideal scenario, with sequential reads, not random ones. My library allows for random access to shapefies, but even worse, in shapefiles there can be quite some garbage between a record and another, making the use of fseeks necessary for each record, even when it's apparently reading it in a sequential fashion. My conclusion when I approached the issue was simple: no real performance benefit was gained implementing a simple reading buffer even in sequential reads compared to the OS native one and trying to write something more refined wasn't worth the effort in my opinion. Ultimately I gave up on that, but if you managed to find some different approach that actually boosts performance, it'd be great! Can I have a look at the actual code you wrote?

Projections and re-projections Yep, this is a request I often get from users. As you wrote, this is out of the scope of this library. One can (and should! ehehe) use proj for that, in the form of php bindings or somewhere else (typically within a database). For people who need it, I think the best is having 2 separate libraries, this one and proj4php for example, and a controller that use both of them to read and reproject geometries. Each one serving to its own scope, what an ideal world! ;) I am not planning to include any re-projection capability nor interfacing the library to external ones (I make a point about it being 100% standalone), nonetheless I could surely include some examples in the official documentation. Feel free to send me everything and I'll include a dedicated section in the docs, probably adding some PostGIS/SpatiaLite examples as well.

pablodegrande commented 3 years ago

I am sending here links to the 'paged' reader.... it reads blocks instead of bytes and allows for random access:

https://github.com/poblaciones/poblaciones/tree/db66311c00877a1399c190e8427d8f597df095b3/services/src/classes/shapefile You will find this "virtualfile" implementation (it may not be a proper name), and a modified shapefile.php file that instead of keeping track of handles, keeps track of my "virtual files". That should be ok to make benchmarks, and decide if it's worth it (my tests reduced load times to a half but it may depend on file size, hardware, relation between dbf and shp). The reader itself is quite clean, but i did not integrated properly to allow the modified shapefile.php to read and write... it's just a reader. A design to put all together would be still required.

About projections... I've implemented this code to reproject your array structures:

https://github.com/poblaciones/poblaciones/blob/db66311c00877a1399c190e8427d8f597df095b3/services/src/classes/Projections.php, It does (at ProjectGeometry) $geometry->getArray(), then creates new instances by doing: $class = get_class($geometry); $cloned = new $class(); and then it makes: $cloned->initFromArray($arrayProjected);

It could serve as an example of converting outside you library.

However, for 50k polygons, it takes a while to process... that's why I would have liked to have a way to inject the "ProjectPoint" function at some point in the loading process, to avoid rebuilding the whole object tree just to make the coordinates mapping work (i.e. to have a hook/callback somewhere you would still have your library standalone, which is a great feature... maybe the reader could have an optional 'translate' function? I guess all points that enters the loading process pass through "public function addPoint(Point $Point)" in multipoint... but i am not 100% sure about that).

gasparesganga commented 3 years ago

I've just give a very quick look at it (tomorrow and the day after I'll be away, so I won't be able to properly dig into it until the weekend, sorry).

First thought are that some kind of ShapefileFileReaderInterface is needed on my side. Your VirtualFileReader class will implement it and of course I'll provide a base class that encapsulate current logic, stripping it down from Shapefile class, to ensure current working operation. Not overly complicated and actually a good idea, especially in the long run, allowing people to code their own implementation like you.

About projections, I haven't looked at it too much, but why have you gone for this complex route? Wouldn't it be much easier and clean to rely on public Geometry::getArray() method, translate the coordinate array and then create a new Shapefile\Geometry of the same type and initializing it with Geometry::initFromArray()? No need to clone any object, performance should benefit.

PS: A quick thought out of the box. If ultimate performance is what you're after, probably a quick pass through SpatiaLite is your best bet. Read the shapefile with my library, feed geometries as WKT as provided by the library to an in-memory SQLite database (ST_GeomFromText), perform the transformation in there using PROJ natively from SpatiaLite (ST_Transform) and fetch everything for further steps. Cumbersome? Maybe, but it should be orders of magnitude faster than PROJ PHP bindings.

gasparesganga commented 3 years ago

Alright, I've been thinking about the first point and I believe the best and most elegant approach is indeed adding a FileInterface interface. The library will have its own default implementation and both ShapefileReader and ShapefileWriter constructors will accept an array of FileInterface instances other than a simple string, an array of strings or an array of resource handles as they do right now.

You can write your own implementation and pass the initalized objects to ShapefileReader constructor. What do you think about that? Actually I am halfway done writing the new code and modifying existing one, so you'd better like it :))

About the second point: as I wrote before, projections and re-projections are definitely out of the scope of this library. Trying to integrate anything within the library would open a huge can of worms, starting with proper PRJ file parsing (different shapefile generators are producing some wild files out there...). Have you tried any of the approaches I suggested in my last message?

PS: In case you absolutely wanted to stick with your implementation of Point class without the overhead associated with a new object instantiation, I might think about some way to inject geometry object classnames to ShapefileReader class, so that you could easily extend the base Point class and have ShapefileReader use it seamlessly. This is just a quick thought, I need to think about it carefully, I like the library codebase to stay as elegant and coherent as possible, you know ;)

pablodegrande commented 3 years ago

Hi! Sorry for the delay.

A FileInterface seems like a nice/low-impact solution. It could be great to make a few tests about what is a good default buffer size for paged reading.

About the ways to avoid an overhead for projections (and get into a more concise user code), may be to add to all get* methods (getArray, getJson, etc) an optional parameter for translations/projections representing a callback (a "callable", such as in usort (https://www.php.net/manual/es/function.uasort.php).... such as: (changes in bold)

https://github.com/gasparesganga/php-shapefile/blob/master/src/Shapefile/Geometry/MultiPolygon.php

    public function getWKT(**callable projectionFn**)
    {
        $ret = $this->wktInitializeOutput();
        if (!$this->isEmpty()) {
            $parts = [];
            foreach ($this->getPolygons() as $Polygon) {
                $rings = [];
                foreach ($Polygon->getLinestrings() as $Linestring) {
                    $points = [];
                    foreach ($Linestring->getPoints() as $Point) {
                        $points[] = implode(' ', $Point->getRawArray(**projectionFn**));

where projectionFn would be a callback that receives a point (or x, y, z) and returms the point transformed (projected, with an offset, or wathever transformation it could require...

Another similar strategy would be to add a ->Transform method to all geometries, that may receive a similar callback and could iterate through internal point info applying the call to the callback for all points.....

gasparesganga commented 3 years ago

No problem, I had time to code and deploy that FileInterface and default File implementation in the meanwhile. Now everything you need to do is having your VirtualFileReader implement it and initialize a ShapefileReader like that:

$ShapefileReader = new ShapefileReader([
    Shapefile::FILE_SHP => new VirtualFileReader('path/to/file.shp', false),
    Shapefile::FILE_SHX => new VirtualFileReader('path/to/file.shx', false),
    Shapefile::FILE_DBF => new VirtualFileReader('path/to/file.dbf', false),
    Shapefile::FILE_PRJ => new VirtualFileReader('path/to/file.prj', false),
    Shapefile::FILE_CPG => new VirtualFileReader('path/to/file.cpg', false),
], [
    /* Other options here */
]);

You can use master branch to perform some tests, I'll release it with v.3.5.0 as soon as we're finished with everything here ;)

By the way, I've given a look at your VirtualFileReader class and I think you should change that === to an !== in the isOpen method.

Now, onto the projection topic (maybe 2 separate issues would have helped keeping things cleaner? No big deal though).

Forget about the first comment, it turns out it was exactly what you were doing already (getArray --> transform --> initFromArray), my bad, sorry! Have you given a try to the SQLite/SpatiaLite route? Surely not the most straightforward one but it should be definitely faster than any PHP PROJ binding.

Your latest suggestions about adding a parameter to output methods or even some kind of transform() method to geometries is interesting. Probably, should I go this route, I'd add some kind of interface to be implemented instead of a bare callable. I love functional programming but I'm trying to keep this as much OOPish as possible :)

Have you already thought about the pros and cons of both scenarios? I mean:

Making all output methods (getArray, getWKT, getGeoJSON) able to transform geometries, without permanently modifying stored coordinates.
Adding a transform method to Geometry abstract class, that would actually modify its coordinates.

pablodegrande commented 3 years ago

Thanks for the isopen fix. True about better with two tickets =)

I did not try the sqlite strategy mainly because even if it would be performant, I am afraid it would add yet more apis to the code, and i think that may reduce maintainability to future developers... and even when i care about performance, it's not the main goal of the app to import or transform files... so I try to have same balance there.

I believe "Adding a transform method to Geometry abstract class" may lead into mover flexible user code... if someone wants to apply two or more transformations, he could make them in different calls.

With the getArray/getWkt thing, everything should be done at once... and (second problem) it mixes the export abilities to the transformation ones (e.g. if someone would want to open a shapefile, reproject and save to a shapefile again [a reasonable user case] it me be in trouble with this get* methods, while Transform method would be more than ok to so....).

If you would add interfaces to a ->Transform(ITransformer transformer) method, it may be ok to distinguish between IPointTransformers and IGeometryTransformers... (projections would be IPointTransforms, as one don't care about the whole shapes or rings... just points).... If your are interested in examples other than projections, all css-transform cases may apply... (ej. scale the geometry, move the geometry)... Although I thinks projections would be the most common need in a shapefile library....

gasparesganga commented 3 years ago

Hi there! Sorry for the delayed answer, unfortunately my very strict work schedule and private life sometimes gets in the way of open source projects, you know ;)

So, I've been thinking about it, and probably the simplest approach is the best one, meaning a trivial PointTranslator interface whose instance should be passed to a brand new Geometry::translate() method. Implementations of such interface should just provide a translate method accepting a Point object and returnig another one:

namespace Shapefile\Geometry;

interface PointTranslator
{
    /**
     * Translates a Point.
     *
     * @param   \Shapefile\Geometry\Point   $Point
     *
     * @return  \Shapefile\Geometry\Point
     */
    public function translate(Point $Point);
}

Final usage would be something like: Geometry::translate(PointTranslator $PointTranslatorInstance).

The idea is that any geometry translation would come down to a simple point coordinates translation, thus let the libraray handle everything under the hoods while everything external translators need to know is how to translate a point. If one's code is able to translate geometries as a whole, for example using WKT as input/output format, he'd better use proper Geometry::getWKT() and Geometry::initFromWKT() methods, right?

Now, onto the less obvious things: Detecting incongruences between isZ() and/or isM() source and translated Points is trivial and a ShapefileException can easily be triggered, no big deal here. But... what should happen when translating a Polygon or MultiPolygon according to their second and third constructor parameters $closed_rings and $force_orientation? The whole library is designed in a way that established geometries cannot be altered. At first thought I'd say that both checks and eventual forcing should be performed again after a translation, according to the original constructor parameters value of the Polygon/MultiPolygon being translated. Does it make sense?

By the way, what about the File interface? Could you make some benchmarks with buffered reads already? I am curious about the results.

pablodegrande commented 3 years ago

The PointTranslator seems great.

About the PagedFileReader, it would be nice if it could load by default for reading files. I made benchmarks, selecting different shapefiles. Below is the table showing how large the dbf and shp files were, the # of rows and how many lines per second the library fetched.

In all tests the file was most probably in memory by the OS (windows), because I triggered a full read before the test to have that effect "controlled". I think that is realistic for web scenarios (i.e. when uploading a zip with shapefile and expand, or upload a file as attachment, php writes to disk and the OS may keep that in filesystem's caches). Additionally, I would have had to restart the computer before each benchmark to test otherwise, and I did not wanted to =)

The main result is that there is no benefit on my code. When I first implemented the reader, I measured improvements while importing the file in the my local (dev) webserver. However, running the tests on a sample of different files, from a shell of my windows desktop pc (16gb ram) and in a shared linux hosting, both lead to better results in the raw fread() implementation.

After such results, I would abandon the IFile interface, unless you would want to run further tests before discarding it.

gasparesganga commented 3 years ago

Pero, no habíamos quedado en que con ese buffer se podía leer 2 veces más rápido? Si fuera malo, ahora sería un buen momento para soltarte un mira que te lo dije... jejejeje

Honestly, these results don't suprise me too much: when I added the writing buffer feature to ShapefileWriter class (by the way, that was thanks to your input, feedback and accurate benchmarks, thank you!), I also tried something similar with a reading buffer for ShapefileReader class. Altough it was definitely raw and not refined, initial outcomes suggested that an approach like that was redundant using PHP file stream resources, because the own operating system (and PHP itself too) was already buffering things in the background. I haven't tried with different types of stream resources though (ie: a remote url?).

But worry not, I like this File interface and I think it's a nice addition to the codebase, decoupling the physical disk files from the library. Maybe someone will be interested in writing an in-memory implementation or something else more exotic, why not? Surely, using the default StreamResourceFile is not impacting negatively performance compared to v.3.4.0, so why dropping it? It adds more versatility to the library, I like the idea! ;) Moreover, it works for both reading and writing, thus an in-memory implementation could make even more sense coupled with ShapefileWriter class.

Now, onto the PointTranslator. These days it seems work is overflowing so I didn't have too much time to think about it honestly. I feel there could be some drawbacks and I want to think about it carefully and plan it accordingly. On the outside (public methods) this library offers a decent OOP interface (I hope!), but there are a few quirks here and there that I had to use in order to boost performance under the hoods and I fear that breaking the established points in geometries won't change ever principle might break something.

pablodegrande commented 3 years ago

Jaja... sí. Apparently the previous improvement was due to specific features of the file I was using. After all, that's why benchmarking with many different type of files makes sense.

It seems ok to think carefully about mutating or not the points. Hope you get into something elegant for the code inside and outside the library =)

DarkSide666 commented 2 years ago

Hello guys! This library is great! Thanks for a great work of yours.

To add my 2 cents in this topic - PointTranslator is still really needed feature. Without that it's close to impossible to do translations without extending or overwriting some methods. I tried to put my translation code in Geometry/Point->init() method, but because it's private, then can't normally extend it. So, yeah, legit way to do simple point translations would be really appreciated. In my case it's coordinate system transformation (which is defined in PRJ file) what is required to add ...

gasparesganga commented 2 years ago

Hi there. I let these features hanging around (actually, I committed the changes for File Interface long ago) due to an incredible full working schedule and other personal stuff, but this library's maintenance and developing is always in my priorities list.

I will try to finalise it by the end of the month or so. As usual, one of the most boring and time-consuming parts is writing some good documentation about new features...

@DarkSide666 Using the PointTranslator Interface concept, the library won't need to know anything about the actual transformation, so once you know how to handle it for a single coordinate pair (or triplet), it will work like a charm.

DarkSide666 commented 2 years ago

Yeah, that's exactly what I need - legit ability to do that myself by defining my own translator (with your defined interface) and passing it to your library. BTW, also bounding box need such ability, because looks like it isn't using Point class, but still contains coordinates which should be translated. Thanks a lot !