NetTopologySuite / NetTopologySuite.IO.ShapeFile

The ShapeFile IO module for NTS.
35 stars 25 forks source link

Shapefile IO refactoring #23

Closed xivk closed 4 years ago

xivk commented 7 years ago

I want to change the way the shapefile read/write works now:

Anyone have any feedback? Should I do this outside of the main NTS project, create a project/package in a new repo?

airbreather commented 7 years ago

Have a package for this only without dragging along the rest. Now you also get, Postgis and MsSqlSpatial using the NetTopologySuite.IO package. Perhaps even with only one dependency, NetTopologySuite.IO and nothing else.

Can you explain what you mean further, please? The NetTopologySuite.IO package only depends on NetTopologySuite (and thus GeoAPI).

Or are you suggesting that there should be some kind of additional package NetTopologySuite.IO.ShapeFile or NetTopologySuite.IO.GeoTools or something, which only depends on GeoAPI? If so, I like this idea, if you can rip out its uses of other stuff from NetTopologySuite proper.

From memory, I think the only major thing the NetTopologySuite.IO.GeoTools uses from NetTopologySuite proper is the stuff from the NetTopologySuite.Features namespace, so it sounds possible. Doing it in a backwards-compatibility would be a nuisance.

Refactor to support some netstandard version.

I agree that shapefile reading / writing support in .NET Standard is important enough to justify. Unless you get really aggressive with it, I think the lowest you can do is .NET Standard 2.0 (especially for the use case that I think you have in mind, which requires all the APIs that make me say that).

Remove code that has nothing to do with shapefile writing/reading.

Looking at NetTopologySuite.IO.GeoTools, I don't really think there's all that much of this... am I looking in the wrong place?

xivk commented 7 years ago

There is IO.GeoTools but it has a dependency on IO.ShapeFile. I think we need to define what does what here too because it's a bit confusing having this:

Not suggesting only depending on GeoAPI, I don't think that's a good choice because we need the geometries in NTS.

This code I don't think is IO and dumps a shapefile into a database:

https://github.com/NetTopologySuite/NetTopologySuite/blob/develop/NetTopologySuite.IO/NetTopologySuite.IO.GeoTools/Shapefile.FullFat.cs

Not sure where this would go, does it even belong in NTS?

I think we should have one package, on top of NTS, that supports IO for shapefiles:

I suggest removing the functionality from GeoTools and adding it to ShapeFile.

airbreather commented 7 years ago

There is IO.GeoTools but it has a dependency on IO.ShapeFile. I think we need to define what does what here too because it's a bit confusing having this:

  • IO.ShapeFile: ?
  • IO.GeoTools: Reads/Writes shapfiles.

Got it... makes sense.

This code I don't think is IO and dumps a shapefile into a database:

https://github.com/NetTopologySuite/NetTopologySuite/blob/develop/NetTopologySuite.IO/NetTopologySuite.IO.GeoTools/Shapefile.FullFat.cs

Not sure where this would go, does it even belong in NTS?

I see now. Agreed, the SQL stuff in that file looks like crap.

I think we should have one package, on top of NTS, that supports IO for shapefiles:

  • Without a dependency on System.Data, this is not needed to read/write the geometries and attributes.

This was what I meant when I said "really aggressive", particularly the "and attributes" part. ShapefileDataReader was designed to surface the results using the interface that already exists in System.Data.IDataReader. This design dates back to 2002-2003 in Geotools.NET (which, not too coincidentally, was written by some other employees of the company I currently work for). I don't think that's the best design possible, but it has its own advantages like being able to use it with APIs that consume arbitrary IDataReader instances.

Notably, I think it's a big deal that the *.dbf format is very much a good match with IDataReader.

Anyway, redoing that means a lot of rework and finding an alternative design that works, especially if you want to maintain backwards-compatibility since you'd probably have to at least consume that other design in the shape of IDataReader for unusual kinds of consumers.

  • Only a dependency on NTS.IO package.

I'd personally put it into the NTS.IO package, not a big difference.

  • Platforms we have now + netstandard, we should be able to add the System.IO.FileSystem package as a dependency and support netstandard1.3

It's already used in the NetTopologySuite.IO project / DLL, in ShapefileStreamProviderRegistry.

I suggest removing the functionality from GeoTools and adding it to ShapeFile.

Agreed... actually, @FObermaier's Refactor_for_NETStandard branch had an idea that I really liked, which was to combine all the NetTopologySuite.IO.* DLLs that go into the NetTopologySuite.IO package into a single NetTopologySuite.IO project and assembly. While we're already making a breaking change, we might as well go all-in and combine these assemblies together like that, eh?

xivk commented 7 years ago

I'll start on a pullrequest somewhere next week. I agree that we should include all we can in NetTopologySuite.IO but only when there aren't any extra dependencies required, for example Newtonsoft.Json for the GeoJson IO. should stay seperate.

LennartRoeder commented 7 years ago

I'd like to emphasise that shapefile support is THE feature I want from NTS, now that we have .NET Core compatibility, so I am looking forward for this ticket to get resolved . Keep up the good work!

peetw commented 7 years ago

Thumbs up to all of this :) Also, would it be possible to extract an interface for the ShapefileDataReader (i.e. IShapefileDataReader)? This would make it much easier to mock out during testing (unless I've missed something obvious?).

xivk commented 7 years ago

It has taken a bit longer than I expected but I created a small separate project here:

https://github.com/itinero/nettopologysuite.io.portableshape

This supports netstandard1.3 and has no dependencies other than the main NTS package so it can easily be made into a pull-request for the main NTS repo here. But it's a pretty big change, it would mean removing IO.GeoTools and converting IO.ShapeFile.

Let me know what you all think so I can start working on a pull request.

In the meantime feel free to test the package containing all this:

https://www.nuget.org/packages/NetTopologySuite.IO.PortableShape/ (will be removed again after merging here)

airbreather commented 7 years ago

https://github.com/itinero/nettopologysuite.io.portableshape

I've briefly looked over this... it looked like the main thing you did was take out the System.Data interface implementations, but leave the rest of the design more-or-less intact. Is that accurate?

If so, I'm not totally sure about this kind of a change. The legacy design is really awkward, and it kinda has to be in order to fit the shape of System.Data.IDataReader, and then the System.Collections.IEnumerator / System.Collections.IEnumerable part makes it even weirder (IIRC, .NET 2.0 wasn't out yet when Geotools.NET was written, so it couldn't have been System.Collections.Generic.IEnumerable<T>).

Perhaps a better course of action might be to just look into making the existing design work with the netstandard2.0 target framework, and then invest in designing / developing an alternative performance-friendly API for reading/writing shapefiles? Ideally (IMO), the next significant breaking change should probably resolve #154, assuming that's still a problem, and netstandard2.0 would probably cover the majority of applications who are looking for something today. All this, of course, assuming that netstandard2.0 gives us enough to minimize the effort of that part...

In theory, it would be possible to make a lower-level API for shapefiles, and then the current NetTopologySuite.IO.GeoTools API could be reimplemented to sit on top of that, whereas things like Itinero with its (I assume) multiple gigabytes of shape (meta/)data could benefit from the lower-level API. I'm thinking something focused around Span<T> / ReadOnlySpan<T> would make sense, because the ESRI and dBase formats seem to lend themselves nicely to that kind of thing.

xivk commented 7 years ago

I'm in favour of a pragmatic approach and take what we have now, simplify it and add some extra functionality making it easier to work with.

Yes, I basically took out System.Data and added a Feature, and Attributes property to the reader to get a feature per row without having to write your own code. System.Data doesn't really add value. I also concentrated all the code needed in one project but I had to take code from 3 different projects in the original NTS code, most of the code is used exclusively for shapefile IO and should be removed. So, basically I cleaned up a lot of the code just by ripping out the shapefile IO functionality.

We could also make the reader implement IEnumerable<IFeature> easily now, that would make all this a lot easier to work with already.

We can now have one method that takes a filename and returns an IEnumerable<IFeature> by installing a package with no other dependencies other than NTS. Writing a shapefile was already pretty easy I think.

The risk of a full redesign is that it will never happen. I'm not willing to put in the effort. Taking in this code, improving it a little bit, releasing it, doesn't stop a redesign in the future. We could also leave the IO code in seperate repos and let it have it's own release cycle.

As for issue #154, we can work on that now because the code has been cleaned up.

I think the advantages are clear:

Basically I think we should remove IO.GeoTools and replace IO.ShapeFile with this new project or keep it a seperate project but then under the NTS organization and remove IO.GeoTools and IO.ShapeFile here.

airbreather commented 7 years ago

System.Data doesn't really add value.

I'm not sure I agree... I can easily imagine someone on .NET 4.x using System.Data.SqlClient.SqlBulkCopy to load data from a shapefile to a SQL Server database. I've certainly done similar things myself. Unless this is completely broken for that use case, I hesitate to write it off outright.

Taking in this code, improving it a little bit, releasing it, doesn't stop a redesign in the future.

The main thing I'm concerned about with this is that consumers that use the existing shapefile API would almost certainly have to adapt to this change, for questionable benefit. IMO, if we're going to (potentially) break downstream consumers, then it should be for a good reason, especially if we're not offering them anything new, especially-especially if the API is still going to be a clunky mess.

The risk of a full redesign is that it will never happen. I'm not willing to put in the effort.

Agreed, that's the risk. As things stand right now, do you see any reasons why it wouldn't work to just add netstandard2.0 as a target platform? Last I checked, it looked like most or all of the APIs that kept NetTopologySuite.IO.GeoTools away from netstandard1.x were added in netstandard2.0. I haven't done a proper review, but assuming there's no particular reason why that wouldn't work, that feels like a safer starting point.

xivk commented 7 years ago

System.Data was just used to implement one interface, IDataReader. So still not convinced, if we take a dependency it should, as far as I'm concerned be related to the core goal of the package, and that is reading and writing shapefiles.

As for the API, nothing has changed that's related to the core funtionality of what this should do, read/write shapefiles. The only potentially breaking change is that netstandard doesn't have a default encoding because it doesn't define a platform. That will remain the case for netstandard2.0.

Enabling bulk copy could be a sample project we publish, it's perfectly doable without having a system.data dependency. Otherwise that would imply we would need to implement IDataReader in all IO packages.

airbreather commented 7 years ago

FWIW, 4b18538 makes NetTopologySuite.IO.GeoTools and NetTopologySuite.IO.ShapeFile.Extended build for me on .NET Standard 2.0, with the only public API difference being that this import method won't show up on the .NET Standard 2.0 target.

Not that this method couldn't be made to work (it wouldn't even be that hard), just I didn't want to add the package reference to System.Data.SqlClient that it would take to make it work.

airbreather commented 5 years ago

Can this be closed, or should I move it to the NetTopologySuite.IO.ShapeFile project?

xivk commented 4 years ago

Looks like this can be closed! Excellent work being done these days in NTS! :100: