Closed rkrashr closed 5 years ago
Hey, that totally makes sense. Furthermore, the iterable does not have alteration methods, therefore, they are perfect to read-only operation. I'll move this discussion also to the email list, please, stay tuned.
Thank you for the feedback
It’s been on my mental back-burner to try out some lazy-loading options in my driver. I’ve considered if it’d work to make a List implementation that lazy-loads internally, since that’d keep some handy methods that Iterable doesn’t have. On the other hand, maybe Stream would be most idiomatic now, and could have efficient count, etc. operations while also having a stronger implication of lazy-loading.
-Jesse
Hey @otaviojava , thanks for you comments. I am new to this project, just thought this would be useful. Stream would work in a generic way, I would leave it as iterable or rx observable. This way it is easier to handle exceptions
Ok, I checked several drivers and yes. Usually, they return as Iterable. I have on my mind two solutions both it will change the result to all search results.
We can change the result to Iterable itself, e.g.:
Iterable<DocumentEntity> query(String query);
What I propose it to have an interface to represents this result as iterable, therefore:
import java.util.Objects;
import java.util.function.BiFunction;
import java.util.function.Function;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;
public interface Result<T> extends Iterable<T> {
Stream<T> stream();
static <E, T> Result<T> of(Iterable<E> entities, Function<E, T> converter) {
Objects.requireNonNull(converter, "converter is required");
Objects.requireNonNull(entities, "entities is required");
final ResultSupplier<T, E> supplier = ServiceLoaderProvider.get(ResultSupplier.class);
return supplier.apply(converter, entities);
}
interface ResultSupplier<T, E> extends BiFunction<Function<E, T>, Iterable<E>, Result<T>> {
}
}
The new return:
Result<DocumentEntity> query(String query);
So:
//database provider side
Iterable<DatabaseEntity> entities = ...;//native query
Function<DatabaseEntity, ColumnEntity> converter = ...;//the translate logic
final Result<ColumnEntity> result = Result.of(entities, converter);//return this value to the user
//lazy value
for (ColumnEntity columnEntity : result) {
System.out.println("");
}
//return as Stream
final Stream<ColumnEntity> columnEntityStream = result.toStream();
@rkrashr That is an amazing way to start! Thank you.
On the topic of working with iterables instead of lists: I like having either iterable or stream as the default internally, and then having the intermediary code able to convert between them. The nice thing there is that it could end up being whatever is most efficient for the driver... in Darwino, the internal Cursor maps pretty closely to Stream operations, so it could be very efficient implementing that directly Maybe ideally it'd be up to the driver: default to Iterable as the standard mechanism of exchange and use StreamSupport conversion as the default, but also have the option for drivers to produce customized Streams
by @jesse-gallagher
Yes, looks good to me - easy to take current List-based implementations and switch to that, but then also smooth to add in efficient specialized versions.
Having lazy collections in general would take care of one of the places in my apps where I have to dip into the native API - namely, getting collection counts. The more I can work with Repository
objects, the better.
I've created the first PR: https://github.com/eclipse-ee4j/nosql/pull/19
@keilw what do you think?
@otaviojava @jesse-gallagher Can you explain the need for a duplicate of List
or Collection
? Hazelcast or Apache Ignite offer distributed implementations of List
, Set
or both, so not sure, if duplicating parts of the Java Collections API makes NoSQL more attractive. Collection
also is Iterable,
so where is the need to introduce a custom implementation or extension to that?
@keilw
We are not duplicating Collection; indeed, we're using Iterable instead.
That is because several NoSQL database uses the Cursor pattern, that means, it retrieves the information lazily instead of putting everything in the memory. The idea of using Iterable instead of Collection is to keep to use this behavior in the API. Furthermore, Collection has a method to either add or remove an element, but, once the result is read-only, the Iterable is suitable. I believe that why several NoSQL drives use it.
Vendors such as:
However, I would strongly argue against the toStream()
method, even if for some reason we stick to Iterable
only. The behavior should be similar to Collection
.
See the JavaDoc saying "Returns a sequential {@code Stream}
with this collection as its source." This is pretty much the same as the toStream
method, so why not call it stream()
instead?
Whether or not we need a parallel stream we can see.
Ok, that makes sense.
@keilw Done,
/**
* Returns a sequential {@link Stream} with this result as its source.
*
* @return a {@link Stream} from the result
*/
Stream<T> stream();
@rkrashr @jesse-gallagher @keilw
I found an issue with this path. Iterable allows me to run the iterator multiple time and that does not happen in several vendors because it uses a cursor.
My question: What do you think: instead of using this Result interface we use Stream directly?
We can create a utilitarians class to help the driver from an Iterable create a Stream that is not a big deal.
So:
Stream<DocumentEntity> query(String query);
That's a good argument for implementing Stream
as the default lowest level. And, like you say, the important thing is that these are all largely interchangeable if desired. So List
-based implementations now are good with Iterable
and Stream
, and then either of those two could convert to the other pretty easily. It'd just be up to the driver's choice to pick the one that is most efficient for it.
From an Iterable I can create a Stream easily:
Stream<?> stream = StreamSupport.stream(iterable.spliterator(), false);
To drivers case:
Iterable<NativeEntity> nativeEntity = nativeEntityResult;
Stream<DocumentEntity> stream = StreamSupport.stream(iterable.spliterator(), false)
.map(convertToDocumentEntity());
And it will execute when there is the terminal method, so lazy.
About keep both options, that looks a challenge. What do you have in mind?
That's exactly what I mean - in the cases where a driver is best off using Iterable
internally, it's easy to convert it to a naive Stream
. Then the ones (like mine) that would fit Stream
better, that gives the most capability and flexibility.
I know, the team is a bit smaller so far (as many Jakarta EE specs btw which are sometimes only worked on by 2-4 people ;-) so wonder, if we could run this by a few people from the old JPA EG (https://www.jcp.org/en/jsr/detail?id=338) like @lukasj (Jakarta Persistence Lead), @emmanuelbernard (also Hibernate), @alexsnaps (he forked Hibernate OGM and he was also at JCrete, although I don't know if he joined the Jakarta EE sessions) or @odrotbohm who has already made suggestions here earlier?
@jesse-gallagher @jesse-gallagher @rkrashr
Could you review this please? I think that is a good first step
I agree that forcing the API into List
might be a bit invasive, especially when it comes to large datasets as that means that you'll have to read all data into objects first before you hand them all out as result. This might result in long execution times and higher memory usage. In fact, the latter can't even be countered with asynchronous executions via a CompletableFuture
as that still has to prepare the entire list. If the work on a particular thread takes a bit of time, chances are high that you're exhausting the thread pool pretty quickly and are back to blocking behavior.
Ultimately, we think reactive APIs are the right approach as that is able to really stream data out of the store, enable continuous queries and at the same time keep the memory profile low, even if you process GBs of data. That said, I totally understand that this is probably out of scope for at least right now. Also, this changes the programming model completely and users might not be eager to get into that in the first place.
So for Spring Data staying with Iterable
in the core interfaces had a couple of upsides and downsides:
List
to implementations.List
as return types by simply redeclaring the method making stronger guarantees about the returned value. Users can then refer to the store specific type that made the re-declarations if they absolutely need List
in their code. However, this limits portability between stores but API-level portability has never been a primary goal for Spring Data.Streamable
that makes up for the lack of methods to easily stream over the result on Iterable
and provides basic functionality to map over the results etc. We don't use it in our type signatures (yet) but users are free to redeclare methods with it and the underlying proxy will adapt to it naturally.Iterable
and even more so Stream
as return type avoids the collection of objects into a list in the best case, it raises the question about the usage of that Iterable
and how to properly handle resources required by the Iterable
. Let's take JPA's Query.getResultStream()
for example: to eventually process the results contained in it, the underlying EntityManager
, database connection etc. still has to stay open. Also, we need a signal that the consumption of the stream has ended, to know when we actually have to release the handles. I.e. Stream
instances returned from repositories have to be used within a try-with-resources block (to get the ….close()
signal) and the Spring specific @Transactional
method, as that controls the lifecycle of the EntityManager
. This is quite a constraint to how users can actually make use of Stream
and it's also tricky to properly prevent misuse. If user code hands out the Stream
out of such a code block, it will happily process a few results depending on how long the closing of the actual database connection will take. This leads to very subtle bugs which look totally weird from the outside. Even worse, with a connection pool in place you might still run a query on a connection that's already returned to the pool and see weird errors from a different thread that by accident got this particular connection handed out from the pool.
In Spring Data, we at least check that the invocation of an Stream
method has to happen within a transactional method. Be sure to understand that @Transaction
does not necessarily mean database transaction here but rather unit of work that just controls resource lifecycle (EntityManager
, database connection etc.).
Thanks for the detailed input. Especially the last part also has some relation to ideas we discussed here or in different places (e.g. MicroProfile mailing list or the JCrete Unconference) how this spec and a possible subset of JPA could interact with each other or build upon each other in Jakarta EE from 9 on.
Thank you @odrotbohm For the feedback, I really appreciate your time to write it. Yes, that is a tough decision. If we go to the stream way we need to make it clean through documentations.
I'll admit that I've only used Stream
fairly straightforwardly to date, but would it suffice to decide that terminal operations are expected to have the side effect of closing the backing resources?
The iterator()
method would be a weird one there, since it's officially terminal, but a user would reasonably expect it to still be lazy-loading from a DB perspective. Even then, that method is explicitly called out in the Javadocs as being weird: "Only the terminal operations iterator() and spliterator() are not; these are provided as an "escape hatch" to enable arbitrary client-controlled pipeline traversals in the event that the existing operations are not sufficient to the task."
@odrotbohm, very well put I would not call handling open resources a con though. It just sets the requirements for handling associated resources and makes sure they are properly closed and disposed when stream is drained. And this includes an associated cursor, but not associated db connection And roughly, the user-side code goes like this:
try (connectionManager = openConnection()) {
Stream<?> dataset = conectionManager.build().stream()
dataset.forEach(System.out::println)
}
but not like this:
Stream<?> dataset;
try (connectionManager = openConnection()) {
dataset = conectionManager.build().stream();
}
dataset.forEach(System.out::println);
@jesse-gallagher
You can use the same concept that you're doing with Iterable. So, when there is not the next element, you can close it. So, we will face the same issues with both Stream and Iterable.
DocumentCollectionManager::search produces a list. Do we assume that result will always comfortably fit in memory? In NoSQL world, it often doesn't and this prevents the whole (very convenient) API to be used at all.
We need a separate search implementation that returns an iterable, or a stream, or an observable
This would be easy to create from existing search implementation. Looking at drivers, most of the time it does iterable -> stream(iterable).collect(toList)