dvidelabs / flatcc

FlatBuffers Compiler and Library in C for C
Apache License 2.0

Add "scan_by" methods to reader interface #23

Closed skhoroshavin closed 7 years ago

skhoroshavin commented 7 years ago

Sometimes the order of items in an array is important, yet there can still be a need to find an item by some key. In this case the only option is to perform a linear search, which is still okay for a small number of items (and a small number of items seems to be quite a frequent case for flatbuffers). Writing search functions manually can be tedious, so it would be great if flatcc generated them automatically, either giving them an _unsorted suffix, or generating them in place of the usual binary searches when some key is given to flatcc. Or maybe some other way (a ukey attribute instead of key?)

mikkelfj commented 7 years ago

The FlatBuffers schema has a key attribute, but it is only allowed on one table field. The flatcc schema parser actually allows multiple key fields and generates find_by_... accessors, as well as a plain find for the first key.

The builder does not sort arrays because of the emitter interface, which supports custom streaming and avoids storing the full buffer in memory at once. The arrays are also left unsorted to support a custom choice of which field to sort by, if any. Instead the reader interface has a cast to mutable and a sort operation on mutable arrays. There is a sort for each keyed field. The sort is an in-place unstable heap sort that does not require extra memory. This is documented in the builder.md document, and the reflection API (bfbs) implementation makes use of it because the binary flatbuffers schema has sorted arrays.
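As a rough sketch of how that reads in user code - assuming the monster_test schema where Monster has a name key field and a testarrayoftables vector; the generated type and function names below follow the conventions described here but should be checked against the actual generated headers:

#include "monster_test_reader.h"

#undef ns
#define ns(x) FLATBUFFERS_WRAP_NAMESPACE(MyGame_Example, x)

void sort_then_find(const void *buffer)
{
    ns(Monster_table_t) mon = ns(Monster_as_root)(buffer);
    ns(Monster_vec_t) monsters = ns(Monster_testarrayoftables)(mon);

    /* Cast away const and sort in place by the `name` key field. */
    ns(Monster_vec_sort_by_name)((ns(Monster_mutable_vec_t))monsters);

    /* Binary search is only valid after the vector has been sorted. */
    size_t i = ns(Monster_vec_find_by_name)(monsters, "TwoFace");
    (void)i;
}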

In addition it is possible to manually sort arrays by other criteria and implement one's own binary search or hash lookup methods - but this is beyond the scope of the core flatcc implementation. That said, support libraries could be developed for this purpose.

Note that sorting is unsafe even with a verified buffer - you need to be able to trust the buffer 100% or to first rewrite the buffer due to certain attack vectors related to overlapping fields.

mikkelfj commented 7 years ago

Giving it some more thought, it would be possible to add a scan_by method to the reader interface. This could be done for all keyed fields. It would be logical when supporting multiple keys, since at most one of these would be sorted.

While the binary search is non-trivial due to the macro system, a scan should be able to reuse the comparison logic and therefore be relatively straightforward to implement.

If you are up for it, without any modifications to the schema, please have a go.

It also makes sense when receiving external buffers that are not sorted, and where sort is not safe due to the mentioned attack vector.
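A minimal sketch of that idea - a linear scan driven by a key comparison callback; the names and the callback signature here are illustrative only, not the actual flatcc macro machinery:

#include <stddef.h>

#define NOT_FOUND ((size_t)-1)

/* cmp returns 0 when the element at index i matches the key */
typedef int (*key_cmp_f)(const void *vec, size_t i, const void *key);

static size_t scan_by_key(const void *vec, size_t len,
                          const void *key, key_cmp_f cmp)
{
    size_t i;

    for (i = 0; i < len; ++i) {
        if (cmp(vec, i, key) == 0)
            return i;
    }
    return NOT_FOUND;
}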

skhoroshavin commented 7 years ago

Thanks, I know about the sort helper functions, I already read all available documentation on both the flatbuffers and flatcc projects. My main concern was not about arrays being unsorted by default, but about keeping them unsorted and still being able to find some element by key. For example, the order of stored fields of some structure definition is important, and yet there could be a need to search for a field by name. However, on second thought, in this specific case fields will likely have something like an offset property and can be sorted back into the original order when needed. And even if they don't have one, it can easily be added. So, having sort and binary find functions for different fields is flexible enough for any task.

After seeing your update: the scan_by naming looks clear enough, and it will still be more efficient (and easier to use) than sorting arrays every time a different find is needed. So, I'll put it on my todo list, thanks :)

mikkelfj commented 7 years ago

I was wondering if it makes sense to use a start offset for scan so all matches can be found incrementally. An end offset could also be added but that might be overengineering.

mikkelfj commented 7 years ago

Also keep in mind that a common way to index flatbuffers is to use a separate integer array as an index. This approach can also be used for the external references mentioned in your other issue: https://github.com/dvidelabs/flatcc/issues/22 The binary schema uses this for some indexing purposes.

mikkelfj commented 7 years ago

BTW: any thoughts on adding a separate integer array as a hash, using fnv1a and linear open addressing, with a requirement that the array is at least the size of the indexed vector, and preferably supports a load factor of no more than 0.7?

Such an array might have an attribute stating which array is being hashed. The array being hashed would have key as with binary search. We'd have to discuss this with the main flatbuffers project.
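For illustration, a hedged sketch of what such a lookup could look like - a separate uint32 index array hashed with FNV-1a and probed linearly; the names, the key_at callback, and the layout are assumptions made for this example, not an existing flatcc API:

#include <stdint.h>
#include <string.h>

static uint32_t fnv1a(const char *s)
{
    uint32_t h = 2166136261u;                            /* FNV offset basis */
    while (*s) { h ^= (uint8_t)*s++; h *= 16777619u; }   /* FNV prime */
    return h;
}

/* index[] has hash_size slots (kept at a load factor of at most 0.7, so it is
 * never full), each holding an element index into the keyed vector, or
 * UINT32_MAX for an empty slot. key_at returns the key of element i. */
static uint32_t hash_lookup(const uint32_t *index, uint32_t hash_size,
                            const char *key,
                            const char *(*key_at)(uint32_t i))
{
    uint32_t slot = fnv1a(key) % hash_size;

    while (index[slot] != UINT32_MAX) {
        if (strcmp(key_at(index[slot]), key) == 0)
            return index[slot];
        slot = (slot + 1) % hash_size;  /* linear open addressing */
    }
    return UINT32_MAX;                  /* not found */
}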

In the external subdirectory of the source tree is a hash table implementation that does not store keys in-place, and while it cannot be used directly, the concept can.

skhoroshavin commented 7 years ago

I was wondering if it makes sense to use a start offset for scan so all matches can be found incrementally.

Yes, it does make sense. Thanks for mentioning this.

Also keep in mind that a common way to index flatbuffers is to use a separate integer array as index.

Didn't quite get what you mean. Could you give a link to example code?

any thoughts on adding a separate integer array as a hash using fnv1a and linear open addressing

If I understand correctly, this is basically storing something like a "database index" along with the data itself? This seems useful for the case when there is a need to decode data as fast as possible and wire size is not an issue. Otherwise it would be better to send just the data and implement indexing on the receiving side.

In the external subdirectory of the source tree is a hash table implementation that does not store keys in-place

I'll have a look. Btw, in my pet project I implemented a hash with an array-like interface for keys which also uses open addressing. In my implementation keys are in-place, but they can easily be made external when modification is not needed. It also has bulk insert which is fast and dead simple. The only catch is it needs two arrays, not one, but on the plus side it doesn't need linear probing in case of collisions, so performance should be better.

Also, if implementing mutable reader interface with mutable contexts mentioned in #22, this hashing can be performed automatically on load.

mikkelfj commented 7 years ago

Also keep in mind that a common way to index flatbuffers is to use a separate integer array as index. Didn't quite get what you mean. Could you give a link to example code?

I was mistaken wrt. reflection: it uses a DAG - I don't have an example off-hand - but the integer index approach can be used when DAGs are not wanted, e.g. due to JSON transport. But from what you write below, you got the idea. Reflection still uses integers for references, but only as single element references, e.g. an array type of some other type references that type by integer index AFAIR.

any thoughts on adding a separate integer array as a hash using fnv1a and linear open addressing If I understand correctly, this is basically storing something like a "database index" along with the data itself? This seems useful for the case when there is a need to decode data as fast as possible and wire size is not an issue. Otherwise it would be better to send just the data and implement indexing on the receiving side.

Yes, it depends on the use case. Building a hash table may cause performance spikes, and if your data is memory mapped, pre-indexing may be preferable, but for on-wire transport it is a waste. An index may also be stored in a separate flatbuffer - this is especially relevant when stacking multiple flatbuffers in the same mmapped file. Support methods could build a hash similarly to sort, and perform lookups taking a pointer and the size of the hash table. Not sure it is worth cluttering flatcc with that, but since it may make use of the key comparisons, it might make sense.

In the external subdirectory of the source tree is a hash table implementation that does not store keys in-place I'll have a look. Btw, in my pet project I implemented a hash with an array-like interface for keys which also uses open addressing. In my implementation keys are in-place, but they can easily be made external when modification is not needed.

The saved space generally pays off very quickly in cache-line density. So much so that you can afford larger tables with a lower load factor instead of trying to be clever.

It also has bulk insert which is fast and dead simple. The only catch is it needs two arrays, not one, but on the plus side it doesn't need linear probing in case of collisions, so performance should be better.

I recently played around a lot with linear probing, and it really is one of the fastest ways to do it, if you make sure to keep the load factor at 0.7 at most. I'm not sure how you can avoid collisions unless you implement chaining or perfect hashing. But collisions are really cheap if done right - I'm down to a few nanoseconds on average for small tables, and 30ns once the level 1 cache is exhausted. Robin Hood hashing and other fancy schemes don't pay off because they consume too much space - they may pay off when you delete entries frequently because of tombstones, which you don't do here. In fact, you can optimise out checks for tombstones. As I recall, quadratic probing isn't worthwhile either, at least with a reasonable load factor.

Also, if implementing mutable reader interface with mutable contexts mentioned in #22, this hashing can be performed automatically on load.

I was, at some point, considering generating code for sorting arrays automatically. It is easy to do, you just need to traverse the buffer and call sort in the appropriate places. But I didn't find it worthwhile to implement, nor to have the extra file generated. Hashes are a bit more involved if you require separate memory. Your suggestion appears to be doing it lazily, which is also an option, although it may cause unwanted performance spikes.

skhoroshavin commented 7 years ago

if your data is memory mapped pre-indexing may be preferable

Didn't think about this case, thanks for pointing it out. And this is another reason for the buffer being read-only.

I'm not sure how you can avoid collisions unless you implement chaining

Yes, I do implement chaining with linked lists, but in a relatively cache-friendly way.

Data layout:

- keys[key_count] - array of keys stored in the hash
- _hash[hash_size] - hash array containing indexes of the corresponding keys, initialized with some invalid index (I use MAX_UINT)
- _next[key_count] - array of "next" indexes to resolve collisions

Insertion:

1) Append the key to the array of keys
2) Calculate the key's hash index, read the value from the hash array and append it to the "next" array
3) Store the index of the just-appended key in the hash array

Lookup:

1) Calculate the key's hash index, read the value from the hash array, look up the key in the keys array
2) While the key doesn't match, read the next index and check the next key
3) Stop when the key matches or the next index is invalid

This way insertion is constant time, lookup and deletion are mostly constant time, and there is no degradation over time since no tombstones are needed. But yes, at the cost of more memory consumption.
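A minimal C sketch of the layout described above - hypothetical names, integer keys, fixed sizes, no growth or deletion - purely to illustrate the two-array idea, not part of flatcc:

#include <stdint.h>
#include <string.h>

#define HASH_SIZE 16u             /* power of two for cheap masking */
#define MAX_KEYS  16u
#define INVALID   UINT32_MAX

static uint32_t keys[MAX_KEYS];   /* keys stored in insertion order */
static uint32_t hash_[HASH_SIZE]; /* head index per bucket, INVALID if empty */
static uint32_t next_[MAX_KEYS];  /* per-key chain link to resolve collisions */
static uint32_t key_count;

static uint32_t bucket(uint32_t key) { return (key * 2654435761u) & (HASH_SIZE - 1u); }

static void hash_init(void) { memset(hash_, 0xff, sizeof(hash_)); key_count = 0; }

/* assumes key_count < MAX_KEYS */
static void hash_insert(uint32_t key)
{
    uint32_t b = bucket(key);
    keys[key_count] = key;         /* 1) append key */
    next_[key_count] = hash_[b];   /* 2) old bucket head becomes our next */
    hash_[b] = key_count++;        /* 3) new key becomes the bucket head */
}

static uint32_t hash_find(uint32_t key)
{
    uint32_t i = hash_[bucket(key)];
    while (i != INVALID && keys[i] != key)  /* follow chain until match or end */
        i = next_[i];
    return i;                      /* INVALID if not found */
}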

But, collisions are really cheap if done right - I'm down to a few nanoseconds on average for small tables

To be honest, for really small tables I find that a dumb linear search performs better than hashing, which could be explained by the fact that a cold hash lookup costs at least two cache misses, while linear key access costs only one cache miss and the memory access pattern is very predictable.

As I recall, quadratic probing isn't worthwhile either

Not really surprising, given the more random memory access pattern :)

Hashes are a bit more involved if you require separate memory. Your suggestion appears to be doing it lazily, which is also an option, although it may cause unwanted performance spikes.

I see your point and agree with it. Flatbuffers were designed to be zero-cost to parse, and I think any "required" processing would just defeat their purpose.

mikkelfj commented 7 years ago

I have used chaining for a long time - it is only recently that I started looking more at open addressing. But I am pretty sure the extra memory of chaining does not pay off if you invest in a better load factor instead. But that's not a reason to avoid chaining, which is simpler in most cases - the flatcc builder uses chaining for vtable conflicts.

As to small tables: if we are talking about up to maybe 3000 elements and a highly efficient hash function such as a Knuth prime multiplier on low-valued integer keys (like flatbuffer offsets), then I'm pretty sure linear search won't pay off. But in many cases a simple linear search may be plenty fast, and much simpler to actually get to run fast than other lookup methods.

mikkelfj commented 7 years ago

Hi Sergey, any status on this, and the comment bug?

skhoroshavin commented 7 years ago

Sorry for the delay, I had to speed up development of my project last week, so work on flatcc was postponed a bit. I already fast-fixed the comment bug, and I was actually planning to write tests and implement scan_by tonight. I also found one more bug in the JSON parser; I didn't look closely into it (it only appears when parsing JSON that doesn't follow the schema), but I hope to fix it either this or next night.

mikkelfj commented 7 years ago

Thanks for the update, no time pressure, I just like to know the status of open issues. Because some of the JSON issues are quite serious, I'd like to make a new release before the holidays, but also not release while there are low-hanging updates pending.

It is a bit embarrassing with all these JSON bugs. I didn't write as many tests as I should have, but it was a project that consumed way more time than originally anticipated, so I cut some corners, and I haven't used the feature myself yet. I'm glad you find it useful even though you have to suffer the early adopter consequences.

mikkelfj commented 7 years ago

BTW: I don't mind squashing lots of related commits, but I think it is best to keep separate issues fixed in separate commits for traceability, so the comment bug need not wait for scan_by, for instance.

skhoroshavin commented 7 years ago

I'm glad you find it useful even though you have to suffer the early adopter consequences.

Well, despite the bugs in the JSON parser I still find it very productive to use flatcc in a plain C project, because it provides so many benefits like automatic schema checking and strong typing. Also, I'm actually quite happy that I can help in the development of flatcc; I think it's quite a good way to say thanks for your work :)

Besides, if everything goes well I'm going to use flatcc for another part of my project. I'll need the ability to read fields of an arbitrary flatbuffer given its binary schema, and as far as I understand that's not yet supported by flatcc. In this case I'll have to implement it too; hopefully it will be good enough to be merged as well.

BTW: I don't mind squashing lots of related commits, but I think it is best to keep separate issues fixed in separate commits for traceability, so the comment bug need not wait for scan_by, for instance.

Okay, I'll provide one commit per issue.

mikkelfj commented 7 years ago

I also created flatcc for a reason, so I have some of the same runtime concerns that you do, but it is too early to tell exactly how, and whether it goes public or not. I'd rather keep things out than force a too-specific use case into flatcc itself, but it seems we can collaborate on certain issues whether they go into the core flatcc project or not. I am also considering using dynamic link libraries and automated compilation for accessing flatbuffers online for new schemas. This speaks against working too hard on binary schema lookups if you can generate code to do that work, but again, it depends.

skhoroshavin commented 7 years ago

Before starting work on scan_by I need to clarify its interface. What about:

// Two functions along line of find_by
*_scan_by_field( array, value ) 
*_scan_by_field_n( array, value, length )
// And two functions to allow incremental search
*_scan_by_field_next( from, array, value ) 
*_scan_by_field_n_next( from, array, value, length )

The reasoning for providing two types of functions is that when we want to find just one value we don't have to provide additional parameters, and if we need to iterate over values then a quite idiomatic loop can be used:

for( int i = monster_scan_by_name( monsters, "TwoFace" );
      i != nsc(not_found);
      i = monster_scan_by_name_next( i, monsters, "TwoFace" ) )
{

}

mikkelfj commented 7 years ago

I need some time to inspect this in detail - but the deciding factor is symmetry with the find_by methods.

The _n suffix is only for string types, I presume. Normally there are _str and _str_n suffixes in the API, but it might be implicit in find.

As to the offset being placed first, I'm not really sure, but it is probably a good idea to have a separate next function, as long as it works the same as the version without _next when given offset 0.

skhoroshavin commented 7 years ago

I really meant the _str and _str_n endings, I just forgot to add them. And I feel more like the separate next function should work the same as the version without _next when given offset -1, to allow cleaner iteration code.

skhoroshavin commented 7 years ago

Oops, I've got something mixed up. No _str, since there really is no _str in find_by. I've started by examining monster_test and adding a failing scan_by test along the lines of the find_by test.

mikkelfj commented 7 years ago

Often negative offsets mean searching from the end. Besides, offsets are generally size_t for array indices; it is best to keep it that way in all interfaces.

But I see: if you find a position, you want to present it to the next iteration and exclude that position. I think we need to look a bit at common interfaces such as STL. Either you search inclusively and then add 1 when iterating next, or you define (size_t)-1 to indicate the start, possibly with a flatbuffers_start constant being defined, similar to the invalid search result define. (Or a more suitable similar name.)

Do you find the missing _str in the interface wrong? I probably did it deliberately because it just clutters the interface when the type is already known.

skhoroshavin commented 7 years ago

I'm okay without _str, function names are already quite long. Also, I made these tests before thinking about the _next variant; I'll update them later. Also, to reduce names further, maybe it would be a good idea to do something like

for( int i = monster_scan_by_name( monsters, "TwoFace" );
      i != nsc(not_found);
      i = monster_next_by_name( i, monsters, "TwoFace" ) )
{

}

As for negative indices - I think first and foremost they shouldn't break the nsc(not_found) constant. I'll look into this.

mikkelfj commented 7 years ago

I think the offset should come after the monster argument. Generally, parameter position 0 is the object the method is applied to.

I also believe the offset should be seen as an STL iterator, which means the search is inclusive, even if the iteration is not very clean. http://en.cppreference.com/w/cpp/algorithm/find We could consider an exclusive end argument as well.

The following is just brainstorming:

find(monster, "TwoFace")
find_from(monster, i)
find_in(monster, first, last)
size_t i, k, n;
const char *key = "TwoFace";

for (i = 0, n = 0; ; ++i, ++n) {
    if (nsc(not_found) == (i = monster_find_from(monsters, i, key))) {
        break;
    }
}
test_assert (n == 1); /* ca. */
i = 0, k = monster_len(monsters);
if (nsc(not_found) == (i = monster_find_in(monsters, i, k / 2))) {
    i = monster_find_in(monsters, k/2, k, key)
}
test_assert(i < k);
skhoroshavin commented 7 years ago

Well, I do consider i as an iterator, but for iterative search I find it much more convenient to use a filter_iterator than the std::find you mentioned. Following this concept:

After a second thought it becomes apparent that scan_by_..._next functions could also be useful with find_by, so there might really be a reason to rename them to next_by.

Considering implementation, negative indices can be avoided by implementing find_by_..._in( obj, begin, end, value ) modelled after std::find (there are useful cases for that); then scan_by and next_by become:

int scan_by( obj, value ) { return find_by_in( obj, 0, nsc(not_found), value ); }
int next_by( obj, i, value ) { return find_by_in( obj, i+1, nsc(not_found), value ); }

While next_by is not as efficient as possible in the case of sorted lists, I think it's still convenient enough to actually be used. An efficient method for iterating over multiple elements in a sorted list is std::equal_range, which can also be placed on the todo list.

UPD. Edited some typos

mikkelfj commented 7 years ago

I think we largely agree on the find_by_in underlying concept, except I would say the term find should be reserved for at least O(log N) efficiency in the flatbuffer world (regardless of STL), so I am thinking scan_range_by or scan_in_by (like the SQL world).

I am not quite sure where to place the by word. Find must have some precedence, but up front scan_by_in sounds stupid (which may be ok if it is easy to use), so despite my preference for solid prefixes, there are some options. The by is a suffix due to multiple keys, and the first key has methods with the same name, without the _by suffix, to use the default key. This suggests placing by last.

I think it is safe to implement the find_by_in / find_in_by / scan_range_by method and test that function even if we have not decided on a final name.

The other functions are more open for debate, and it would be good to let it settle for a few days, also I've got a lot of other things going on right now.

The linear scanning methods should use a shared prefix instead of find, scan, next, ... This is how all other methods are designed in order to handle namespace pollution; for example _vec_len is length applied to objects supporting the _vec interface. I then like short suffixes like at, in, by, e.g. _vec_at.

As to next, and whether a scan_at_by / scan_from_by function should be present at all - I'm not totally sold - it results in many functions for a limited use case - but I see the idea. Some iterator concepts have a first, next, but I'm not sure it applies here.

Finally, if we have a range search, it follows that binary search probably should have one too. Note that the internal binary search / sort operations have an inclusive last iterator, as it made the implementation cleaner, but the external interface should have an exclusive last iterator.

mikkelfj commented 7 years ago

Wrt. naming and implementation, please be aware that there are two versions: one that applies to the typed object, and one that applies to the table and member holding that typed object.

Without recalling details, the concept is like monster_evilspawn_find(mon, "TwoFace") vs string_vec_find(s, "TwoFace"). The former is a convenience wrapper for the latter, and the macro implementation handles most of this seamlessly.

skhoroshavin commented 7 years ago

I think we largely agree on the find_by_in underlying concept, except I would say the term find should be reserved for at least O(log N) efficiency in the flatbuffer world (regardless of STL), so I am thinking "scan_range_by" or "scan_in_by" (like the SQL world).

Agreed.

Find must have some precedence, but up front scan_by_in sounds stupid (which may be ok if it is easy to use), so despite my preference for solid prefixes, there are some options. The by is a suffix due to multiple keys, and the first key has methods with the same name, without the _by suffix, to use the default key. This suggests placing by last.

Considering some examples, monster_vec_scan_by_name_in() doesn't look that stupid. For default keys this becomes monster_vec_scan_in(). But renaming in to range and moving things around also makes sense: monster_vec_scan_range_by_name(), monster_vec_scan_range(). Both variants are good enough for me, pick whichever you prefer.

The other functions are more open for debate, and it would be good to let it settle for a few days, also I've got a lot of other things going on right now. The linear scanning methods should use a shared prefix instead of find, scan, next, ... This is how all other methods are designed in order to handle namespace pollution; for example _vec_len is length applied to objects supporting the _vec interface. I then like short suffixes like at, in, by, e.g. _vec_at.

While I agree that find is for less than O(N) and scan is for O(N), I'm still not sure about the need to mark the next function as scan, since it's actually usable with both find and scan, but I see your point. And as for its existence, I think being able to write clear idiomatic code really helps avoid stupid bugs.

Finally, if we have a range search, it follows that binary search probably should have one too. Note that the internal binary search / sort operations have an inclusive last iterator, as it made the implementation cleaner, but the external interface should have an exclusive last iterator.

Actually I'm not really sure that scan_range_by should be made public at all. Maybe it should just be an internal driver for the scan/next functions. This is really a debate between generality and ease of use.

mikkelfj commented 7 years ago

scan_range_by can be __scan_range_by for a start. But it depends on what other public methods appear. However, a __ version should expect the end argument to be len for efficiency reasons. A public version should check the range and map it accordingly.
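A hedged sketch of that split - a public, length-checked wrapper delegating to an unchecked internal version; the string_vec names here are purely illustrative and do not reflect the final generated names:

#include <stddef.h>
#include <string.h>

#define NOT_FOUND ((size_t)-1)

/* Unchecked internal scan over a string array: assumes begin <= end <= len. */
static size_t string_vec__scan_range(const char *const *vec, size_t begin,
                                     size_t end, const char *key)
{
    size_t i;

    for (i = begin; i < end; ++i) {
        if (strcmp(vec[i], key) == 0)
            return i;
    }
    return NOT_FOUND;
}

/* Public version: checks the range and maps it before delegating. */
static size_t string_vec_scan_range(const char *const *vec, size_t len,
                                    size_t begin, size_t end, const char *key)
{
    if (end > len) end = len;       /* clamp end to the actual length */
    if (begin > end) begin = end;   /* degrade to an empty range */
    return string_vec__scan_range(vec, begin, end, key);
}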

As to the naming of next, this also has to do with mapping methods into the namespace of table fields. If a table suddenly has members named {a, a_next}, even though names shouldn't have underscores, there is no guarantee, I think. Therefore the exposed surface should be small and unlikely to trigger collisions. In fact, there was an earlier breaking interface change to move the API towards this consistency.

So I might agree with you on the usability side, but there are other concerns.

mikkelfj commented 7 years ago

There is also a special case when a returned object is null. In this case search should still be valid but fail. This follows naturally from _vec_len on null pointers returning 0. It makes it simpler to handle absent fields in a vector.

skhoroshavin commented 7 years ago

As to the naming of next, this also has to do with mapping methods into the namespace of table fields. If a table suddenly has members named {a, a_next}, even though names shouldn't have underscores, there is no guarantee, I think.

Thanks for pointing that out, I'll think harder.

mikkelfj commented 7 years ago

I'll think harder.

appreciated :)))

On naming, it is critical to follow the lead of the find implementation, it overrides what I say here if in conflict.

monster_vec_scan_range_by_name

seems good to me.

skhoroshavin commented 7 years ago

monster_vec_scan_range_by_name seems good to me

So be it :)

On naming, it is critical to follow the lead of the find implementation, it overrides what I say here if in conflict.

Totally agreed.

Considering the name collision of {a, a_next} with a next function - besides std::next in STL there is the std::advance function, so naming it advance or just adv instead of next might be more acceptable. Other variants could be _next and nxt, which might also reduce the collision chance.

And the most radical possibility is to look for collisions in the code generator itself and either not generate the next iterating function at all, or add underscores. But this is beginning to look like some monstrosity.

mikkelfj commented 7 years ago

I am mostly in favor of just scan and scan_range_by_name. Either you need a simple search, or scan_range does the job cleanly. Adding 1 to the returned result seems simpler than a more complex API that must be documented and understood. Also, it is easy for users to build on top of it for their own purposes.

I think we should have both __scan_by_range (optimised) and scan_by_range (length checked).

Only, it annoys me a bit that scan_range always requires an end argument - nsc(end) could do it.

Alternatively one could have a find first, find next model, but that better suits iterators that cannot be indexed. We could have a scan_at/scan_from, and/or scan_next, but it just starts to become confusing, and even if it is useful, it rubs a bit against the remaining API and, e.g., doesn't map cleanly to binary search. So I won't entirely rule it out in the future, but I'd prefer we start with the above and consider what to do with find_range_by_name.

On the nxt proposal, it still goes against the guiding principle of one prefix for each operational interface, i.e. it might work but is inconsistent. On the _next proposal, there is already _size, so you have mystruct__size() to handle empty structs where sizeof fails on non-GNU C, but that is reserved for near-internal use, and for external use that really needs a consistent interface (code generators).

mikkelfj commented 7 years ago

Argh, +1 doesn't work when not_found is returned. But even with next that is an issue, so next would have to check for invalid indexes.

mikkelfj commented 7 years ago

BTW: I briefly looked through the test cases and they look good. Especially that Gulliver is found in multiple unsorted places. https://github.com/skhoroshavin/flatcc/commit/36682dfed32b97f05a21ec6150f76f0ec08cb449

mikkelfj commented 7 years ago

The test also needs to cover at least one basic vector. There is no string_vec_find or string_vec_find_n test, which there probably should be, but there is a test for uint8_vec_find. All scalars behave largely the same, so one is sufficient.

https://github.com/skhoroshavin/flatcc/blob/36682dfed32b97f05a21ec6150f76f0ec08cb449/test/monster_test/monster_test.c#L1220

skhoroshavin commented 7 years ago

Started implementing scan_by by copying and modifying some find_by pieces, and got the tests passing. I was going to refactor this to have scan_range_by, but I feel like the tests are asking me whether an end parameter is really needed. Modelling std::find is great, but how often will there be cases where end != vec_len, and is it worth the added complexity?

mikkelfj commented 7 years ago

I have the same doubts, but if we don't have it, what would the name be? Just scan, always taking an argument? OTOH if there is a huge vector, e.g. scanning IP ranges where you advance the key to the next block on each search, you want to limit the range and see what is there. This is a sorted scan use case, BTW.

I think we shouldn't rush this - I suggest implementing __scan_range and basing the code on that, then there is time for things to settle.

mikkelfj commented 7 years ago

I'll be mostly offline tomorrow, so we can follow up towards the end of the week.

mikkelfj commented 7 years ago

Right now, I think we need scan_at_by_name, scan_by_name, and scan_range_by_name. There is no next in this, but it still doesn't feel right.

scan_after_by_name would be an alternative to next.

skhoroshavin commented 7 years ago

Okay. Before you go offline - I've discovered one more thing. In our discussion of possible name collisions with next you mentioned that besides the <typename>_vec_find_by_<fieldname> functions there should be <typename>_<tablevecfieldname>_find_by_<fieldname> functions (as well as scan_by variants). But looking at the generated monster_reader.h I don't see the second variant anywhere. Is it a bug, or is it just not yet implemented?

mikkelfj commented 7 years ago

It could be a bug. Many vector ops are buried in a table prolog that adds vector ops even if there is no table vector in the schema, but since find is special and only applies to some vectors, it does not necessarily happen that way. If we add too much logic, perhaps a special prolog is needed, if not already present.

skhoroshavin commented 7 years ago

Okay, I'll look into this. By the way, scan_by and scan_at for string fields seem ready, would you mind merging it upstream?

skhoroshavin commented 7 years ago

After that I'll continue on scan_by for scalars and add scan_range_by, which looks slightly more complicated.

mikkelfj commented 7 years ago

It looks good as far as I can decipher it - on terminology, I would use the term _scan, not _scan_by, in test cases, because scan_by is really a special case of scan for multi-key handling. For the same reason I think a test case is missing for scan using the default key.

Assuming name is the only, or first, key.

if (1 != ns(Monster_vec_scan(monsters, "Joker"))) {
    printf("scan_by did not find the Joker\n");
    goto done;
}

mikkelfj commented 7 years ago

Also, I believe a test case is still missing for basic vector types such as uint8_vec_scan, and ideally also nsc(string_vec_scan), although find doesn't have those tests.

mikkelfj commented 7 years ago

Ah sorry, you did write that scalars are not ready.

I'll merge when you have the default key covered. This closely follows the C++ find concept and should be present.

mikkelfj commented 7 years ago

I didn't look at the generated output, but it seems that a lot of scan functions are generated straight into the reader file. I try to keep the bulk down here, and find is only there because it is special and doesn't consume a lot of space. With the new additions, I think there is a need for a new macro to handle all search-related functions. For example "__%svector_field(%sstring_t, %llu, t, %u)\n" could be replaced with "__%svector_searchable_field(%sstring_t, %llu, t, %u)\n" when a key is present, or it could be an extra macro call for keys.

https://github.com/skhoroshavin/flatcc/blob/2191b1f2ef1f4575a8eda4031e81352a450d5e3c/src/compiler/codegen_c_reader.c#L1175

and here is the location to handle the default key

https://github.com/skhoroshavin/flatcc/blob/2191b1f2ef1f4575a8eda4031e81352a450d5e3c/src/compiler/codegen_c_reader.c#L1212

The default might also need a macro given the number of functions.

skhoroshavin commented 7 years ago

Core functionality implemented, tests fixed to use default key

mikkelfj commented 7 years ago

Looks good, great work - this is another complex part of flatcc that you have mastered.

Sorry for being pedantic, but the default scan method also needs at least a single _by call in the tests, to make sure both are present in case someone messes with things later. (And yes, I might not have done that in my own tests, ...)

Also, as mentioned earlier, all the search logic eventually needs to be collected in its own macro, otherwise it is going to dominate the generated output disproportionately. But it doesn't have to be in this PR.

And the doc also needs to be updated eventually - I think it is in build.md; that isn't really the right place, but it is the only place atm.