fnc12 / sqlite_orm

❤️ SQLite ORM light header only library for modern C++
GNU Affero General Public License v3.0
2.25k stars 312 forks source link

Reverse engineering complex databases - automatically generate storage definition and data structures #450

Open prospectumdev opened 4 years ago

prospectumdev commented 4 years ago

I have databases with many complex tables (lots of columns, foreign keys, range checks etc). Manually creating the storage definition and all the structs would take maybe a week. It is also likely that errors are made in the process.

Is there some kind of parser that generates the structs and the storage definition from an existing SQLite db?

If not, I am tempted to write a SQLite3 -> sqlite_orm definition parser that reads _sqlitemaster and outputs a .h file with all you need in order to use the db with sqlite_orm.

fnc12 commented 4 years ago

Hi. Thanks for using the lib. How many tables do you have? I am just curious. I thought about codegen from SQL query. I want to make a website where you can type an SQL query and press generate and you would get C++ code written with sqlite_orm. E.g. you type:

CREATE TABLE contacts (
    contact_id INTEGER PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name TEXT NOT NULL,
    email TEXT NOT NULL UNIQUE,
    phone TEXT NOT NULL UNIQUE
);

will generate

struct Contact {
    int id = 0;
    std::string firstName;
    std::string lastName;
    std::string email;
    std::string phone;
};
auto storage = make_storage("db.sqlite",
                            make_table("contacts",
                                       make_column("contact_id", &Contact::id, primary_key()),
                                       make_column("first_name", &Contact::firstName),
                                       make_column("last_name", &Contact::lastName),
                                       make_column("email", &Contact::email, unique()),
                                       make_column("phone", &Contact::phone, unique())));

or if you type

SELECT first_name, salary 
FROM emp_master;

you will get

auto rows = storage.select(columns(&EmpMaster::firstName, &EmpMaster::salary));

What do you think?

prospectumdev commented 4 years ago

There should be somewhere around 70 tables (2 different schema definitions). Many of them have several dozens of columns.

The results you describe are exactly what is needed (although it seems that NOT NULL is ignored in the example). That would make sqlite_orm even more efficient to use and attractive!

I suppose that when it comes to details (e.g. the determining the right order of the make_table entities in the presence of foreign keys), one needs to be careful.

May I ask how you are planning to implement this transformation?

fnc12 commented 4 years ago

1) parse a query with sqlite vanilla parser 2) analyze model obtained by parsing 3) generate code or show error with text "this feature is not supported right now"

prospectumdev commented 4 years ago

Sounds good! What priority does this code generation thing have for you?

fnc12 commented 4 years ago
  1. Bugs
  2. Minor features
  3. Codegen
  4. Major features
prospectumdev commented 4 years ago

I will need to have the tables converted in about 2 weeks. Do you think the code generator might be ready then? If not, no problem, then I'll write a simple object oriented parser just for 'create table'.

fnc12 commented 4 years ago

Probably it will be ready for create table

prospectumdev commented 4 years ago

Awesome, thank you so much!

prospectumdev commented 4 years ago

Quick question: How are types translated into sqlite_orm, e.g. UNSIGNED INT(4), CHARACTER(1), VARCHAR(64) etc.? In the case of UNSIGNED INT(4) => int, information is lost so that the databases created by sqlite_orm are not identical to the original ones.

prospectumdev commented 4 years ago

And another Question: here...

struct User{
    ...
    std::unique_ptr<std::string> imageUrl;
    ...
};

...unique_ptr means that imageUrl = nullptr is allowed. Correct?

fnc12 commented 4 years ago

1) https://github.com/fnc12/sqlite_orm/blob/master/include/sqlite_orm/sqlite_orm.h#L164 2) yes, nullable types skip NOT NULL constraint during syncing a schema. Nullable types are std::unique_ptr, std::shared_ptr and std::optional. You can specify any other type as nullable in your application

prospectumdev commented 4 years ago

Thanks! In the meantime I have written a quick and dirty code generator based on the SQLite C/C++ API. It can already create the structs. Next, I'll do the make_storage part. I can make the code available on github if you like.

fnc12 commented 4 years ago

This is cool! But it is not required cause I have already started making injections in sqlite source to make available statement analyzing. I thought about creating custom parser but native parser is always better cause it will parse 100% of queries but non-native will parse <100% queries and must be regularly supported. Anyway parser creation is a good exercise to get amazing experience!

prospectumdev commented 4 years ago

Generated Code:

The compiler says I must use the command line option /bigobj in order to get it to compile. I am using /bigobj but it is still not compiling (Error C1128) :-(

prospectumdev commented 4 years ago

Ok, I put /bigobj into the linker command line options instead of the compiler. Obviously I need a break but it seems that I am almost there. Takes long to compile though.

prospectumdev commented 4 years ago

Ok, I can create an sqlite3 db using sqlite_orm and the generated definition (103 KB of code, I am happy that I did not have to write this manually). Now I need to have a look at the foreign keys. But maybe I should call it day for now :-)

prospectumdev commented 4 years ago

I am stuck with the data type mapping. Seems like it is not possible to create a db definition for sqlite_orm without losing the detailed data type specification of the SQLite db to be modelled. That might be a big dealbreaker because then the Host software is most likely going to refuse to read dbs created with sqlite_orm...

I need to exactly recreate the existing SQLite db. Is this possible?

prospectumdev commented 4 years ago

How does sqlite_orm deal with circular foreign key references? I just discovered that the schema I am currently working with has 7 circular references...

fnc12 commented 4 years ago

@prospectumdev what will be lost? sqlite_orm cannot handle circular references. This is a reason to improve sync_schema call

prospectumdev commented 4 years ago

what will be lost?

The data type definitions as specified in the original databases. For instance, 'UNSIGNED INT(4)' is mapped to 'INT', so that databases created with sqlite_orm will always differ from original ones. However, contrary to my expectations it seems that the host software I am using does not care. So it is not a big problem.

sqlite_orm cannot handle circular references

In my tables there are self references and circular references. is it possible / are you planning to support these references? As for timing, is this considered a minor or a major feature?

Plus it seems that it is not possibly to specify validity checks via the check constraint, like CHECK(CustomDisplay1_PosPixels_Y=0 OR CustomDisplay1_PosPixels_Y BETWEEN 0 AND 8192). The databases I am using heavily rely on checks. Again, the host software does not seem to care but I know from experience that there is a good reason for all those checks to be there.

From my perspective, my assessment of these three issues is the following:

  1. Circular references: Important feature.
  2. Checks: Important feature.
  3. Non-SQLite data type names: Seems to be more of a nice-to-have feature.

This is a reason to improve sync_schema call

So you are planning to add support for circular references and self references? That is great news!

What about check constraints? If sqlite_orm is supposed to do the checking itself, this might be a lot of work. However, why not simply add the check constraints to the database and have SQLite itself do the checking? In that case it should be much easier to add this feature.

As for the data type names: It seems that SQLite allows for data type name aliases to be stored. While I am not sure if it is a good practice to use those at all, SQLite does it and so, for completeness, it might be nice to represent that, too. Or sqlite_orm could explicitly refuse to use data types other than those specified by the 5 storage classes and rely on checks in order to ensure validity. While this is a plausible approach, my experience from product development tells me that for psychological reasons this is likely to have a (small) detrimental effect on the user acceptance of sqlite_orm.

fnc12 commented 4 years ago

1) If you have a database with a column with type UNSIGNED INT(4) and a member mapped to this column with type int the column will never be altered during sync_schema cause UNSIGNED INT(4) maps to int 2) circular references are delayed until someone asks for it. Looks like it is time to implement this feature 3) CHECK constraint is also delayed until someone asks I don't see the problem in different types' names cause names are only aliases. I think I shall implement CHECK constraint first, circular references next and SQL parse after that. I am afraid I can not make it in two weeks so you should use you own parser probably.

prospectumdev commented 4 years ago

Thanks a lot, I'll find a way to get something going that I can work with at the moment. If the host software accepts databases without foreign keys and checks, that's ok for now.

I think you are doing an excellent job with sqlite_orm, and I think it can become very popular once the features are there and it is convenient to use (example: code generation). May I ask you about your plans for sqlite_orm? Is it a private or professional project for you? Are you planning to develop/support it long term?

prospectumdev commented 4 years ago

Feedback: It kind of works: I can now read an existing database, generate the sqlite_orm header file (no foreign keys, no check() constraints) and make sqlite_orm produce a working database. 'working means' that I can insert all the data of the preexisting database into the new one and the new one is accepted by the host software! Unfortunately, when I try to open an existing database with sqlite_orm, it is completely overwritten. The preserve flag (sync_schema(true)) does not help. So it seems that I will have to transfer the data manually at the moment.

fnc12 commented 4 years ago

Data is lost cause something in schema differs. Please show me the results of sync_schema call

prospectumdev commented 4 years ago

My quick and dirty code: https://github.com/prospectumdev/200111_sqlite_orm_codegen

I'll post the results of sync_schema shortly.

prospectumdev commented 4 years ago

sync_schema() returns "old table dropped and recreated" for all tables sync_schema(true) also returns "old table dropped and recreated" for all tables

I found the reason: It is happening because of the data type names. For testing, I modified a small table, setting data type names to SQLite standards, while leaving all other tables alone. Now, for the modified table, the output is "table and storage is already in sync."

Maybe custom data type name support isn't a 'nice to have' feature at all...

prospectumdev commented 4 years ago

Optional data type aliases could solve the problem. Then

make_column("CustomDisplay1_PosPixels_X", &Coupler::CustomDisplay1_PosPixels_X, default_value(0))

turns into

make_column("CustomDisplay1_PosPixels_X", &Coupler::CustomDisplay1_PosPixels_X, "UNSIGNED INT(6)", default_value(0))

If available, the aliases can be used to set table_info[columnid].type when tables are created.

fnc12 commented 4 years ago

what type has Coupler::CustomDisplay1_PosPixels_X?

prospectumdev commented 4 years ago

original db : UNSIGNED INT(6) new db created with sqlite_orm: INTEGER struct created by the code generator: unique_ptr<unsigned int>

fnc12 commented 4 years ago

what C++ type has Coupler::CustomDisplay1_PosPixels_X or what type you expect it to have? int, std::string or what?

prospectumdev commented 4 years ago

unsignedint, as specified in the struct. Or is inta better choice?

fnc12 commented 4 years ago

looks like sqlite_orm doesn't understand that UNSIGNED INT(6) is integer. I shall fix it in a different PR soon

prospectumdev commented 4 years ago

Good news, thanks a lot!

In my case, it is just unsigned int: unique_ptr<unsigned int> CustomDisplay1_PosPixels_X;

fnc12 commented 4 years ago

https://github.com/fnc12/sqlite_orm/pull/453 fix for UNSIGNED INT(6) is on its way

fnc12 commented 4 years ago

bug with UNSIGNED INT(6) is fixed. Now table will not be recreated if you map int or any other integer C++ type to UNSIGNED INT(6)

fnc12 commented 4 years ago

Now I am working on check feature itself

prospectumdev commented 4 years ago

Awesome, thank you so much!

prospectumdev commented 4 years ago

Just tested UNSIGNED INT(...): It is working. sqlite_orm is not dropping those tables any more. However, I discovered that there is the same problem with VARCHAR(...), in my case VARCHAR(32) and VARCHAR(64), and also CHARACTER(...)

Strange, there is std::regex("VARCHAR\\([[:digit:]]+\\)"), in sqlite_orm.h at line 181, mapping VARCHAR to TEXT, so I would expect it to work... Am I missing something?

@character: I am mapping { "character", "string" } in my generator code. When I change that to { "character", "int" }, CHARACTER(...) does not cause tables to be dropped any more. Unfortunately I think I need to map character to a string for the db to work. Plus, according to https://www.sqlite.org/datatype3.html (3.1.1), SQLite interprets CHARACTER as TEXT rather than INTEGER

fnc12 commented 4 years ago

I cannot understand what is going wrong. Please tell me what is your column schema in your database and what C++ class field you want to have to store that column. Example: name VARCHAR(10) NOT NULL in std::string name;

prospectumdev commented 4 years ago

Just a few examples:

"Name"  VARCHAR(64) NOT NULL CHECK(LENGTH(Name)<=64) UNIQUE,
"NumberOfPipes" UNSIGNED INT(3) NOT NULL DEFAULT 61 CHECK(NumberOfPipes BETWEEN 1 AND 128),
"MIDINoteNumberOfFirstPipe" UNSIGNED INT(3) NOT NULL DEFAULT 36 CHECK(MIDINoteNumberOfFirstPipe BETWEEN 0 AND 127),
"IsPercussive"  CHARACTER(1) NOT NULL DEFAULT 'N' CHECK(IsPercussive IN ('Y','N')),

My mapping (for code generation) currently looks like this:

map<string, string> typeLUT{
    { "blob",        "vector<char>" },
    { "character",   "string" },
    { "double",      "double" },
    { "int",         "int" },
    { "integer",     "int" },
    { "real",        "double" },
    { "text",        "string" },
    { "varchar",     "string" },
    { "unsignedint", "unsigned int" },
};
fnc12 commented 4 years ago

CHARACTER(1) maps to TEXT so if your member field has std::string type schema must not be recreated during sync_schema call. Do you still have problems with sync_schema?

prospectumdev commented 4 years ago

I just tested it again, using the latest sqlite_orm code (did a git pull):

Trying to represent CHARACTER(1) by char: Table is still dropped.

fnc12 commented 4 years ago

CHARACTER(1) is TEXT in SQLite. All you can do is make a custom class that holds a char and make bindings to store it as integer (cause single char is integer)

fnc12 commented 4 years ago

or you can store std::string in your C++ code and access it by getter

char getValue() const {
    if(!this->myString.empty()){
        return this->myString.front();
    }else{
        return 0;
    }
}
prospectumdev commented 4 years ago

I am fine with CHARACTER => string or whatever. My problem is that I cannot sync an existing db with sqlite_orm without tables being dropped.

Maybe I should contact the creator of those tables and ask him to stick to standard SQLite storage classes. Data correctness is ensured by all those checks anyway, so there should not be need for exotic (= non SQLite standard) data types.

prospectumdev commented 4 years ago

Ok, I wrote an email to the creator of the schemas I am dealing with, asking him to stick to standard SQLite storage classes. Nevertheless, I think sqlite_orm should be able to handle custom data types. Even if my problems are solved, I assume that at some point the next sqlite_orm user will have the same problem.

fnc12 commented 4 years ago

No need to write to a database creator. CHARACTER(1) is valid storage type in SQLite so all he/she will tell you is IDK CHARACTER(1) is valid type. We need t understand why sync_schema drops your table. sync_schema drops tables when C++ class's field's type is not equal to SQLite types in existing table. Please write here .schema for your table and your C++ class definition like

class MyClass {
    int fieldA;
    std::string fieldB;
    //etc
};
fnc12 commented 4 years ago

Downloaded. Which table is dropped and recreated? @prospectumdev

prospectumdev commented 4 years ago

Coupler -> "old table dropped and recreated" CustomDisplayControlStyle -> "old table dropped and recreated" CustomDisplayKeyboardStyle -> "old table dropped and recreated" CustomDisplayLabel -> "old table dropped and recreated" CustomDisplayPage -> "old table dropped and recreated" Division -> "old table dropped and recreated" Enclosure -> "old table dropped and recreated" ExternalRank -> "table and storage is already in sync." Rank -> "old table dropped and recreated" RequiredInstallationPackage -> "old table dropped and recreated" ShortcutPiston -> "old table dropped and recreated" Stop -> "old table dropped and recreated" StopRank -> "old table dropped and recreated" Tremulant -> "old table dropped and recreated" _General -> "old table dropped and recreated" ValidFixedCodes_AudioSampleRate -> "table and storage is already in sync." ValidFixedCodes_CODM_CouplerCode -> "table and storage is already in sync." ValidFixedCodes_CODM_CustomDisplayPageCode -> "table and storage is already in sync." ValidFixedCodes_CODM_DivisionCode -> "table and storage is already in sync." ValidFixedCodes_CODM_EnclosureCode -> "table and storage is already in sync." ValidFixedCodes_CODM_KeyActionEffectTypeCode -> "table and storage is already in sync." ValidFixedCodes_CODM_KeyActionTypeCode -> "table and storage is already in sync." ValidFixedCodes_CODM_ShortcutPistonActionTypeCode -> "table and storage is already in sync." ValidFixedCodes_CODM_ShortcutPistonReferencedObjectTypeCode -> "table and storage is already in sync." ValidFixedCodes_CODM_StopCode -> "table and storage is already in sync." ValidFixedCodes_CODM_TremulantCode -> "table and storage is already in sync." ValidFixedCodes_CODM_VisualAppearanceCode -> "table and storage is already in sync." ValidFixedCodes_CODM_WindchestLayout -> "table and storage is already in sync." __ValidFixedCodes_ImpulseResponseReverbOriginType -> "table and storage is already in sync." ValidFixedCodes_ImpulseResponseReverbRoomType -> "table and storage is already in sync." __ValidFixedCodes_RankType -> "table and storage is already in sync." __ValidFixedCodes_ReverbTailTruncationMode -> "table and storage is already in sync." done.

fnc12 commented 4 years ago

looks like SQLite type UNSIGNED INT(4) cannot be parsed due to double white space. I need to fix parser to support this case. BTW this is a good example about custom parsers and problems with them =)