rye opened this issue 9 years ago
A quick update: I still think this is an issue, but we can put it off and prioritize it as an improvement.
Definitely, merging models together is important. We have scrapers, and they should include some kind of functionality to handle merging conflicting instances of the same object.
It makes more sense to me to iterate through every instance in a dump-to-merge and, if a model with similar or congruent fields exists in the stored database, determine how to handle it. Objects in the merging dump with identical values (most likely Teams, Robots, and Competitions) would be redirected to attach to their identical twins in the stored database that have the same fields.
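For illustration, the iteration could look something like this. This is purely a sketch: the dump layout and the `id_map` bookkeeping are assumptions, and only `Team` is shown.

```ruby
require "json"

# Sketch only: assumes a JSON dump keyed by collection name ("teams", ...).
id_map = {} # dump ID => ID of the record we kept, used later to fix references

dump = JSON.parse(File.read("dump.json"))

dump.fetch("teams", []).each do |attrs|
  twin = Team.where(number: attrs["number"]).first
  kept = twin || Team.create!(attrs.reject { |key, _| key == "_id" })
  id_map[attrs["_id"]] = kept.id
end
```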
Part of the problem here is that it just makes sense to use a single database continuously. We should really include the solution for this Issue as a side tool, since it's not really a part of FReCon proper.
So I've been thinking about this a bit. There are many interesting ways we could go about taking care of this.
Here's what I think is the best option:
A number of problems with this:
`Hash`es? Is that good in Ruby? If so, comparison of two records is as simple as deleting the IDs off of both and comparing the `Hash`es, if we don't have any special logic to apply.

I'd love to see progress on this since it seems like a big thing to do, but because it is such a big thing to do I'm hesitant to dive into it.
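For what it's worth, that comparison might be as small as this (a sketch, assuming Mongoid documents whose `#attributes` returns a Hash keyed by field name):

```ruby
# Compare two records field-for-field, ignoring the database IDs.
def same_record?(a, b)
  strip_id = ->(record) { record.attributes.reject { |key, _| key == "_id" } }
  strip_id.call(a) == strip_id.call(b)
end
```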
After some hammock time, I think I've settled on some ideas:
Performance really doesn't matter. We can expect the process of merging databases to be a bit lengthy. We should _definitely_ emit debug information.
Commenting first on your comment from October: I don't think we can simply compare hashes for all of these. Do note that we allow custom attributes on every single model, no? What if somebody put a custom attribute on a record in one database but not on the corresponding record in the other? All of a sudden we would see them as unique. Instead, I think we need special rules for every single model that determine whether a given record is a clone of another or not. This could definitely get a bit weird.
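To make the "special rules per model" idea concrete, here's a hypothetical sketch. The `duplicate_of?` method name and the `year` field are assumptions; the point is that only the listed fields count toward identity, so custom attributes are ignored.

```ruby
class Team
  include Mongoid::Document
  field :number, type: Integer

  # A team is the same team if the number matches, no matter what
  # custom attributes either record happens to carry.
  def duplicate_of?(other)
    number == other.number
  end
end

class Competition
  include Mongoid::Document
  field :name, type: String
  field :year, type: Integer # proposed field, not yet in the schema

  def duplicate_of?(other)
    name == other.name && year == other.year
  end
end
```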
I agree that we should not worry about performance. When merging massive databases, I actually expect it to take a decent amount of time. I also agree with most of the items you mentioned, aside from whether we can compare hashes or not: determining which database is the "right" one, loading the database into a massive hash, etc. My final question is whether loading a massive database into one Ruby hash variable could cause memory usage problems. I assume not, since computers are powerful nowadays, but it is worth asking.
I'm proud of myself for finally responding to this. ;)
@Sammidysam, I like your suggestion of adding special rules (through a method) for every model. I also realize now that we can simply create a model via `(Model).new` and use that as an actual representation of the model, not just a silly Hash.
The Hash model I suggested could be extended to improve reliability (e.g. by ignoring custom attributes where they are insignificant), but that would take quite a bit more effort and would be less OOP-y.
It might make sense for us to create some more unique required fields in some of our models (e.g. `year` in `Competition`, `year` in `Robot`) so as to help eliminate potential duplicates, which would otherwise leave us with a disparate database containing both sets of records.
Making a model object sounds like a good means of doing this in a less disgusting way. I assume giving it a hash as an argument will preserve the custom attributes, right? Also, how would we be able to know which custom attributes are insignificant or not (you had mentioned that in your second paragraph)?
For some reason, I really thought that the `year` attribute already existed in `Competition`. Did we just put the year of the competition as part of the name in the past? I guess we cannot really trust the name to actually tell if a competition is unique or not, as some group may abbreviate the name while others may keep it long and full. We would probably need to enforce some of this information to allow for merging databases successfully, as you had mentioned. Maybe we should try to form a list of the information we need to know in order to determine uniqueness.
@Sammidysam I'm referring to creating a model object using our existing Models; `.create` is supposed to create and save the object, whereas `.new` simply creates the model and needs to have `#save` called on it. We would set the attributes either (a) by iterating through all of the attributes and setting them, or (b) by passing a Hash into the `.new` method. I'm not sure if (b) works.
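For reference, the two options look roughly like this (a sketch; whether `.new` accepts a Hash in our Mongoid version is exactly the thing to verify):

```ruby
attrs = { "number" => 1234, "name" => "Example Team" } # hypothetical dump record

# (a) iterate through the attributes and set them one by one
team = Team.new
attrs.each { |field, value| team[field] = value }

# (b) pass the Hash straight into .new
team = Team.new(attrs)

team.save
```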
I think having `year` and such 'absolute' values be emphasized is important for the reason you specify. It is very easy to have different names and locations for competitions.
In short: the comparison logic should be housed within the Model itself. If we can instantiate a model and use our own `ObjectId`, we can then create two simultaneous databases: the one that is actually the database, and a system of `Array`s in memory; a fake database, of sorts.
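That fake database could be as simple as a Hash of model classes to `Array`s of unsaved instances. A sketch, assuming we can instantiate FReCon's models without saving them:

```ruby
# Staging area: instantiated-but-unsaved records, grouped by model class.
staging = Hash.new { |hash, klass| hash[klass] = [] }

staging[Team]        << Team.new(number: 1234)
staging[Competition] << Competition.new(name: "Example Regional")

# Nothing touches Mongo until we decide to persist:
staging.each_value { |records| records.each(&:save!) }
```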
Personally, I would also like to have available some kind of manual process for doing this. Perhaps a `MergeProposal` object would get generated for two given objects, and this could handle the logic behind it all. In most situations, where the two databases are identical in nature and have no glaring differences, it would follow to use some automatic process (which would be the default) where all of these `MergeProposal`s would just automatically get accepted.
To expand, I think a `Merge` object could be created as a master to contain all of the `Proposal`s. It would be initialized with the new database, and could internally generate the various `Proposal`s for each object. (It would try to instantiate each of the objects in the "merging" database, then would ask the "main" database if each of those objects were duplicates or new. If they are new, they get added, else they get merged somehow.)
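Something like the following is what I'm picturing; the names and the `duplicate_of` lookup are entirely tentative:

```ruby
class MergeProposal
  attr_reader :incoming, :existing

  def initialize(incoming, existing)
    @incoming = incoming # record built from the merging database
    @existing = existing # matching record in the main database, or nil
  end

  def duplicate?
    !existing.nil?
  end

  def accept
    duplicate? ? merge_into_existing : incoming.save!
  end

  private

  def merge_into_existing
    # Placeholder: copy over whatever the existing record is missing.
    existing.save!
  end
end

class Merge
  def initialize(incoming_records)
    @proposals = incoming_records.map do |record|
      # `duplicate_of` would be each model's own rule for finding its twin.
      MergeProposal.new(record, record.class.duplicate_of(record))
    end
  end

  # Automatic mode accepts every proposal; an interactive mode could
  # prompt before accepting each one instead.
  def run!
    @proposals.each(&:accept)
  end
end
```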
This may be overcomplicating things, but it might make things very intelligent, which would have great rewards. :)
I agree that having some classes specifically for the process of merging databases would probably be the best place to handle all the logic. In the past, we had scrapers, no? I think we can keep the scrapers and advertise that the difference between a scraper and a merge is that the scraper will simply try to copy over all the data as it does now, whereas a merge will intelligently copy over data (clearly a scraper will be faster, and you will probably want it when you are effectively copying a database). What are your thoughts on maybe keeping our scrapers or not?
I seem to remember giving a Hash to `.new` working in the past, but it will be worth checking on. It isn't really that drastic, as looping through the hash to set values is not that painful by any means.
I think that `Scraper`s and `Merge`s go hand-in-hand. A `Scraper` handles the transactions required for gathering data (be that simply over HTTP or otherwise), but the actual merging of that into the database is done via a `Merge`. Keeping the `Scraper` construct is good.
As with many things, these questions could be asked by working on the spec and filling some stuff out. :wink:
I will check out the functionality of passing a Hash to `.new`.
So I think the one thing we both alluded to needing to do pretty soon was defining the rules and variables needed to determine if models are unique or not. How would you like to put work toward that end? Just write how we want to go about it in this issue?
Yes. We can go ahead and identify these things in this Issue.
Let's begin with a rather hard model, then: competition. Obviously having a year integer is a need, but how do we figure out the other information for determining if it is unique? We do have two strings, `name` and `location`, but both are pretty sketchy to rely on for uniqueness. One idea I had was that we could use the `teams` method that the competition contains. If a competition is in the same year and has exactly the same teams list as another competition, they are the same. Is that safe enough? Obviously the exact same teams could go to two competitions within the same year, but I highly doubt this ever happens. The one fault with this idea, I'd say, is that it relies too heavily on the scouts all accurately reporting every team in a competition (obviously team number is a great way to determine if the teams are the same).
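As a sketch of that rule (it assumes the proposed `year` field and the existing `teams` method):

```ruby
class Competition
  # Same year and exactly the same set of team numbers means the same competition.
  def same_competition?(other)
    year == other.year &&
      teams.map(&:number).sort == other.teams.map(&:number).sort
  end
end
```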
Other ideas?
What we're really talking about here is validation of uniqueness. Teams, for instance, are really easy to validate because we already have a field reserved for that: `number`. So, we just need to come up with such fields for the other models. If an object can be added to the database without violating any uniqueness validations, then we should add it. Then, we're just writing uniqueness validators.
Here are my ideas for which fields represent uniqueness (a validation sketch follows the list):

- `Competition` should use `name` and a nascent `year` field as its unique fields.
- `Robot` should use a nascent `year` field as its unique field.
- `Team` already uses `number` as its unique field.
- `Match`, `Record`, and `Participation` are all subordinate models, I think. As long as the outermost models (`Team`, `Robot`, and `Competition`) are all unique and nice, the internal models should be unique as well.
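In Mongoid terms, that list translates to roughly the following; the `year` fields are the nascent ones and do not exist yet:

```ruby
class Team
  include Mongoid::Document
  field :number, type: Integer
  validates :number, uniqueness: true
end

class Competition
  include Mongoid::Document
  field :name, type: String
  field :year, type: Integer # nascent field
  validates :name, uniqueness: { scope: :year }
end

class Robot
  include Mongoid::Document
  field :year, type: Integer # nascent field
  belongs_to :team
  validates :year, uniqueness: { scope: :team_id } # one robot per team per year
end
```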
I think the uniqueness validation is probably what needs to get done, and really only needs to get done on fields that are required. Thus, if we require certain fields, we can also require their uniqueness if they represent an identifying characteristic of the team.
This turns out to be a highly-requested feature from the people I've talked to at competitions about this project.
First off, my apologies if anything I argue here is off from what had been mentioned. It has been a while.
I would not say merging databases is simply writing uniqueness validators. What is the rule for when two objects are deemed to be the same? Do we prompt the user to choose which one to keep? Since every object in our database can be filled with any amount of extra data the end user wants, we cannot simply take it as a given that one database has precedence over the other. I would suggest merging here is like merging in Git with a merge helper: the user is given the option to keep one or the other, or try to merge the two together.
To comment more directly on your uniqueness proposals:
- Regarding `name` with competition: what if people abbreviate names? It seems like there really isn't a good way.
- `Robot` using a combination of `year` and the number of the team it is attached to sounds great. It will be bad if a team has multiple robots in the same year, though. Is that legal?

To comment on the subordinate models, as you had said: we still have to deal with special cases relating to these objects. For example, let's say we have data for the same competition taken by two different sets of scouts and we want to merge it. The competition would conflict, but we still want to copy over records that exist in one database but not the other. We may need to change the reference for the competition. A similar thing relates to matches and participations, I'd say. One database could have an incomplete dataset, and through merging we strive to achieve a more complete set of data.
A team should never have multiple robots in the same year, because that's completely illegal.
This is slightly off topic of the thread but since Sam brought it up I'll mention it. There have been cases of robots changing configuration or design between competitions, and these changes could be slight or very drastic. This is obviously an unlikely scenario but we'd have to figure out how to deal with these cases as well.
Okay, the plan for robot uniqueness is all :+1: then. I just don't know anything about FRC.
And yeah, that is somewhat outside of the scope of this issue, but definitely an interesting concern. I assume people would want to make a new Robot object, but here the program probably would not permit it.
> There have been cases of robots changing configuration or design between competitions, [...] we'd have to figure out how to deal with these cases as well.
That was precisely the argument for using Participations as the central point where all data about the robot at the particular competition it is attending is stored. Robots do not change much at competitions, and when they do, simply re-pit-scouting them would be acceptable.
One of my points for adding a Year or Season model was that Robot could then become like Participation, but with such a nascent model instead of a Competition. We could use Year/Season to validate the Robot and the Competition that it wants to participate in are of the same game. This would mean that Teams are persistent, Robots connect with Seasons, Participations connect with Competitions, (which would also be dependent on Seasons) and then we could even identify the Record format by attaching it to that Season model as well. This is not the issue to discuss this suggestion, however, but it's food for thought and I'd welcome ideas in other Issues.
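Purely for concreteness, the restructuring described above might look something like this; every field and relation here is part of the proposal, not existing code:

```ruby
class Season
  include Mongoid::Document
  field :year, type: Integer
  field :record_format, type: Hash # identifies the game's Record format

  has_many :robots
  has_many :competitions
end

class Robot
  include Mongoid::Document
  belongs_to :team
  belongs_to :season
end

class Competition
  include Mongoid::Document
  belongs_to :season
  has_many :participations
end
```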
In regards to merging dumps, I refer back to my previous idea: treat the extant database as the master, and the merging database as the subordinate. We could also make it such that the merging mechanism asks questions about which competitions to merge with which, and I do agree that this process should be very interactive so as to prevent errors.
That's right, I forgot that we made that change this year as far as the implementation goes. Sorry for derailing; carry on.
Yeah, let's make it interactive when there are conflicts. What definition of nascent here are you using by the way? I've seen it a lot and the dictionary isn't helping me.
On the Robot note, I forgot that Participations share details about the Robot at the given Competition. Since that takes place already, I do not think we have anything to fix.
> What definition of nascent here are you using by the way?
I was referring to the proposed models. Apologies for the confusing wording.
I think adding a Year/Season model was discussed in #80 or something like that. I'd be interested to see if that's a valid idea now.
I was asking what the word meant. You're saying it means proposed?
@Sammidysam "Nascent" means "just coming into existence and just beginning to display signs of future potential." I like fancy words. 😀 Since I was referring to these actions in a conditional or optional state, I figured I could use nascent to describe the Year/Season model since that is what would be just coming into existence with all of these changes.
One other thing that I would mention as a possibility: we could also have FReCon import (by scraping) the database into another Mongo database on the system before merging. This would allow us to deal with associations, but it could make things quite messy.
How does that help us deal with associations versus importing JSON? That we can use the Mongoid relations methods?
@Sammidysam Yes, I believe so. We could also rename all the `id` fields to `old_id` and then do relations simply by querying by `old_id` instead of `id`. That's funky too, though.
I don't think we would need to do that if the other data is stored in a fully separate database. I think when we are replacing references, all we really need to do is have a sort of dictionary effectively (sorry that I'm using the Python term) where we map IDs to new IDs under the given models. For example, let's say Competition A conflicts with Competition B. We are replacing references of Competition B with A. All of the Matches then look into the dictionary, see that B goes to A, and replace correspondingly.
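Concretely, the mapping could just be a plain Hash; a sketch, with the variable names made up:

```ruby
# { old ID from the merging database => ID of the record we kept }
id_map = {}

# When Competition B is deemed a duplicate of Competition A:
id_map[competition_b_old_id] = competition_a.id

# Later, while copying a match over, rewrite its competition reference
# (falling back to the original ID when there was no conflict):
match_attrs["competition_id"] =
  id_map.fetch(match_attrs["competition_id"], match_attrs["competition_id"])
```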
That would make sense, and that was the initial suggestion as I recall. It would be much lighter and probably easier to implement.
We have this problem (as brought to my attention by @vmai and @Sammidysam) where it is currently impossible to merge dumps together, given that teams might be redundant across the databases (i.e. if a team participated in multiple competitions). What this effectively means is this: multiple different Teams might be created with the same number, which is incorrect.
My proposed solution is to merge conflicts like this as follows: when multiple Teams are trying to be created (if there are duplicates), use the one that's already in the database. Take all Participations that reference the wrong one, and make them reference the Team that's already in the database. This is ugly, because you have to find all participations with a dangling reference to something that doesn't exist, but it just makes sense.
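A sketch of that fix-up step, assuming the dump's records were already imported with their original IDs (so the dangling `team_id` references are queryable); the `incoming_*` variables are made up:

```ruby
existing = Team.where(number: incoming_team_number).first
if existing
  # Re-point every Participation that still references the duplicate Team.
  Participation.where(team_id: incoming_team_old_id)
               .update_all(team_id: existing.id)
else
  Team.create!(number: incoming_team_number)
end
```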
Also, a point was made about conflicting matches, but I don't think we need to do anything with those, since they need to be unique across competitions.