rye opened this issue 9 years ago
A quick update: I still think this is an issue, but we can put it off and prioritize it as an improvement.
Definitely, merging models together is important. We have scrapers, and they should include some kind of functionality to handle merging conflicting instances of the same object.
It makes more sense to me to iterate through every instance in a dump-to-merge and, if a model with similar or congruent fields exists in the stored database, determine how to handle it. Objects in the merging dump with identical values (most likely Teams, Robots, and Competitions) would be redirected to attach to their identical twins in the stored database that have the same fields.
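For illustration, the iteration could look something like this. This is purely a sketch: the dump layout and the `id_map` bookkeeping are assumptions, and only `Team` is shown.

```ruby
require "json"

# Sketch only: assumes a JSON dump keyed by collection name ("teams", ...).
id_map = {} # dump ID => ID of the record we kept, used later to fix references

dump = JSON.parse(File.read("dump.json"))

dump.fetch("teams", []).each do |attrs|
  twin = Team.where(number: attrs["number"]).first
  kept = twin || Team.create!(attrs.reject { |key, _| key == "_id" })
  id_map[attrs["_id"]] = kept.id
end
```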
Part of the problem here is that it just makes sense to use a single database continuously. We should really include the solution for this Issue as a side tool, since it's not really a part of FReCon proper.
So I've been thinking about this a bit. There are many interesting ways we could go about taking care of this.
Here's what I think is the best option:
A number of problems with this:
`Hash`es? Is that good in Ruby? If so, comparison of two records is as simple as deleting the IDs off of both and comparing the `Hash`es, if we don't have any special logic to apply.

I'd love to see progress on this since it seems like a big thing to do, but because it is such a big thing to do I'm hesitant to dive into it.
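For what it's worth, that comparison might be as small as this (a sketch, assuming Mongoid documents whose `#attributes` returns a Hash keyed by field name):

```ruby
# Compare two records field-for-field, ignoring the database IDs.
def same_record?(a, b)
  strip_id = ->(record) { record.attributes.reject { |key, _| key == "_id" } }
  strip_id.call(a) == strip_id.call(b)
end
```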
After some hammock time, I think I've settled on some ideas:
Performance really doesn't matter. We can expect the process of merging databases to be a bit lengthy. We should _definitely_ emit debug information.
Commenting first on your comment from October: I don't think we can simply compare hashes for all of these. Do note that we allow custom attributes on every single model, no? What if somebody put a custom attribute on a record in one database but not on the corresponding record in the other? All of a sudden we would see them as unique. Instead, I think we need special rules for every single model that determine whether a given record is a clone of another or not. This could definitely get a bit weird.
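To make the "special rules per model" idea concrete, here's a hypothetical sketch. The `duplicate_of?` method name and the `year` field are assumptions; the point is that only the listed fields count toward identity, so custom attributes are ignored.

```ruby
class Team
  include Mongoid::Document
  field :number, type: Integer

  # A team is the same team if the number matches, no matter what
  # custom attributes either record happens to carry.
  def duplicate_of?(other)
    number == other.number
  end
end

class Competition
  include Mongoid::Document
  field :name, type: String
  field :year, type: Integer # proposed field, not yet in the schema

  def duplicate_of?(other)
    name == other.name && year == other.year
  end
end
```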
I agree that we should not worry about performance. When merging massive databases, I actually expect it to take a decent amount of time. I also agree with most of the items you mentioned, aside from whether we can compare hashes or not: determining which database is the "right" one, loading the database into a massive hash, etc. My final question is whether loading a massive database into one Ruby hash variable could cause memory usage problems. I assume not, since computers are powerful nowadays, but it is worth asking.
I'm proud of myself for finally responding to this. ;)
@Sammidysam, I like your suggestion of adding special rules (through a method) for every model. I also realize now that we can simply create a model via `(Model).new` and use that as an actual representation of the model, not just a silly Hash.
The Hash model I suggested could be extended to improve reliability (e.g. by ignoring custom attributes where they are insignificant), but that would take quite a bit more effort and would be less OOP-y.
It might make sense for us to create some more unique required fields in some of our models (e.g. `year` in `Competition`, `year` in `Robot`) so as to help eliminate potential duplicates, which would otherwise leave us with a disparate database containing both sets of records.
Making a model object sounds like a good means of doing this in a less disgusting way. I assume giving it a hash as an argument will preserve the custom attributes, right? Also, how would we be able to know which custom attributes are insignificant or not (you had mentioned that in your second paragraph)?
For some reason, I really thought that the `year` attribute already existed in `Competition`. Did we just put the year of the competition as part of the name in the past? I guess we cannot really trust the name to actually tell if a competition is unique or not, as some group may abbreviate the name while others may keep it long and full. We would probably need to enforce some of this information to allow for merging databases successfully, as you had mentioned. Maybe we should try to form a list of the information we need to know in order to determine uniqueness.
@Sammidysam I'm referring to creating a model object using our existing Models; `.create` is supposed to create and save the object, whereas `.new` simply creates the model and needs to have `#save` called on it. We would set the attributes either (a) by iterating through all of the attributes and setting them, or (b) by passing a Hash into the `.new` method. I'm not sure if (b) works.
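For reference, the two options look roughly like this (a sketch; whether `.new` accepts a Hash in our Mongoid version is exactly the thing to verify):

```ruby
attrs = { "number" => 1234, "name" => "Example Team" } # hypothetical dump record

# (a) iterate through the attributes and set them one by one
team = Team.new
attrs.each { |field, value| team[field] = value }

# (b) pass the Hash straight into .new
team = Team.new(attrs)

team.save
```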
I think having `year` and such 'absolute' values be emphasized is important for the reason you specify. It is very easy to have different names and locations for competitions.
In short: the comparison logic should be housed within the Model itself. If we can instantiate a model and use our own `ObjectId`, we can then create two simultaneous databases: the one that is actually the database, and a system of `Array`s in memory; a fake database, of sorts.
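That fake database could be as simple as a Hash of model classes to `Array`s of unsaved instances. A sketch, assuming we can instantiate FReCon's models without saving them:

```ruby
# Staging area: instantiated-but-unsaved records, grouped by model class.
staging = Hash.new { |hash, klass| hash[klass] = [] }

staging[Team]        << Team.new(number: 1234)
staging[Competition] << Competition.new(name: "Example Regional")

# Nothing touches Mongo until we decide to persist:
staging.each_value { |records| records.each(&:save!) }
```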
Personally, I would also like to have available some kind of manual process for doing this. Perhaps a `MergeProposal` object would get generated for two given objects, and this could handle the logic behind it all. In most situations, where the two databases are identical in nature and have no glaring differences, it would follow to use some automatic process (which would be the default) where all of these `MergeProposal`s would just automatically get accepted.
To expand, I think a `Merge` object could be created as a master to contain all of the `Proposal`s. It would be initialized with the new database, and could internally generate the various `Proposal`s for each object. (It would try to instantiate each of the objects in the "merging" database, then would ask the "main" database if each of those objects were duplicates or new. If they are new, they get added, else they get merged somehow.)
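Something like the following is what I'm picturing; the names and the `duplicate_of` lookup are entirely tentative:

```ruby
class MergeProposal
  attr_reader :incoming, :existing

  def initialize(incoming, existing)
    @incoming = incoming # record built from the merging database
    @existing = existing # matching record in the main database, or nil
  end

  def duplicate?
    !existing.nil?
  end

  def accept
    duplicate? ? merge_into_existing : incoming.save!
  end

  private

  def merge_into_existing
    # Placeholder: copy over whatever the existing record is missing.
    existing.save!
  end
end

class Merge
  def initialize(incoming_records)
    @proposals = incoming_records.map do |record|
      # `duplicate_of` would be each model's own rule for finding its twin.
      MergeProposal.new(record, record.class.duplicate_of(record))
    end
  end

  # Automatic mode accepts every proposal; an interactive mode could
  # prompt before accepting each one instead.
  def run!
    @proposals.each(&:accept)
  end
end
```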
This may be overcomplicating things, but it might make things very intelligent, which would have great rewards. :)
I agree that having some classes specifically for the process of merging databases would probably be the best place to handle all the logic. In the past, we had scrapers, no? I think we can keep the scrapers and advertise that the difference between a scraper and a merge is that the scraper will simply try to copy over all the data as it does now, whereas a merge will intelligently copy over data (clearly a scraper will be faster, and you will probably want it when you are effectively copying a database). What are your thoughts on maybe keeping our scrapers or not?
I seem to remember giving a Hash to `.new` working in the past, but it will be worth checking on. It isn't really that drastic, as looping through the hash to set values is not that painful by any means.
I think that `Scraper`s and `Merge`s go hand-in-hand. A `Scraper` handles the transactions required for gathering data (be that simply over HTTP or otherwise), but the actual merging of that into the database is done via a `Merge`. Keeping the `Scraper` construct is good.
As with many things, these questions could be asked by working on the spec and filling some stuff out. :wink:
I will check out the functionality of passing a Hash to `.new`.
So I think the one thing we both alluded to needing to do pretty soon was defining the rules and variables needed to determine if models are unique or not. How would you like to put work toward that end? Just write how we want to go about it in this issue?
Yes. We can go ahead and identify these things in this Issue.
Let's begin with a rather hard model, then: competition. Obviously having a year integer is a need, but how do we figure out the other information for determining if it is unique? We do have two strings, `name` and `location`, but both are pretty sketchy to rely on for uniqueness. One idea I had was that we could use the `teams` method that the competition contains. If a competition is in the same year and has exactly the same teams list as another competition, they are the same. Is that safe enough? Obviously the exact same teams could go to two competitions within the same year, but I highly doubt this ever happens. The one fault with this idea, I'd say, is that it relies too heavily on the scouts all accurately reporting every team in a competition (obviously team number is a great way to determine if the teams are the same).
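As a sketch of that rule (it assumes the proposed `year` field and the existing `teams` method):

```ruby
class Competition
  # Same year and exactly the same set of team numbers means the same competition.
  def same_competition?(other)
    year == other.year &&
      teams.map(&:number).sort == other.teams.map(&:number).sort
  end
end
```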
Other ideas?
What we're really talking about here is validation of uniqueness. Teams, for instance, are really easy to validate because we already have a field reserved for that: `number`. So, we just need to come up with such fields for the other models. If an object can be added to the database without violating any uniqueness validations, then we should add it. Then, we're just writing uniqueness validators.
Here are my ideas for which fields represent uniqueness (a validation sketch follows the list):

- `Competition` should use `name` and a nascent `year` field as its unique fields.
- `Robot` should use a nascent `year` field as its unique field.
- `Team` already uses `number` as its unique field.
- `Match`, `Record`, and `Participation` are all subordinate models, I think. As long as the outermost models (`Team`, `Robot`, and `Competition`) are all unique and nice, the internal models should be unique as well.
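In Mongoid terms, that list translates to roughly the following; the `year` fields are the nascent ones and do not exist yet:

```ruby
class Team
  include Mongoid::Document
  field :number, type: Integer
  validates :number, uniqueness: true
end

class Competition
  include Mongoid::Document
  field :name, type: String
  field :year, type: Integer # nascent field
  validates :name, uniqueness: { scope: :year }
end

class Robot
  include Mongoid::Document
  field :year, type: Integer # nascent field
  belongs_to :team
  validates :year, uniqueness: { scope: :team_id } # one robot per team per year
end
```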
I think the uniqueness validation is probably what needs to get done, and really only needs to get done on fields that are required. Thus, if we require certain fields, we can also require their uniqueness if they represent an identifying characteristic of the team.
This turns out to be a highly-requested feature from the people I've talked to at competitions about this project.
First off, my apologies if anything I argue here is off from what had been mentioned. It has been a while.
I would not say merging databases is simply writing uniqueness validators. What is the rule for when two objects are deemed to be the same? Do we prompt the user to choose which one to keep? Since every object in our database can be filled with any amount of extra data the end user wants, we cannot simply take it as a given that one database has precedence over the other. I would suggest merging here is like merging in Git with a merge helper: the user is given the option to keep one or the other, or try to merge the two together.
To comment more directly on your uniqueness proposals:
- Regarding `name` with competition: what if people abbreviate names? It seems like there really isn't a good way.
- `Robot` using a combination of `year` and the number of the team it is attached to sounds great. It will be bad if a team has multiple robots in the same year, though. Is that legal?

To comment on the subordinate models, as you had said: we still have to deal with special cases relating to these objects. For example, let's say we have data for the same competition taken by two different sets of scouts and we want to merge it. The competition would conflict, but we still want to copy over records that exist in one database but not the other. We may need to change the reference for the competition. A similar thing relates to matches and participations, I'd say. One database could have an incomplete dataset, and through merging we strive to achieve a more complete set of data.
A team should never have multiple robots in the same year, because that's completely illegal.
This is slightly off topic of the thread but since Sam brought it up I'll mention it. There have been cases of robots changing configuration or design between competitions, and these changes could be slight or very drastic. This is obviously an unlikely scenario but we'd have to figure out how to deal with these cases as well.
Okay, the plan for robot uniqueness is all :+1: then. I just don't know anything about FRC.
And yeah, that is somewhat outside of the scope of this issue, but definitely an interesting concern. I assume people would want to make a new Robot object, but here the program probably would not permit it.
> There have been cases of robots changing configuration or design between competitions, [...] we'd have to figure out how to deal with these cases as well.
That was precisely the argument for using Participations as the central point where all data about the robot at the particular competition it is attending is stored. Robots do not change much at competitions, and when they do, simply re-pit-scouting them would be acceptable.
One of my points for adding a Year or Season model was that Robot could then become like Participation, but with such a nascent model instead of a Competition. We could use Year/Season to validate the Robot and the Competition that it wants to participate in are of the same game. This would mean that Teams are persistent, Robots connect with Seasons, Participations connect with Competitions, (which would also be dependent on Seasons) and then we could even identify the Record format by attaching it to that Season model as well. This is not the issue to discuss this suggestion, however, but it's food for thought and I'd welcome ideas in other Issues.
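Purely for concreteness, the restructuring described above might look something like this; every field and relation here is part of the proposal, not existing code:

```ruby
class Season
  include Mongoid::Document
  field :year, type: Integer
  field :record_format, type: Hash # identifies the game's Record format

  has_many :robots
  has_many :competitions
end

class Robot
  include Mongoid::Document
  belongs_to :team
  belongs_to :season
end

class Competition
  include Mongoid::Document
  belongs_to :season
  has_many :participations
end
```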
In regards to merging dumps, I refer back to my previous idea: treat the extant database as the master, and the merging database as the subordinate. We could also make it such that the merging mechanism asks questions about which competitions to merge with which, and I do agree that this process should be very interactive so as to prevent errors.
That's right, I forgot that we made that change this year as far as the implementation goes. Sorry for derailing; carry on.
Yeah, let's make it interactive when there are conflicts. What definition of nascent here are you using by the way? I've seen it a lot and the dictionary isn't helping me.
On the Robot note, I forgot that Participations share details about the Robot at the given Competition. Since that takes place already, I do not think we have anything to fix.
> What definition of nascent here are you using by the way?
I was referring to the proposed models. Apologies for the confusing wording.
I think adding a Year/Season model was discussed in #80 or something like that. I'd be interested to see if that's a valid idea now.
I was asking what the word meant. You're saying it means proposed?
@Sammidysam "Nascent" means "just coming into existence and just beginning to display signs of future potential." I like fancy words. 😀 Since I was referring to these actions in a conditional or optional state, I figured I could use nascent to describe the Year/Season model since that is what would be just coming into existence with all of these changes.
One other thing that I would mention as a possibility: we could also have FReCon import (by scraping) the database into another Mongo database on the system before merging. This would allow us to deal with associations, but it could make things quite messy.
How does that help us deal with associations versus importing JSON? That we can use the Mongoid relations methods?
@Sammidysam Yes, I believe so. We could also rename all the `id` fields to `old_id` and then do relations simply by querying by `old_id` instead of `id`. That's funky too, though.
I don't think we would need to do that if the other data is stored in a fully separate database. I think when we are replacing references, all we really need to do is have a sort of dictionary effectively (sorry that I'm using the Python term) where we map IDs to new IDs under the given models. For example, let's say Competition A conflicts with Competition B. We are replacing references of Competition B with A. All of the Matches then look into the dictionary, see that B goes to A, and replace correspondingly.
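Concretely, the mapping could just be a plain Hash; a sketch, with the variable names made up:

```ruby
# { old ID from the merging database => ID of the record we kept }
id_map = {}

# When Competition B is deemed a duplicate of Competition A:
id_map[competition_b_old_id] = competition_a.id

# Later, while copying a match over, rewrite its competition reference
# (falling back to the original ID when there was no conflict):
match_attrs["competition_id"] =
  id_map.fetch(match_attrs["competition_id"], match_attrs["competition_id"])
```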
That would make sense, and that was the initial suggestion as I recall. It would be much lighter and probably easier to implement.
We have this problem (as brought to my attention by @vmai and @Sammidysam) where it is currently impossible to merge dumps together, given that teams might be redundant across the databases (i.e. if a team participated in multiple competitions). What this effectively means is this: multiple different Teams might be created with the same number, which is incorrect.
My proposed solution is to merge conflicts like this as follows: when multiple Teams are trying to be created (if there are duplicates), use the one that's already in the database. Take all Participations that reference the wrong one, and make them reference the Team that's already in the database. This is ugly, because you have to find all participations with a dangling reference to something that doesn't exist, but it just makes sense.
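A sketch of that fix-up step, assuming the dump's records were already imported with their original IDs (so the dangling `team_id` references are queryable); the `incoming_*` variables are made up:

```ruby
existing = Team.where(number: incoming_team_number).first
if existing
  # Re-point every Participation that still references the duplicate Team.
  Participation.where(team_id: incoming_team_old_id)
               .update_all(team_id: existing.id)
else
  Team.create!(number: incoming_team_number)
end
```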
Also, a point was made about conflicting matches, but I don't think we need to do anything with those, since they need to be unique across competitions.